Microarchitectural Performance Characterization of Irregular GPU Kernels

Molly A. O’Neil and Martin Burtscher
Department of Computer Science
Introduction
 
GPUs as general-purpose accelerators
  Ubiquitous in high performance computing
  Spreading in PCs and mobile devices
Performance and energy efficiency benefits… when code is well-suited!
Regular (input independent) vs. irregular (input determines control flow and memory accesses)
Lots of important irregular algorithms
  More difficult to parallelize; map less intuitively to GPUs
 
Outline
 
Impact on GPU performance characteristics of…
  Branch divergence
  Memory coalescing
  Cache and memory latency
  Cache and memory bandwidth
  Cache size
First, review GPU coding best practices for good performance
 
Best Practice #1: No Divergence
To execute in parallel, threads in a warp must share identical control flow
If not, execution is serialized into smaller groups of threads that do share a control-flow path → branch divergence
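As an illustration, here is a minimal CUDA sketch (ours, not from the paper) contrasting a data-dependent branch that can diverge within a warp with an equivalent form the compiler can typically predicate:

```cuda
// Minimal sketch (ours, not from the paper). The first kernel's
// data-dependent branch can split a warp into two serialized groups;
// the second computes the same result as a select, keeping all lanes active.
__global__ void divergent(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > 0)            // lanes disagree -> both paths execute,
            out[i] = 2 * in[i];   // each with a subset of lanes enabled
        else
            out[i] = -in[i];
    }
}

__global__ void uniform(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = in[i];
        out[i] = (v > 0) ? 2 * v : -v;  // select, not branch: no divergence
    }
}
```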
 
Best Practice #2: Coalescing
 
 
Memory accesses within a warp must be coalesced
  Within a warp, memory references must fall within the same cache line
  If not, accesses to additional lines are serialized
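A minimal CUDA sketch of the difference (ours, not from the paper); it assumes `in` holds at least n floats for the first kernel and n*stride floats for the second:

```cuda
// In the first kernel, the 32 threads of a warp read 32 consecutive words,
// which fall into one cache line (or two, if misaligned); in the second,
// a stride scatters the warp's references across many lines, and the
// accesses to the additional lines are serialized.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];              // thread k touches word k: coalesced
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];     // stride > 1: up to 32 distinct lines
}
```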
 
Best Practice #3: Load Balance
 
 
Balance work between warps, threads, and thread blocks
All three are difficult for irregular codes
  Data-dependent behavior makes it hard to assign work to threads while achieving coalescing, identical control flow, and load balance (see the sketch below)
Very different from CPU code considerations
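To make the difficulty concrete, here is a hypothetical CSR-style graph kernel in the spirit of BFS/SSSP (ours, not LonestarGPU code): the per-thread loop's trip count and the neighbor addresses it loads both depend on the input graph, so control flow, coalescing, and balance all become data-dependent.

```cuda
// Hypothetical CSR relaxation kernel (ours, not from LonestarGPU).
// rowptr/col encode the graph in compressed sparse row form.
__global__ void relax_edges(const int *rowptr, const int *col,
                            const int *dist, int *newdist, int n)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n) {
        int d = dist[v];
        // Trip count = degree of v: a high-degree node stalls its warp.
        for (int e = rowptr[v]; e < rowptr[v + 1]; e++) {
            int u = col[e];                  // data-dependent -> uncoalesced
            atomicMin(&newdist[u], d + 1);   // irregular, conflicting writes
        }
    }
}
```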
 
Simulation Study
 
Want to better understand irregular apps’ specific demands on GPU hardware
  To help software developers optimize irregular codes
  As a baseline for exploring hardware support for broader classes of codes
GPGPU-Sim v3.2.1 + a few extra perf. counters
  GTX 480 (Fermi) configuration
  Added configuration variants to scale latency, bandwidth, cache size, etc.
 
 
Applications from LonestarGPU Suite
 
Breadth-First Search (BFS)
  Label each node in graph with min level from start node
Barnes-Hut (BH)
  N-body algorithm using octree to decompose space around bodies
Mesh Refinement (DMR)
  Iteratively transform ‘bad’ triangles by retriangulating surrounding cavity
Minimum Spanning Tree (MST)
  Contract minimum edge until single node
Single-Source Shortest Paths (SSSP)
  Find shortest path to each node from source

Applications from Other Sources

Semi-Regular
  FP Compression (FPC)
    Lossless data compression for DP floating-point values
    Irregular control flow
  Traveling Salesman (TSP)
    Find minimal tour in graph using iterative hill climbing
    Irregular memory accesses
Regular
  N-Body (NB)
    N-body algorithm using all-to-all force calculation
  Monte Carlo (MC)
    Evaluates fair call price for set of options
    CUDA SDK version
Inputs result in working set ≥5 times default L2 size
 
Application Performance
 
 
Peak = 480 IPC
As expected, regular mostly means better performing
  BH is the exception: primary kernel regularized
Clear tendency for lower IPCs in irregular codes
But no simple rule to delineate regular vs. irregular
Branch Divergence
 
Active instructions at warp issue
  32 = no divergence
  Only one code <50% occupied
Theoretical speedup
  Assumes each issue had 32 active insts.
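Written out (our paraphrase of the metric described above, not a formula from the slides):

$$\text{speedup}_{\max} = \frac{32}{\overline{a}}, \qquad \overline{a} = \text{average active threads per issued warp instruction}$$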
Memory Coalescing
 
Avg. # of memory accesses by each global/local ld/st
  >1 = uncoalesced
Percentage of stalls due to uncoalesced accesses
  Provides an upper bound on speedup
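One plausible reading of that bound (our inference, assuming the uncoalesced-access stalls could be eliminated entirely and nothing else changes):

$$\text{speedup}_{\max} = \frac{1}{1 - f}, \qquad f = \text{fraction of cycles stalled on uncoalesced accesses}$$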
 
Memory Coalescing
 
 
New configuration to artificially remove the pipeline stall penalty from non-coalesced accesses
  Variants with no further improvements to the memory pipeline and with increased-capacity miss queues and MSHRs
  Not intended to model a realistic improvement
 
L2 and DRAM Latency
 
 
Scaled L2 hit and DRAM access latencies
  Doubled, halved, zeroed
Most benchmarks more sensitive to L2 latency
  Even with input sizes several times the L2 capacity
 
Interconnect and DRAM Bandwidth
 
 
Halved/doubled interconnect (L2) bandwidth and DRAM bus width
Benchmark sensitivities similar to latency results
  L2 large enough to keep sufficient warps ready
Cache Behavior
 
Very high miss ratios (generally >50% in L1)
Irregular codes have much greater MPKI
  BFS & SSSP: lots of pointer-chasing, little spatial locality
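Here MPKI is the standard misses-per-kilo-instruction metric:

$$\text{MPKI} = \frac{\text{cache misses} \times 1000}{\text{instructions executed}}$$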
 
Cache Size Scaling
 
 
Halved, doubled both (data) cache sizes
Codes sensitive to interconnect bandwidth are also sensitive to L1D size
  BH tree prefixes: L2 better at exploiting locality in traversals
Most codes hurt more by smaller L2 than L1D
Individual Application Analysis
 
Large memory access penalty in irregular apps
Divergence penalty less than we expected
Synchronization penalty also below expectation
Regular codes have mostly fully-occupied cycles
  Computation pipeline hazards (rather than load/store)
Conclusions
 
Irregular codes
  More load imbalance, branch divergence, and uncoalesced memory accesses than regular codes
  Less branch divergence, synchronization, and atomics penalty than we expected
    Software designers successfully addressing these issues
To support irregular codes, architects should focus on improving memory-related slowdowns
  Improving L2 latency/bandwidth more important than improving DRAM latency/bandwidth
 
Questions?
 
Acknowledgments
NSF Graduate Research Fellowship grant 1144466
NSF grants 1141022, 1217231, and 1438963
Grants and gifts from NVIDIA Corporation
 
 
Related Work
 
Simulator-based characterization studies
  Bakhoda et al. (ISPASS’09), Goswami et al. (IISWC’10), Blem et al. (EAMA’11), Che et al. (IISWC’10), Lee and Wu (ISPASS’14)
    CUDA SDK, Rodinia, Parboil (no focus on irregularity)
  Meng et al. (ISCA’10) – dynamic warp hardware modification
PTX emulator studies (also SDK, Rodinia, Parboil)
  Kerr et al. (IISWC’09) – GPU Ocelot, Wu et al. (CACHES’11)
Hardware performance counters
  Burtscher et al. (IISWC’12) – LonestarGPU, Che et al. (IISWC’13)
 
 
Input Sizes
 
BFS   NYC road network (~264K nodes, ~734K edges; working set = 3898 kB = 5.08x L2 size)
      RMAT graph (250K nodes, 500K edges)
BH    494K bodies, 1 time step (working set = 7718 kB = 10.05x L2 size)
DMR   50.4K nodes, ~100.3K triangles, maxfactor = 10 (working set w/ maxfactor 10 = 7840 kB = 10.2x L2 size)
      30K nodes, 60K triangles
MST   NYC road network (~264K nodes, ~734K edges; working set = 3898 kB = 5.08x L2 size)
      RMAT graph (250K nodes, 500K edges)
SSSP  NYC road network (~264K nodes, ~734K edges; working set = 3898 kB = 5.08x L2 size)
      RMAT graph (250K nodes, 500K edges)
FPC   obs_error dataset (60 MB), 30 blocks, 24 warps/block
      num_plasma dataset (34 MB), 30 blocks, 24 warps/block
TSP   att48 (48 cities, 15K climbers)
      eil51 (51 cities, 15K climbers)
NB    23,040 bodies, 1 time step
MC    256 options
 
Secondary Inputs
 
 
GPGPU-Sim Configurations
 
Configuration           ROP   DRAM  Ict  Bus  CP   L1D Sz(PS)/Sz(PL)/MQ/MS/MM   L2 Size/MQ/MS/MM
Default                 240   200   32   4    Y    16/48/8/32/8                 768/4/32/4
1/2x ROP                120   200   32   4    Y    16/48/8/32/8                 768/4/32/4
2x ROP                  480   200   32   4    Y    16/48/8/32/8                 768/4/32/4
1/2x DRAM               240   100   32   4    Y    16/48/8/32/8                 768/4/32/4
2x DRAM                 240   400   32   4    Y    16/48/8/32/8                 768/4/32/4
No Latency              0     0     32   4    Y    16/48/8/32/8                 768/4/32/4
1/2x L1D Cache          240   200   32   4    Y    8/24/8/32/8                  768/4/32/4
2x L1D Cache            240   200   32   4    Y    32/96/8/32/8                 768/4/32/4
1/2x L2 Cache           240   200   32   4    Y    16/48/8/32/8                 384/4/32/4
2x L2 Cache             240   200   32   4    Y    16/48/8/32/8                 1536/4/32/4
1/2x DRAM Bandwidth     240   200   32   2    Y    16/48/8/32/8                 768/4/32/4
2x DRAM Bandwidth       240   200   32   8    Y    16/48/8/32/8                 768/4/32/4
1/2x Ict + DRAM B/W     240   200   16   2    Y    16/48/8/32/8                 768/4/32/4
2x Ict + DRAM B/W       240   200   64   8    Y    16/48/8/32/8                 768/4/32/4
No Coalesce Penalty     240   200   32   4    N    16/48/8/32/8                 768/4/32/4
NCP + Impr L1 Miss      240   200   32   4    N    16/48/16/64/16               768/4/32/4
NCP + Impr L1+L2 Miss   240   200   32   4    N    16/48/16/64/16               768/8/64/8

Latencies represent number of shader core cycles; cache sizes in kB. ROP = Raster Operations Pipeline (models L2 hit latency); Ict = interconnect (flit size); CP = coalesce penalty; PS = prefer shared mem; PL = prefer L1; MQ = miss queue entries; MS = miss status holding register entries; MM = max MSHR merges.
 
Issue Bin Priority
 
 
Priority for binning each issue cycle:

Did a warp issue?
  yes → 32 active threads?
          yes → full issue
          no  → divergence
  no  → Does a warp with a valid instruction exist?
          no  → Is one at a synchronization barrier? yes → idle: sync barrier
                Is one at a memory barrier?          yes → idle: mem barrier
                Is one stalled for an atomic?        yes → idle: atomic
          yes → Did one such warp fail scoreboard?
                  yes → scoreboard hazard (likely RAW on memory data)
                  no  → pipe stall: control hazard (full functional unit)
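A compact host-side sketch of that priority order (our reading of the flowchart; not GPGPU-Sim source code, and all names are hypothetical):

```cuda
// Hypothetical per-cycle issue-bin classifier mirroring the decision tree.
enum IssueBin { FullIssue, Divergence, IdleSync, IdleMem, IdleAtomic,
                ScoreboardHazard, PipeStall };

IssueBin classify(bool warpIssued, int activeThreads, bool validWarpExists,
                  bool atSyncBarrier, bool atMemBarrier, bool atomicStall,
                  bool failedScoreboard)
{
    if (warpIssued)                        // something issued this cycle
        return activeThreads == 32 ? FullIssue : Divergence;
    if (!validWarpExists) {                // nothing eligible to issue
        if (atSyncBarrier) return IdleSync;
        if (atMemBarrier)  return IdleMem;
        if (atomicStall)   return IdleAtomic;
    }
    return failedScoreboard ? ScoreboardHazard  // likely RAW on memory data
                            : PipeStall;        // control hazard (full FU)
}
```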