A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications

Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, Richard Vuduc
Outline (2/26)
| Motivation
| GPUPerf: Performance analysis framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion
GPGPU Programming (3/26)
| GPGPU architectures have become very powerful.
| Programmers want to convert CPU applications to GPGPU applications.
| Case 1: the GPGPU version runs 10x faster than the CPU version.
| Case 2: the GPGPU version runs only 1.1x faster than the CPU version.
| For case 2, programmers might wonder why the benefit is so poor.
  Maybe the algorithm is not parallelizable
  Maybe the GPGPU code is not well optimized
| For case 1, programmers might wonder whether 10x is the best possible speed-up.
Programmers want to optimize code whenever possible!
 
GPGPU Optimizations (4/26)
| Optimizing parallel programs is difficult^100!
| Most programmers apply optimization techniques one by one.
| Try one more optimization on top of Shared Memory. Which one to choose?
[Chart: normalized performance of single optimizations vs. Shared Memory plus one more; Shared Memory is the best for this kernel, and its combinations are still the best.]
Programmers want to understand the performance benefit!
 
GPGPU Performance Guidance (5/26)
| Providing performance guidance is not easy.
  Program analysis: obtain as much program information as possible
  Performance modeling: have a sophisticated analytical model
  User-friendly metrics: convert the performance analysis information into performance guidance
| We propose GPUPerf, a performance analysis framework that quantitatively predicts potential performance benefits.
| In this talk, we will focus more on performance modeling and potential benefit metrics.
Outline (6/26)
| Motivation
| GPUPerf: Performance Analysis Framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion
 
GPUPerf Overview (7/26)
| What is required for performance guidance?
  Program analysis: Frontend Data Collector
  Performance modeling: Analytical Model
  User-friendly metrics: Performance Advisor
Pipeline: GPGPU Kernel -> Frontend Data Collector -> (ILP, #insts, ...) -> Analytical Model -> (model output) -> Performance Advisor -> Benefit Metrics
For clarity, each component will be explained in reverse order.
 
Performance Advisor (8/26)
| Goal of the performance advisor
  Convey performance bottleneck information
  Estimate the potential gains from reducing the bottlenecks
| The performance advisor provides four potential benefit metrics:
  B_itilp  : benefits of increasing ITILP
  B_memlp  : benefits of increasing MLP
  B_serial : benefits of removing serialization effects
  B_fp     : benefits of improving computing inefficiency
| The benefit metrics are provided by our analytical model, so programmers can get an idea of the potential benefit of a GPGPU kernel.
Previous Work: MWP-CWP Model [Hong and Kim, ISCA’09] (9/26)
| MWP (Memory Warp Parallelism): an indicator of memory-level parallelism
| CWP (Compute Warp Parallelism)
| Depending on MWP and CWP, the execution time is predicted by the model.
| The MWP-CWP model can predict general cases.
| Problem: it did not model corner cases, which are critical for predicting the benefits of different program optimizations!
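For reference, here is a hedged Python sketch of how the two quantities are typically computed in the MWP-CWP model. This is our paraphrase of Hong and Kim's formulation with our own parameter names; consult the ISCA'09 paper for the exact definitions.

```python
def mwp(mem_latency: float, departure_delay: float,
        mwp_peak_bw: float, n_warps: int) -> float:
    """Memory warp parallelism: how many warps can overlap memory accesses."""
    # Capped by peak memory bandwidth and by the number of active warps.
    return min(mem_latency / departure_delay, mwp_peak_bw, n_warps)

def cwp(mem_cycles: float, comp_cycles: float, n_warps: int) -> float:
    """Compute warp parallelism: warps whose computation fits in one memory period."""
    return min((mem_cycles + comp_cycles) / comp_cycles, n_warps)
```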
 
Analytical Model (10/26)
| Our analytical model follows a top-down approach
  Easy-to-interpret model components
  Relate them directly to performance bottlenecks

  T_exec = T_comp + T_mem - T_overlap

  T_exec    : final execution time
  T_comp    : computation time
  T_mem     : memory time
  T_overlap : overlapped time

[Timeline figure: 4 warps with MWP=2; compute and memory periods interleave over time.]
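To make the top-down decomposition concrete, a minimal Python sketch of the top-level equation follows; the three inputs stand in for the submodels developed on the next slides, and the toy numbers are ours.

```python
def t_exec(t_comp: float, t_mem: float, t_overlap: float) -> float:
    """Top-level GPUPerf timing equation: overlapped cycles are counted only once."""
    return t_comp + t_mem - t_overlap

# Toy numbers (cycles): 400 cycles of computation, 300 cycles of memory time,
# of which 250 cycles hide under computation -> 400 + 300 - 250 = 450 cycles.
print(t_exec(400.0, 300.0, 250.0))  # 450.0
```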
Analytical Model (11/26)
| T_comp is the amount of time to execute compute instructions

  T_comp = W_parallel + W_serial

  W_parallel : work executed in parallel (useful work)
  W_serial   : overhead due to serialization effects
Analytical Model (12/26)
| W_parallel is the amount of work that can be executed in parallel

  W_parallel = (total #insts) × (effective inst. throughput)
  Effective inst. throughput = avg_inst_latency / ITILP
                             = f(warp_size, SIMD_width, # pipeline stages)

ITILP represents the number of instructions that can be executed in parallel in the pipeline.
ITILP (Inter-thread ILP) (13/26)
| ITILP is inter-thread ILP: as TLP increases, stall time reduces.
  TLP = 1: ILP = 4/3, so ITILP = 4/3 (low ITILP)
  TLP = 2: ITILP = 8/3
  TLP = 3: ITILP = ITILP_max (execution latency is already all hidden)

  ITILP = MIN(ILP × N, ITILP_max), where N is the TLP
  ITILP_max = avg_inst_lat / (warp_size / SIMD_width)
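A small sketch of the ITILP formula as written above; the 4/3 and 8/3 points from the figure fall out of it directly. The latency and machine parameters are illustrative assumptions, chosen so that ITILP_max = 4.

```python
def itilp(ilp: float, tlp: int, avg_inst_lat: float,
          warp_size: int = 32, simd_width: int = 16) -> float:
    """ITILP = min(ILP * TLP, ITILP_max); capped once execution latency is hidden."""
    itilp_max = avg_inst_lat / (warp_size / simd_width)
    return min(ilp * tlp, itilp_max)

# With per-warp ILP = 4/3 and an illustrative avg_inst_lat of 8 cycles:
for n in (1, 2, 3):
    print(n, itilp(4 / 3, n, avg_inst_lat=8))
# TLP=1 -> 1.33, TLP=2 -> 2.67, TLP=3 -> 4.0 (hits ITILP_max: latency fully hidden)
```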
Analytical Model (14/26)
| W_serial represents the overhead due to serialization effects

  W_serial = O_sync + O_SFU + O_CFDiv + O_bank

  O_sync  : synchronization overhead
  O_SFU   : SFU contention overhead
  O_CFDiv : branch divergence overhead
  O_bank  : bank-conflict overhead
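Putting slides 11 through 14 together, here is a hedged Python sketch of the T_comp submodel. The paper's cycle accounting is more detailed, so treat this as an illustration of how the terms compose rather than as the authors' implementation.

```python
def w_parallel(total_insts: int, avg_inst_lat: float, itilp: float) -> float:
    """Useful parallel work: instruction count times effective cycles per inst."""
    return total_insts * (avg_inst_lat / itilp)

def w_serial(o_sync: float, o_sfu: float, o_cfdiv: float, o_bank: float) -> float:
    """Serialization overheads are modeled as additive."""
    return o_sync + o_sfu + o_cfdiv + o_bank

def t_comp(total_insts: int, avg_inst_lat: float, itilp: float,
           overheads: tuple) -> float:
    return w_parallel(total_insts, avg_inst_lat, itilp) + w_serial(*overheads)

# Toy example: 1000 insts, 8-cycle latency, ITILP = 4, 120 cycles of overheads.
print(t_comp(1000, 8.0, 4.0, (50, 30, 25, 15)))  # 2000 + 120 = 2120 cycles
```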
 
SFU Overhead (15/26)
| GPGPUs have SFUs (special function units) where expensive operations can be executed.
  With a good ratio of regular instructions to SFU instructions, the SFU execution cost can be hidden.
  High inst-to-SFU ratio: SFU latency overlaps with the regular instruction stream.
  Low inst-to-SFU ratio: the latency of SFU instructions is not completely hidden.
Analytical Model (16/26)
| T_mem represents the amount of time spent on memory requests and transfers

  T_mem = (effective memory requests) × AMAT

Example with four memory requests of latency MEM each:
  MWP = 2: T_mem = 4·MEM / 2
  MWP = 1: T_mem = 4·MEM / 1
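The T_mem equation from this slide, as a sketch; dividing the serialized request latency by MWP models the fact that MWP requests are in flight concurrently. Names and numbers are ours.

```python
def t_mem(num_requests: int, amat: float, mwp: float) -> float:
    """Memory time: total request latency shrunk by memory warp parallelism."""
    return (num_requests / mwp) * amat

MEM = 100.0  # illustrative average memory access time (cycles)
print(t_mem(4, MEM, mwp=2))  # 4*MEM/2 = 200.0
print(t_mem(4, MEM, mwp=1))  # 4*MEM/1 = 400.0
```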
Analytical Model (17/26)
| T_overlap represents how much of the memory cost can be hidden by multithreading
  When MWP ≥ CWP (e.g., MWP=3, CWP=3), all the memory costs are overlapped with computation.
Analytical Model (18/26)
| T_overlap also covers the opposite case: when CWP > MWP (e.g., MWP=2, CWP=4), the computation cost is hidden by the memory cost.
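A sketch of the two overlap regimes pictured on slides 17 and 18. This is our reading of the figures, not the paper's exact formula: with enough memory warp parallelism the memory time hides under computation, and otherwise the computation hides under memory time.

```python
def t_overlap(t_comp: float, t_mem: float, mwp: float, cwp: float) -> float:
    """Overlapped cycles between computation and memory (simplified regimes)."""
    if mwp >= cwp:
        # MWP >= CWP: all memory cost overlaps with computation,
        # so T_exec = T_comp + T_mem - T_mem = T_comp.
        return t_mem
    # CWP > MWP: computation cost is hidden by memory cost,
    # so T_exec approaches T_mem.
    return t_comp
```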
Benefit Chart (19/26)
| Time metrics are converted into potential benefit metrics.

  T_fp      : ideal computation cost
  T_mem_min : ideal memory cost

  B_memlp  : benefits of increasing MLP
  B_serial : benefits of removing serialization effects
  B_itilp  : benefits of increasing inter-thread ILP
  B_fp     : benefits of improving computing efficiency

[Potential benefit chart: computation cost shrinks from T_comp toward T_fp; memory cost shrinks from T_mem through T_mem' toward T_mem_min; the gaps correspond to the benefit metrics above, from a single-thread baseline to the optimized kernel.]
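One plausible way to turn the chart into numbers, sketched below. Each benefit is read as the gap between a measured cost and its idealized counterpart; the exact definitions (including what T_mem' denotes) are given in the paper, so every formula here is an assumption.

```python
def benefit_metrics(total_insts: int, avg_inst_lat: float,
                    itilp: float, itilp_max: float, w_serial: float,
                    t_comp: float, t_fp: float,
                    t_mem_prime: float, t_mem_min: float) -> dict:
    """Potential benefit metrics, with assumed (chart-derived) definitions."""
    return {
        # More memory-level parallelism shrinks memory time toward its ideal.
        "B_memlp": t_mem_prime - t_mem_min,
        # Removing serialization effects recovers the W_serial cycles.
        "B_serial": w_serial,
        # Raising ITILP to ITILP_max shrinks W_parallel (slide 12 formula).
        "B_itilp": total_insts * avg_inst_lat * (1 / itilp - 1 / itilp_max),
        # Gap between the current and the ideal computation cost.
        "B_fp": t_comp - t_fp,
    }
```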
 
Frontend Data Collector (20/26)
| Compute Visual Profiler (runs the CUDA executable): architecture-related information from H/W counters, such as #insts, occupancy, global LD/ST requests, and cache info.
| Instruction Analyzer (IA), built on Ocelot [Diamos et al., PACT’10] (runs the Ocelot executable): detailed information from emulating PTX executions, such as #SFU insts, #sync insts, and loop counters.
| Static analysis tools (run on the CUDA binary, CUBIN): ILP, MLP, and related information taken from CUDA binaries instead of PTX, i.e., after low-level compiler optimizations.
The collected information is fed into our analytical model.
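As a sketch of how the three collectors' outputs could be gathered before being fed to the analytical model, here is a hypothetical container type; the field names are ours, not an actual tool interface.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    # From Compute Visual Profiler (H/W counters)
    num_insts: int
    occupancy: float
    global_ldst_requests: int
    # From the Ocelot-based Instruction Analyzer (PTX emulation)
    num_sfu_insts: int
    num_sync_insts: int
    # From static analysis of the CUBIN binary
    ilp: float
    mlp: float
```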
Outline (21/26)
| Motivation
| GPUPerf: A Performance Analysis Framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion
Evaluation (22/26)
| NVIDIA C2050 Fermi architecture
| FMM (Fast Multipole Method): an approximation of the n-body problem [Winner, 2010 Gordon Bell Prize at Supercomputing]
  44 optimization combinations drawn from: Vector Packing, Prefetching, SFU, Loop Unrolling, Shared Memory, and Loop Optimization
| Parboil benchmarks, Reduction (in the paper)
Results - FMM (23/26)
The Vector Packing + Shared Memory + Unroll-Jam + SFU combination shows the best performance.

Results - FMM (24/26)
Our model follows the speed-up trend quite well and correctly pinpoints the best optimization combination for the kernel.
 
Results - Potential Benefits (25/26)
B_fp : computing inefficiency (higher is worse)
| B_fp implies that the kernel could be improved via optimizations.
| A small B_fp value indicates that adding Prefetching (Pref) does not lead to further performance improvement.
Conclusion (26/26)
| We propose GPUPerf, a performance analysis framework consisting of a frontend data collector, an analytical model, and a performance advisor.
| The performance advisor provides potential benefit metrics (B_memlp, B_serial, B_itilp, B_fp), which can guide performance tuning for GPGPU code.
  All 44 optimization combinations in FMM are well predicted.
| Future work: the outputs of the performance benefit advisor can be inputs to compilers.
Slide Note

Today, I am going to present our work, "A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications".

This work was done with Aniruddha Dasgupta, Professor Hyesoon Kim, and Professor Richard Vuduc.


