A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications

Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, Richard Vuduc
Outline (2/26)
| Motivation
| GPUPerf: Performance analysis framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion
GPGPU Programming (3/26)
| GPGPU architectures have become very powerful.
| Programmers want to convert CPU applications to GPGPU applications.
| Case 1: the GPGPU version runs 10x faster than the CPU version.
| Case 2: the GPGPU version runs only 1.1x faster than the CPU version.
| For case 2, programmers might wonder why the benefit is so poor.
  Maybe the algorithm is not parallelizable
  Maybe the GPGPU code is not well optimized
| For case 1, programmers might wonder whether 10x is the best possible speed-up.
Programmers want to optimize code whenever possible!
 
GPGPU Optimizations (4/26)
| Optimizing parallel programs is difficult^100!
| Most programmers apply optimization techniques one by one.
| Try one more optimization on top of Shared Memory. Which one to choose?
[Chart: normalized performance of single optimizations vs. Shared Memory plus one more; Shared Memory is the best for this kernel, and its combinations are still the best.]
Programmers want to understand the performance benefit!
 
GPGPU Performance Guidance (5/26)
| Providing performance guidance is not easy.
  Program analysis: obtain as much program information as possible
  Performance modeling: have a sophisticated analytical model
  User-friendly metrics: convert the performance analysis information into performance guidance
| We propose GPUPerf, a performance analysis framework that quantitatively predicts potential performance benefits.
| In this talk, we will focus more on performance modeling and potential benefit metrics.
Outline (6/26)
| Motivation
| GPUPerf: Performance Analysis Framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion
 
GPUPerf Overview (7/26)
| What is required for performance guidance?
  Program analysis: Frontend Data Collector
  Performance modeling: Analytical Model
  User-friendly metrics: Performance Advisor
Pipeline: GPGPU Kernel -> Frontend Data Collector -> (ILP, #insts, ...) -> Analytical Model -> (model output) -> Performance Advisor -> Benefit Metrics
For clarity, each component will be explained in reverse order.
 
Performance Advisor (8/26)
| Goal of the performance advisor
  Convey performance bottleneck information
  Estimate the potential gains from reducing the bottlenecks
| The performance advisor provides four potential benefit metrics:
  B_itilp  : benefits of increasing ITILP
  B_memlp  : benefits of increasing MLP
  B_serial : benefits of removing serialization effects
  B_fp     : benefits of improving computing inefficiency
| The benefit metrics are provided by our analytical model, so programmers can get an idea of the potential benefit of a GPGPU kernel.
Previous Work: MWP-CWP Model [Hong and Kim, ISCA’09] (9/26)
| MWP (Memory Warp Parallelism): an indicator of memory-level parallelism
| CWP (Compute Warp Parallelism)
| Depending on MWP and CWP, the execution time is predicted by the model.
| The MWP-CWP model can predict general cases.
| Problem: it did not model corner cases, which are critical for predicting the benefits of different program optimizations!
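For reference, here is a hedged Python sketch of how the two quantities are typically computed in the MWP-CWP model. This is our paraphrase of Hong and Kim's formulation with our own parameter names; consult the ISCA'09 paper for the exact definitions.

```python
def mwp(mem_latency: float, departure_delay: float,
        mwp_peak_bw: float, n_warps: int) -> float:
    """Memory warp parallelism: how many warps can overlap memory accesses."""
    # Capped by peak memory bandwidth and by the number of active warps.
    return min(mem_latency / departure_delay, mwp_peak_bw, n_warps)

def cwp(mem_cycles: float, comp_cycles: float, n_warps: int) -> float:
    """Compute warp parallelism: warps whose computation fits in one memory period."""
    return min((mem_cycles + comp_cycles) / comp_cycles, n_warps)
```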
 
Analytical Model (10/26)
| Our analytical model follows a top-down approach
  Easy-to-interpret model components
  Relate them directly to performance bottlenecks

  T_exec = T_comp + T_mem - T_overlap

  T_exec    : final execution time
  T_comp    : computation time
  T_mem     : memory time
  T_overlap : overlapped time

[Timeline figure: 4 warps with MWP=2; compute and memory periods interleave over time.]
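To make the top-down decomposition concrete, a minimal Python sketch of the top-level equation follows; the three inputs stand in for the submodels developed on the next slides, and the toy numbers are ours.

```python
def t_exec(t_comp: float, t_mem: float, t_overlap: float) -> float:
    """Top-level GPUPerf timing equation: overlapped cycles are counted only once."""
    return t_comp + t_mem - t_overlap

# Toy numbers (cycles): 400 cycles of computation, 300 cycles of memory time,
# of which 250 cycles hide under computation -> 400 + 300 - 250 = 450 cycles.
print(t_exec(400.0, 300.0, 250.0))  # 450.0
```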
Analytical Model (11/26)
| T_comp is the amount of time to execute compute instructions

  T_comp = W_parallel + W_serial

  W_parallel : work executed in parallel (useful work)
  W_serial   : overhead due to serialization effects
Analytical Model (12/26)
| W_parallel is the amount of work that can be executed in parallel

  W_parallel = (total #insts) × (effective inst. throughput)
  Effective inst. throughput = avg_inst_latency / ITILP
                             = f(warp_size, SIMD_width, # pipeline stages)

ITILP represents the number of instructions that can be executed in parallel in the pipeline.
ITILP (Inter-thread ILP) (13/26)
| ITILP is inter-thread ILP: as TLP increases, stall time reduces.
  TLP = 1: ILP = 4/3, so ITILP = 4/3 (low ITILP)
  TLP = 2: ITILP = 8/3
  TLP = 3: ITILP = ITILP_max (execution latency is already all hidden)

  ITILP = MIN(ILP × N, ITILP_max), where N is the TLP
  ITILP_max = avg_inst_lat / (warp_size / SIMD_width)
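A small sketch of the ITILP formula as written above; the 4/3 and 8/3 points from the figure fall out of it directly. The latency and machine parameters are illustrative assumptions, chosen so that ITILP_max = 4.

```python
def itilp(ilp: float, tlp: int, avg_inst_lat: float,
          warp_size: int = 32, simd_width: int = 16) -> float:
    """ITILP = min(ILP * TLP, ITILP_max); capped once execution latency is hidden."""
    itilp_max = avg_inst_lat / (warp_size / simd_width)
    return min(ilp * tlp, itilp_max)

# With per-warp ILP = 4/3 and an illustrative avg_inst_lat of 8 cycles:
for n in (1, 2, 3):
    print(n, itilp(4 / 3, n, avg_inst_lat=8))
# TLP=1 -> 1.33, TLP=2 -> 2.67, TLP=3 -> 4.0 (hits ITILP_max: latency fully hidden)
```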
Analytical Model (14/26)
| W_serial represents the overhead due to serialization effects

  W_serial = O_sync + O_SFU + O_CFDiv + O_bank

  O_sync  : synchronization overhead
  O_SFU   : SFU contention overhead
  O_CFDiv : branch divergence overhead
  O_bank  : bank-conflict overhead
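Putting slides 11 through 14 together, here is a hedged Python sketch of the T_comp submodel. The paper's cycle accounting is more detailed, so treat this as an illustration of how the terms compose rather than as the authors' implementation.

```python
def w_parallel(total_insts: int, avg_inst_lat: float, itilp: float) -> float:
    """Useful parallel work: instruction count times effective cycles per inst."""
    return total_insts * (avg_inst_lat / itilp)

def w_serial(o_sync: float, o_sfu: float, o_cfdiv: float, o_bank: float) -> float:
    """Serialization overheads are modeled as additive."""
    return o_sync + o_sfu + o_cfdiv + o_bank

def t_comp(total_insts: int, avg_inst_lat: float, itilp: float,
           overheads: tuple) -> float:
    return w_parallel(total_insts, avg_inst_lat, itilp) + w_serial(*overheads)

# Toy example: 1000 insts, 8-cycle latency, ITILP = 4, 120 cycles of overheads.
print(t_comp(1000, 8.0, 4.0, (50, 30, 25, 15)))  # 2000 + 120 = 2120 cycles
```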
 
SFU Overhead (15/26)
| GPGPUs have SFUs (special function units) where expensive operations can be executed.
  With a good ratio of regular instructions to SFU instructions, the SFU execution cost can be hidden.
  High inst-to-SFU ratio: SFU latency overlaps with the regular instruction stream.
  Low inst-to-SFU ratio: the latency of SFU instructions is not completely hidden.
Analytical Model (16/26)
| T_mem represents the amount of time spent on memory requests and transfers

  T_mem = (effective memory requests) × AMAT

Example with four memory requests of latency MEM each:
  MWP = 2: T_mem = 4·MEM / 2
  MWP = 1: T_mem = 4·MEM / 1
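The T_mem equation from this slide, as a sketch; dividing the serialized request latency by MWP models the fact that MWP requests are in flight concurrently. Names and numbers are ours.

```python
def t_mem(num_requests: int, amat: float, mwp: float) -> float:
    """Memory time: total request latency shrunk by memory warp parallelism."""
    return (num_requests / mwp) * amat

MEM = 100.0  # illustrative average memory access time (cycles)
print(t_mem(4, MEM, mwp=2))  # 4*MEM/2 = 200.0
print(t_mem(4, MEM, mwp=1))  # 4*MEM/1 = 400.0
```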
Analytical Model (17/26)
| T_overlap represents how much of the memory cost can be hidden by multithreading
  When MWP ≥ CWP (e.g., MWP=3, CWP=3), all the memory costs are overlapped with computation.
Analytical Model (18/26)
| T_overlap also covers the opposite case: when CWP > MWP (e.g., MWP=2, CWP=4), the computation cost is hidden by the memory cost.
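A sketch of the two overlap regimes pictured on slides 17 and 18. This is our reading of the figures, not the paper's exact formula: with enough memory warp parallelism the memory time hides under computation, and otherwise the computation hides under memory time.

```python
def t_overlap(t_comp: float, t_mem: float, mwp: float, cwp: float) -> float:
    """Overlapped cycles between computation and memory (simplified regimes)."""
    if mwp >= cwp:
        # MWP >= CWP: all memory cost overlaps with computation,
        # so T_exec = T_comp + T_mem - T_mem = T_comp.
        return t_mem
    # CWP > MWP: computation cost is hidden by memory cost,
    # so T_exec approaches T_mem.
    return t_comp
```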
Benefit Chart (19/26)
| Time metrics are converted into potential benefit metrics.

  T_fp      : ideal computation cost
  T_mem_min : ideal memory cost

  B_memlp  : benefits of increasing MLP
  B_serial : benefits of removing serialization effects
  B_itilp  : benefits of increasing inter-thread ILP
  B_fp     : benefits of improving computing efficiency

[Potential benefit chart: computation cost shrinks from T_comp toward T_fp; memory cost shrinks from T_mem through T_mem' toward T_mem_min; the gaps correspond to the benefit metrics above, from a single-thread baseline to the optimized kernel.]
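One plausible way to turn the chart into numbers, sketched below. Each benefit is read as the gap between a measured cost and its idealized counterpart; the exact definitions (including what T_mem' denotes) are given in the paper, so every formula here is an assumption.

```python
def benefit_metrics(total_insts: int, avg_inst_lat: float,
                    itilp: float, itilp_max: float, w_serial: float,
                    t_comp: float, t_fp: float,
                    t_mem_prime: float, t_mem_min: float) -> dict:
    """Potential benefit metrics, with assumed (chart-derived) definitions."""
    return {
        # More memory-level parallelism shrinks memory time toward its ideal.
        "B_memlp": t_mem_prime - t_mem_min,
        # Removing serialization effects recovers the W_serial cycles.
        "B_serial": w_serial,
        # Raising ITILP to ITILP_max shrinks W_parallel (slide 12 formula).
        "B_itilp": total_insts * avg_inst_lat * (1 / itilp - 1 / itilp_max),
        # Gap between the current and the ideal computation cost.
        "B_fp": t_comp - t_fp,
    }
```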
 
Frontend Data Collector (20/26)
| Compute Visual Profiler (runs the CUDA executable): architecture-related information from H/W counters, such as #insts, occupancy, global LD/ST requests, and cache info.
| Instruction Analyzer (IA), built on Ocelot [Diamos et al., PACT’10] (runs the Ocelot executable): detailed information from emulating PTX executions, such as #SFU insts, #sync insts, and loop counters.
| Static analysis tools (run on the CUDA binary, CUBIN): ILP, MLP, and related information taken from CUDA binaries instead of PTX, i.e., after low-level compiler optimizations.
The collected information is fed into our analytical model.
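As a sketch of how the three collectors' outputs could be gathered before being fed to the analytical model, here is a hypothetical container type; the field names are ours, not an actual tool interface.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    # From Compute Visual Profiler (H/W counters)
    num_insts: int
    occupancy: float
    global_ldst_requests: int
    # From the Ocelot-based Instruction Analyzer (PTX emulation)
    num_sfu_insts: int
    num_sync_insts: int
    # From static analysis of the CUBIN binary
    ilp: float
    mlp: float
```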
Outline (21/26)
| Motivation
| GPUPerf: A Performance Analysis Framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion
Evaluation (22/26)
| NVIDIA C2050 Fermi architecture
| FMM (Fast Multipole Method): an approximation of the n-body problem [Winner, 2010 Gordon Bell Prize at Supercomputing]
  44 optimization combinations drawn from: Vector Packing, Prefetching, SFU, Loop Unrolling, Shared Memory, and Loop Optimization
| Parboil benchmarks, Reduction (in the paper)
Results - FMM (23/26)
The Vector Packing + Shared Memory + Unroll-Jam + SFU combination shows the best performance.

Results - FMM (24/26)
Our model follows the speed-up trend quite well and correctly pinpoints the best optimization combination for the kernel.
 
Results - Potential Benefits (25/26)
B_fp : computing inefficiency (higher is worse)
| B_fp implies that the kernel could be improved via optimizations.
| A small B_fp value indicates that adding Prefetching (Pref) does not lead to further performance improvement.
Conclusion (26/26)
| We propose GPUPerf, a performance analysis framework consisting of a frontend data collector, an analytical model, and a performance advisor.
| The performance advisor provides potential benefit metrics (B_memlp, B_serial, B_itilp, B_fp), which can guide performance tuning for GPGPU code.
  All 44 optimization combinations in FMM are well predicted.
| Future work: the outputs of the performance benefit advisor can be inputs to compilers.
Slide Note

Today, I am going to present our work, "A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications".

This work was done with Aniruddha Dasgupta, Professor Hyesoon Kim, and Professor Richard Vuduc.


