Advanced GPU Performance Modeling Techniques

GPUMech: GPU Performance Modeling Technique Based on Interval Analysis

Jen-Cheng Huang, Joo Hwan Lee, Hyesoon Kim, Hsien-Hsin S. Lee
Motivation

How do we find the bottlenecks of a given hardware configuration?

[Figure: performance vs. TLP; performance stops scaling once register file size and resource contention become the limit]
Motivation

Detailed timing simulation
Slow: the measured simulation slowdown is 80,000x compared to the NVIDIA Quadro processor [Huang '14], and it grows as more cores are integrated

Prior GPGPU analytical models
Fast at predicting the performance trend
High error and low resolution on hardware bottlenecks

Can we have high modeling speed without sacrificing much accuracy?
GPUMech

Balance between accuracy and efficiency
Leverage both functional simulation and analytical modeling:
Use functional simulation to capture the full execution behavior of a warp
Use analytical modeling to model the performance of the GPU

Find the sources of performance bottlenecks
Visualize the performance bottlenecks through a CPI stack
 
Outline
 
 
Motivation
 
GPUMech
 
Evaluation
 
Conclusion
 
GPUMech

1. Warp Profiling
2. Warp Selection
3. Performance Modeling
4. Bottleneck Visualization
Warp Profiling

Warp profiling collects the interval information (execution behavior) of a warp.
The memory stall events are captured by cache simulation.
The entire execution of a warp is profiled.
The interval concept is inspired by prior single-threaded CPU interval models [Karkhanis '04], [Eyerman '09].

[Figure: issue rate of one warp over time; compute phases ended by a dependency stall and by an L2 miss stall (memory dependency) form Interval 1 and Interval 2]
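As a rough illustration of what warp profiling produces (not the authors' data structures), each interval can be thought of as a run of issued instructions ended by a stall event; the field names below are hypothetical:

from dataclasses import dataclass

@dataclass
class Interval:
    insts: int          # instructions issued during the interval
    stall_type: str     # event that ends the interval, e.g. "dependency" or "l2_miss"
    stall_cycles: int   # latency of that stall event

# A warp profile is the ordered list of its intervals, collected by functional
# simulation plus cache simulation for the memory stall events (values illustrative).
profile = [Interval(12, "dependency", 3), Interval(20, "l2_miss", 200)]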
Warp Selection

Select the representative warp for modeling the performance of multithreading (crucial to reducing modeling errors).
Although most warps show similar intervals, some warps may have different intervals due to control flow divergence.

[Figure: interval streams of several warps (W1, W3, W5, ...) over time; most repeat Intervals A, B, C, but some differ in iteration count or take a different execution path; one warp is chosen as the representative warp]
Warp Selection

Approach: use the k-means clustering algorithm to select the warp
Two-dimensional feature vector: warp instruction count and warp performance
Two clusters
Select the warp closest to the center of the largest cluster (see the sketch below)
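A minimal sketch of this selection step, assuming scikit-learn is available and each warp is summarized by an (instruction count, performance) pair; the variable names are illustrative, not the authors' code:

import numpy as np
from sklearn.cluster import KMeans

def select_representative_warp(warp_insts, warp_perf):
    """Return the index of the warp closest to the centroid of the larger cluster."""
    features = np.column_stack([warp_insts, warp_perf])   # one row per warp
    km = KMeans(n_clusters=2, n_init=10).fit(features)
    largest = np.bincount(km.labels_).argmax()             # label of the bigger cluster
    members = np.flatnonzero(km.labels_ == largest)
    dists = np.linalg.norm(features[members] - km.cluster_centers_[largest], axis=1)
    return int(members[dists.argmin()])

In practice the two features would likely be normalized before clustering, since instruction counts and performance values live on very different scales.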
Performance Modeling

Performance modeling has two parts: modeling multithreading and modeling resource contention.
Modeling Multithreading

Model the scheduling policy to get the performance of multithreading.

[Figure: the representative warp is replicated into N warps, all with the same intervals; modeling the Round-Robin or Greedy-Then-Oldest scheduling policy over them gives the performance of the multithreaded hardware]
Modeling Round-Robin Policy

Round-Robin is a popular baseline scheduling policy for GPU architectures.

[Figure: the interval of a warp: instructions issue at the issue rate until a memory instruction triggers stall cycles]

Running example: 4 warps on one SM, issue rate of 1 instruction/cycle.
Modeling Round-Robin Policy

[Figure: round-robin timeline of W1-W4; when W1 stalls, the instructions the other warps issue during the stall are hiding instructions, while those issued beyond the stall (3 cycles + 6 cycles in the example) are non-hiding instructions]
 
Modeling Round-Robin Policy

CPI = (total_cycles_singlewarp + total_nonhiding_insts / issue_rate) / (#warps × total_warp_insts_singlewarp)

Per interval: #nonhiding_insts = #waiting_slots × issue_prob × (#warps - 1)
Waiting slot: the time period between scheduling two instructions of the same warp
Issue probability: the probability of issuing an instruction when a warp is scheduled

The hiding instructions are absorbed by the stall cycles; the non-hiding instructions add additional issue cycles.
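For intuition only, a heavily simplified sketch of computing a round-robin CPI from the representative warp's intervals; it approximates the non-hiding instructions per interval as whatever the other warps issue beyond the stall's hiding capacity, whereas the paper derives them from waiting slots and issue probability, and all names are illustrative:

def round_robin_cpi(intervals, num_warps, issue_rate=1.0):
    """intervals: list of (insts, stall_cycles) pairs of the representative warp."""
    total_warp_insts = sum(insts for insts, _ in intervals)
    total_cycles_singlewarp = sum(insts / issue_rate + stall
                                  for insts, stall in intervals)
    total_nonhiding = 0.0
    for insts, stall in intervals:
        # Instructions the other warps issue during this warp's stall hide it;
        # anything beyond the stall's capacity is non-hiding and adds issue cycles.
        hiding_capacity = stall * issue_rate
        other_warp_insts = insts * (num_warps - 1)
        total_nonhiding += max(other_warp_insts - hiding_capacity, 0.0)
    total_cycles = total_cycles_singlewarp + total_nonhiding / issue_rate
    return total_cycles / (num_warps * total_warp_insts)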
Modeling Greedy-Then-Oldest Policy

Greedy-Then-Oldest (GTO): issue from one warp until it stalls, then switch to the oldest ready warp.

[Figure: GTO timeline of W1-W4; when W1 stalls, the instructions the other warps issue during the stall are hiding instructions, and the remainder are non-hiding instructions]
Modeling Greedy-Then-Oldest Policy

The CPI has the same form as for round-robin:
CPI = (total_cycles_singlewarp + total_nonhiding_insts / issue_rate) / (#warps × total_warp_insts_singlewarp)

Per interval:
#nonhiding_insts = max(issue_insts_in_stall - num_stall_cycles, 0)
issue_insts_in_stall = avg_interval_insts × #issue_warps_in_stall
#issue_warps_in_stall = issue_prob_in_stall × (#warps - 1)
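A per-interval sketch following the definitions above; the names are illustrative and issue_prob_in_stall is taken as a profiled input:

def gto_nonhiding_insts(avg_interval_insts, num_stall_cycles,
                        issue_prob_in_stall, num_warps):
    # Other warps that actually get to issue while the greedy warp is stalled.
    issue_warps_in_stall = issue_prob_in_stall * (num_warps - 1)
    # Instructions those warps issue during the stall.
    issue_insts_in_stall = avg_interval_insts * issue_warps_in_stall
    # Only instructions that spill past the stall cycles are non-hiding.
    return max(issue_insts_in_stall - num_stall_cycles, 0.0)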
Modeling Resource Contention

Model the resource contention to get the additional stall cycles caused by queuing delays.

In GPGPU workloads, memory divergence is the main source of resource contention in the memory system.
E.g., a single instruction can access up to 32 cache blocks.

GPUs have a very limited number of MSHR entries.
E.g., NVIDIA Fermi has 64 MSHR entries shared by 48 warps [Nugteren '14].

Memory divergence can congest the MSHR entries, incurring additional stall cycles.
Example of MSHR Contention

When the memory requests are well coalesced, the number of MSHR entries is sufficient.

[Figure: with coalesced accesses, each of Warps 1-4 issues 1 memory request; all requests fit in the MSHR entries and each warp's stall cycles are just the miss latency]

In a typical GPU design, the number of MSHR entries is sufficient if the accesses are well coalesced.
Example of MSHR Contention

[Figure: with uncoalesced accesses, each of Warps 1-4 issues 2 memory requests; the MSHR entries fill up, so W3 and W4 incur additional stall cycles, and stall cycles grow with the warp ID]

How much queuing delay does a warp incur?
Modeling of MSHR Contention

exp_queuing_delay_i = avg_miss_latency × (exp_occupancy_i - 1)
(an expected occupancy below 1 means no queuing delay)

exp_occupancy_i = (sum over j = 1..i of #core_reqs_j) / #MSHR
#core_reqs_i = #warp_mem_req_i × #warps

As the warp count increases, the queuing delay increases; the position of a memory request in the MSHR is estimated probabilistically.

The detailed DRAM queuing delay model is in the paper.
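A small sketch of the MSHR queuing-delay estimate as written above, assuming the representative warp's per-memory-instruction request counts are known from profiling; all names are illustrative:

def mshr_queuing_delays(warp_mem_reqs, num_warps, num_mshr, avg_miss_latency):
    """warp_mem_reqs: requests generated per memory instruction of the warp,
    one entry per request position i."""
    delays = []
    cumulative_core_reqs = 0.0
    for reqs in warp_mem_reqs:
        cumulative_core_reqs += reqs * num_warps        # core-level requests so far
        occupancy = cumulative_core_reqs / num_mshr     # expected MSHR occupancy
        # An occupancy below 1 means the request finds a free MSHR entry.
        delays.append(avg_miss_latency * max(occupancy - 1.0, 0.0))
    return delays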
Bottleneck Visualization

[Figure: the CPI stack of the representative warp is transformed by modeling multithreading and modeling resource contention into the final CPI stack, with components such as issue latency, dependency stalls, DRAM latency, DRAM queuing delay, and MSHR queuing delay]
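Purely to illustrate the visualization, a CPI stack like this can be drawn as a stacked bar chart; the component values below are made up, not results from the paper:

import matplotlib.pyplot as plt

# Hypothetical CPI-stack components (cycles per instruction); illustrative only.
components = {
    "Issue latency": 1.0,
    "Dependency": 0.6,
    "DRAM latency": 1.8,
    "DRAM queuing delay": 0.9,
    "MSHR queuing delay": 1.2,
}

bottom = 0.0
for name, cpi in components.items():
    plt.bar("kernel", cpi, bottom=bottom, label=name)  # stack each component
    bottom += cpi
plt.ylabel("CPI")
plt.legend()
plt.show()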
Example of Bottleneck Visualization

[Figure: normalized CPI stacks (issue cycles, dependencies, L1 access, L2 access, DRAM access latency, MSHR queuing delay, DRAM queuing delay) for the cfd compute_flux kernel at 8, 16, 32, and 48 warps (growing register file size); the stacks expose the performance trend and guide hardware design: cache behavior stays similar across warp counts, the caches are ineffective for intervals with high memory divergence, and increasing the MSHR count pays off more than enlarging the register file]
Evaluation Methodology

Simulate a Fermi-like architecture (round-robin policy)
Compare with detailed timing simulation
Simulate 40 kernels, from low to high MPKI, from the Rodinia, Parboil, and NVIDIA SDK benchmark suites

Evaluated models:
Markov_Chain: Markov-chain based model [Chen '08]
Naive_Interval: single_warp_IPC × number of warps
MT: modeling the RR policy
MT+MSHR: modeling the RR policy + MSHR contention
GPUMech (MT+MSHR+DRAM): modeling the RR policy + MSHR contention + DRAM bandwidth
Results

[Figure: prediction error of Markov_Chain, Naive_Interval, MT, MT+MSHR, and GPUMech (MT+MSHR+DRAM) at 8, 16, 32, and 48 warps (register file size); resource contention increases with the warp count, and at low contention the additional queuing cycles are small]

GPUMech has low error across all configurations.
Average error = 13.2% (max: 35.0%); speedup = 97x.
Results for varying MSHR size and DRAM bandwidth are in the paper.
Conclusion

Proposed the first performance modeling framework to construct GPU CPI stacks to identify performance bottlenecks.

GPUMech achieves:
97x average speedup compared to detailed timing simulation
13.2% error for modeling the round-robin scheduling policy
14.0% error for modeling the greedy-then-oldest policy

Models resource contention for the MSHR and DRAM:
Essential to improving accuracy
The model can be extended to other resource components

GPUMech can be used to guide software optimizations as well as GPU hardware design-space exploration.
Thank you!
 
Slide Note

Hi, my name is Jen-Cheng Huang. Today I am going to talk about our GPU performance modeling framework, which is used to find the performance bottlenecks of a GPU architecture. This work was done with my colleague Joo Hwan Lee, Prof. Hyesoon Kim, and Prof. Hsien-Hsin Lee.


