Advanced GPU Performance Modeling Techniques

GPUMech: GPU Performance Modeling Technique Based on Interval Analysis

Jen-Cheng Huang, Joo Hwan Lee, Hyesoon Kim, Hsien-Hsin S. Lee
Motivation

How do we find the bottlenecks of a given hardware configuration?

[Figure: performance vs. TLP; performance stops scaling once register file size and resource contention become the limit]
Motivation

Detailed timing simulation
Slow: the measured simulation slowdown is 80,000x compared to the NVIDIA Quadro processor [Huang '14], and it grows as more cores are integrated

Prior GPGPU analytical models
Fast at predicting the performance trend
High error and low resolution on hardware bottlenecks

Can we have high modeling speed without sacrificing much accuracy?
GPUMech

Balance between accuracy and efficiency
Leverage both functional simulation and analytical modeling:
Use functional simulation to capture the full execution behavior of a warp
Use analytical modeling to model the performance of the GPU

Find the sources of performance bottlenecks
Visualize the performance bottlenecks through a CPI stack
 
Outline
 
 
Motivation
 
GPUMech
 
Evaluation
 
Conclusion
 
GPUMech

1. Warp Profiling
2. Warp Selection
3. Performance Modeling
4. Bottleneck Visualization
Warp Profiling

Warp profiling collects the interval information (execution behavior) of a warp.
The memory stall events are captured by cache simulation.
The entire execution of a warp is profiled.
The interval concept is inspired by prior single-threaded CPU interval models [Karkhanis '04], [Eyerman '09].

[Figure: issue rate of one warp over time; compute phases ended by a dependency stall and by an L2 miss stall (memory dependency) form Interval 1 and Interval 2]
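As a rough illustration of what warp profiling produces (not the authors' data structures), each interval can be thought of as a run of issued instructions ended by a stall event; the field names below are hypothetical:

from dataclasses import dataclass

@dataclass
class Interval:
    insts: int          # instructions issued during the interval
    stall_type: str     # event that ends the interval, e.g. "dependency" or "l2_miss"
    stall_cycles: int   # latency of that stall event

# A warp profile is the ordered list of its intervals, collected by functional
# simulation plus cache simulation for the memory stall events (values illustrative).
profile = [Interval(12, "dependency", 3), Interval(20, "l2_miss", 200)]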
Warp Selection

Select the representative warp for modeling the performance of multithreading (crucial to reducing modeling errors).
Although most warps show similar intervals, some warps may have different intervals due to control flow divergence.

[Figure: interval streams of several warps (W1, W3, W5, ...) over time; most repeat Intervals A, B, C, but some differ in iteration count or take a different execution path; one warp is chosen as the representative warp]
Warp Selection

Approach: use the k-means clustering algorithm to select the warp
Two-dimensional feature vector: warp instruction count and warp performance
Two clusters
Select the warp closest to the center of the largest cluster (see the sketch below)
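A minimal sketch of this selection step, assuming scikit-learn is available and each warp is summarized by an (instruction count, performance) pair; the variable names are illustrative, not the authors' code:

import numpy as np
from sklearn.cluster import KMeans

def select_representative_warp(warp_insts, warp_perf):
    """Return the index of the warp closest to the centroid of the larger cluster."""
    features = np.column_stack([warp_insts, warp_perf])   # one row per warp
    km = KMeans(n_clusters=2, n_init=10).fit(features)
    largest = np.bincount(km.labels_).argmax()             # label of the bigger cluster
    members = np.flatnonzero(km.labels_ == largest)
    dists = np.linalg.norm(features[members] - km.cluster_centers_[largest], axis=1)
    return int(members[dists.argmin()])

In practice the two features would likely be normalized before clustering, since instruction counts and performance values live on very different scales.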
Performance Modeling

Performance modeling has two parts: modeling multithreading and modeling resource contention.
Modeling Multithreading

Model the scheduling policy to get the performance of multithreading.

[Figure: the representative warp is replicated into N warps, all with the same intervals; modeling the Round-Robin or Greedy-Then-Oldest scheduling policy over them gives the performance of the multithreaded hardware]
Modeling Round-Robin Policy

Round-Robin is a popular baseline scheduling policy for GPU architectures.

[Figure: the interval of a warp: instructions issue at the issue rate until a memory instruction triggers stall cycles]

Running example: 4 warps on one SM, issue rate of 1 instruction/cycle.
Modeling Round-Robin Policy

[Figure: round-robin timeline of W1-W4; when W1 stalls, the instructions the other warps issue during the stall are hiding instructions, while those issued beyond the stall (3 cycles + 6 cycles in the example) are non-hiding instructions]
 
Modeling Round-Robin Policy

CPI = (total_cycles_singlewarp + total_nonhiding_insts / issue_rate) / (#warps × total_warp_insts_singlewarp)

Per interval: #nonhiding_insts = #waiting_slots × issue_prob × (#warps - 1)
Waiting slot: the time period between scheduling two instructions of the same warp
Issue probability: the probability of issuing an instruction when a warp is scheduled

The hiding instructions are absorbed by the stall cycles; the non-hiding instructions add additional issue cycles.
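For intuition only, a heavily simplified sketch of computing a round-robin CPI from the representative warp's intervals; it approximates the non-hiding instructions per interval as whatever the other warps issue beyond the stall's hiding capacity, whereas the paper derives them from waiting slots and issue probability, and all names are illustrative:

def round_robin_cpi(intervals, num_warps, issue_rate=1.0):
    """intervals: list of (insts, stall_cycles) pairs of the representative warp."""
    total_warp_insts = sum(insts for insts, _ in intervals)
    total_cycles_singlewarp = sum(insts / issue_rate + stall
                                  for insts, stall in intervals)
    total_nonhiding = 0.0
    for insts, stall in intervals:
        # Instructions the other warps issue during this warp's stall hide it;
        # anything beyond the stall's capacity is non-hiding and adds issue cycles.
        hiding_capacity = stall * issue_rate
        other_warp_insts = insts * (num_warps - 1)
        total_nonhiding += max(other_warp_insts - hiding_capacity, 0.0)
    total_cycles = total_cycles_singlewarp + total_nonhiding / issue_rate
    return total_cycles / (num_warps * total_warp_insts)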
Modeling Greedy-Then-Oldest Policy

Greedy-Then-Oldest (GTO): issue from one warp until it stalls, then switch to the oldest ready warp.

[Figure: GTO timeline of W1-W4; when W1 stalls, the instructions the other warps issue during the stall are hiding instructions, and the remainder are non-hiding instructions]
Modeling Greedy-Then-Oldest Policy

The CPI has the same form as for round-robin:
CPI = (total_cycles_singlewarp + total_nonhiding_insts / issue_rate) / (#warps × total_warp_insts_singlewarp)

Per interval:
#nonhiding_insts = max(issue_insts_in_stall - num_stall_cycles, 0)
issue_insts_in_stall = avg_interval_insts × #issue_warps_in_stall
#issue_warps_in_stall = issue_prob_in_stall × (#warps - 1)
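A per-interval sketch following the definitions above; the names are illustrative and issue_prob_in_stall is taken as a profiled input:

def gto_nonhiding_insts(avg_interval_insts, num_stall_cycles,
                        issue_prob_in_stall, num_warps):
    # Other warps that actually get to issue while the greedy warp is stalled.
    issue_warps_in_stall = issue_prob_in_stall * (num_warps - 1)
    # Instructions those warps issue during the stall.
    issue_insts_in_stall = avg_interval_insts * issue_warps_in_stall
    # Only instructions that spill past the stall cycles are non-hiding.
    return max(issue_insts_in_stall - num_stall_cycles, 0.0)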
Modeling Resource Contention

Model the resource contention to get the additional stall cycles caused by queuing delays.

In GPGPU workloads, memory divergence is the main source of resource contention in the memory system.
E.g., a single instruction can access up to 32 cache blocks.

GPUs have a very limited number of MSHR entries.
E.g., NVIDIA Fermi has 64 MSHR entries shared by 48 warps [Nugteren '14].

Memory divergence can congest the MSHR entries, incurring additional stall cycles.
Example of MSHR Contention

When the memory requests are well coalesced, the number of MSHR entries is sufficient.

[Figure: with coalesced accesses, each of Warps 1-4 issues 1 memory request; all requests fit in the MSHR entries and each warp's stall cycles are just the miss latency]

In a typical GPU design, the number of MSHR entries is sufficient if the accesses are well coalesced.
Example of MSHR Contention

[Figure: with uncoalesced accesses, each of Warps 1-4 issues 2 memory requests; the MSHR entries fill up, so W3 and W4 incur additional stall cycles, and stall cycles grow with the warp ID]

How much queuing delay does a warp incur?
Modeling of MSHR Contention

exp_queuing_delay_i = avg_miss_latency × (exp_occupancy_i - 1)
(an expected occupancy below 1 means no queuing delay)

exp_occupancy_i = (sum over j = 1..i of #core_reqs_j) / #MSHR
#core_reqs_i = #warp_mem_req_i × #warps

As the warp count increases, the queuing delay increases; the position of a memory request in the MSHR is estimated probabilistically.

The detailed DRAM queuing delay model is in the paper.
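A small sketch of the MSHR queuing-delay estimate as written above, assuming the representative warp's per-memory-instruction request counts are known from profiling; all names are illustrative:

def mshr_queuing_delays(warp_mem_reqs, num_warps, num_mshr, avg_miss_latency):
    """warp_mem_reqs: requests generated per memory instruction of the warp,
    one entry per request position i."""
    delays = []
    cumulative_core_reqs = 0.0
    for reqs in warp_mem_reqs:
        cumulative_core_reqs += reqs * num_warps        # core-level requests so far
        occupancy = cumulative_core_reqs / num_mshr     # expected MSHR occupancy
        # An occupancy below 1 means the request finds a free MSHR entry.
        delays.append(avg_miss_latency * max(occupancy - 1.0, 0.0))
    return delays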
Bottleneck Visualization

[Figure: the CPI stack of the representative warp is transformed by modeling multithreading and modeling resource contention into the final CPI stack, with components such as issue latency, dependency stalls, DRAM latency, DRAM queuing delay, and MSHR queuing delay]
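Purely to illustrate the visualization, a CPI stack like this can be drawn as a stacked bar chart; the component values below are made up, not results from the paper:

import matplotlib.pyplot as plt

# Hypothetical CPI-stack components (cycles per instruction); illustrative only.
components = {
    "Issue latency": 1.0,
    "Dependency": 0.6,
    "DRAM latency": 1.8,
    "DRAM queuing delay": 0.9,
    "MSHR queuing delay": 1.2,
}

bottom = 0.0
for name, cpi in components.items():
    plt.bar("kernel", cpi, bottom=bottom, label=name)  # stack each component
    bottom += cpi
plt.ylabel("CPI")
plt.legend()
plt.show()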
Example of Bottleneck Visualization

[Figure: normalized CPI stacks (issue cycles, dependencies, L1 access, L2 access, DRAM access latency, MSHR queuing delay, DRAM queuing delay) for the cfd compute_flux kernel at 8, 16, 32, and 48 warps (growing register file size); the stacks expose the performance trend and guide hardware design: cache behavior stays similar across warp counts, the caches are ineffective for intervals with high memory divergence, and increasing the MSHR count pays off more than enlarging the register file]
Evaluation Methodology

Simulate a Fermi-like architecture (round-robin policy)
Compare with detailed timing simulation
Simulate 40 kernels, from low to high MPKI, from the Rodinia, Parboil, and NVIDIA SDK benchmark suites

Evaluated models:
Markov_Chain: Markov-chain based model [Chen '08]
Naive_Interval: single_warp_IPC × number of warps
MT: modeling the RR policy
MT+MSHR: modeling the RR policy + MSHR contention
GPUMech (MT+MSHR+DRAM): modeling the RR policy + MSHR contention + DRAM bandwidth
Results

[Figure: prediction error of Markov_Chain, Naive_Interval, MT, MT+MSHR, and GPUMech (MT+MSHR+DRAM) at 8, 16, 32, and 48 warps (register file size); resource contention increases with the warp count, and at low contention the additional queuing cycles are small]

GPUMech has low error across all configurations.
Average error = 13.2% (max: 35.0%); speedup = 97x.
Results for varying MSHR size and DRAM bandwidth are in the paper.
Conclusion

Proposed the first performance modeling framework to construct GPU CPI stacks to identify performance bottlenecks.

GPUMech achieves:
97x average speedup compared to detailed timing simulation
13.2% error for modeling the round-robin scheduling policy
14.0% error for modeling the greedy-then-oldest policy

Models resource contention for the MSHR and DRAM:
Essential to improving accuracy
The model can be extended to other resource components

GPUMech can be used to guide software optimizations as well as GPU hardware design-space exploration.
Thank you!
 
Slide Note

Hi, my name is Jen-Cheng Huang. Today I am going to talk about our GPU performance modeling framework, which is used to find the performance bottlenecks of a GPU architecture. This work was done with my colleague Joo Hwan Lee, Prof. Hyesoon Kim, and Prof. Hsien-Hsin Lee.


