Advanced GPU Performance Modeling Techniques

Explore techniques in GPU performance modeling, including interval analysis, identification of resource contention, detailed timing simulation, and the trade-off between accuracy and efficiency. Learn how functional simulation and analytical modeling can be combined to pinpoint performance bottlenecks, and how warp profiling, warp selection, and CPI-stack visualization support GPU performance analysis.



Presentation Transcript


  1. GPUMech: GPU Performance Modeling Technique Based on Interval Analysis. Jen-Cheng Huang, Joo Hwan Lee, Hyesoon Kim, Hsien-Hsin S. Lee

  2. Motivation. How do we find the performance bottlenecks of a given hardware configuration? (Chart: performance versus register file size, with resource contention limiting scaling.)

  3. Motivation. Detailed timing simulation is slow: the measured simulation slowdown is roughly 80,000x compared to an NVIDIA Quadro processor [Huang 14], and it gets worse as more cores are integrated. Prior GPGPU analytical models are fast but have high error and low resolution on hardware bottlenecks; they mainly predict performance trends. Can we have high modeling speed without sacrificing much accuracy?

  4. GPUMech balances accuracy and efficiency by leveraging both functional simulation and analytical modeling: functional simulation captures the full execution behavior of a warp, and analytical modeling estimates the GPU's performance. GPUMech identifies the sources of performance bottlenecks and visualizes them through a CPI stack.

  5. Outline: Motivation, GPUMech, Evaluation, Conclusion.

  6. GPUMech consists of four steps: 1. Warp Profiling, 2. Warp Selection, 3. Performance Modeling, 4. Bottleneck Visualization.

  7. GPUMech overview. (Diagram: a GPU kernel is run through functional simulation, producing an execution timeline for each warp #1 through #N; the four steps, 1. Warp Profiling, 2. Warp Selection, 3. Performance Modeling, 4. Bottleneck Visualization, consume this profile.)

  8. Warp Profiling. Warp profiling collects the interval information (execution behavior) of a warp over its entire execution: runs of instructions issued at the issue rate, separated by stall events such as compute dependency stalls and L2 miss stalls (memory dependencies). The memory stall events are captured by cache simulation. The interval concept is inspired by earlier single-threaded CPU modeling [Karkhanis 04], [Eyerman 09]. (Diagram: a warp's issue timeline split into Interval 1, Interval 2, and so on, each ending in a stall.)
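A minimal sketch (not part of the original slides) of how a per-warp interval profile from functional simulation could be represented; the class and field names here (Interval, WarpProfile, stall_reason) are illustrative assumptions rather than GPUMech's actual data structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Interval:
    """One interval: a run of issued instructions ended by a stall event."""
    insts: int            # instructions issued in this interval
    compute_cycles: int   # cycles spent issuing (insts / issue_rate)
    stall_cycles: int     # cycles stalled at the end of the interval
    stall_reason: str     # e.g. "dependency" or "L2_miss" (from cache simulation)

@dataclass
class WarpProfile:
    """Full execution profile of a single warp from functional simulation."""
    warp_id: int
    intervals: List[Interval]

    @property
    def total_insts(self) -> int:
        return sum(iv.insts for iv in self.intervals)

    @property
    def total_cycles(self) -> int:
        return sum(iv.compute_cycles + iv.stall_cycles for iv in self.intervals)
```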

  9. GPUMech pipeline: Warp Profiling, Warp Selection, Performance Modeling, Bottleneck Visualization. (This part focuses on Warp Selection.)

  10. Warp Selection. Select a representative warp for modeling the performance of multithreading; this choice is crucial for keeping modeling errors low. Although most warps show similar intervals, some warps may have different intervals because control-flow divergence sends them down a different execution path. (Diagram: warps W1 through W5 and their interval sequences over time and iteration count; most share intervals A, B, and C, one warp diverges, and one warp is marked as the representative.)

  11. Warp Selection. Approach: use the K-means clustering algorithm to select the warp. Each warp is described by a two-dimensional vector of its instruction count and its single-warp performance, the warps are grouped into two clusters, and the warp closest to the center of the largest cluster becomes the representative warp, as sketched below. (Diagram: the same warps and intervals as the previous slide, with the representative warp highlighted.)
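A minimal sketch of the selection step, assuming scikit-learn's KMeans and that each warp is summarized as a (warp_id, instruction count, single-warp performance) tuple; the function name and input layout are assumptions of this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_warp(profiles):
    """Pick the warp nearest the center of the larger of two K-means clusters."""
    ids = [p[0] for p in profiles]
    X = np.array([[p[1], p[2]] for p in profiles], dtype=float)  # (insts, perf)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    largest = np.argmax(np.bincount(km.labels_))   # the common execution path
    members = np.where(km.labels_ == largest)[0]

    # The member closest to that cluster's centroid represents the typical warp.
    dists = np.linalg.norm(X[members] - km.cluster_centers_[largest], axis=1)
    return ids[members[np.argmin(dists)]]
```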

  12. GPUMech pipeline: Performance Modeling takes the stall cycles, the scheduling policy, queuing delays, and resource contention as inputs and has two parts, modeling multithreading and modeling resource contention.

  13. Modeling Multithreading. Model the scheduling policy to obtain the performance of multithreaded execution from the representative warp, assuming the N warps execute the same intervals. Two policies are modeled, Round-Robin and Greedy-Then-Oldest, yielding the performance of the multithreaded hardware with N warps.

  14. Modeling Round-Robin Policy. Round-robin is a popular baseline scheduling policy for GPU architectures. The model works on the intervals of a warp: instructions (including memory instructions) issued at the SM's issue rate, followed by stall cycles. Running example: 4 warps on an SM with an issue rate of 1 instruction/cycle.

  15. Modeling Round-Robin Policy. When W1 stalls, instructions from W2, W3, and W4 are issued during its stall cycles. Instructions that fit within the stall window are hiding instructions; those that spill past it are non-hiding instructions and add issue cycles. (Diagram: the stall cycles of W1 through W4 overlapped on a timeline, with the worked example's cycle counts.)

  16. Modeling Round-Robin Policy.
CPI = (total_cycles_singlewarp + total_nonhiding_insts / issue_rate) / (#warps × total_warp_insts_singlewarp)
The additional issue cycles come from the non-hiding instructions. Per interval, #nonhiding_insts = #waiting_slots × issue_prob × (#warps - 1), where a waiting slot is the time period between scheduling two instructions and the issue probability is the probability of issuing an instruction when a warp is scheduled.
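A minimal sketch of the round-robin CPI computation described above; the per-interval tuple layout (instructions, stall cycles, waiting slots, issue probability) and the function name are assumptions made for illustration.

```python
def cpi_round_robin(intervals, n_warps, issue_rate=1.0):
    """CPI under round-robin multithreading, following the slide's formula.

    `intervals` holds (insts, stall_cycles, waiting_slots, issue_prob) tuples
    for the representative warp.
    """
    single_warp_cycles = 0.0
    single_warp_insts = 0
    nonhiding_insts = 0.0

    for insts, stall_cycles, waiting_slots, issue_prob in intervals:
        single_warp_cycles += insts / issue_rate + stall_cycles
        single_warp_insts += insts
        # Non-hiding instructions from the other (n_warps - 1) warps, issued in
        # this interval's waiting slots; they add extra issue cycles.
        nonhiding_insts += waiting_slots * issue_prob * (n_warps - 1)

    total_cycles = single_warp_cycles + nonhiding_insts / issue_rate
    total_insts = n_warps * single_warp_insts
    return total_cycles / total_insts
```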

  17. Modeling Greedy-Then-Oldest Policy. The scheduler issues from one warp until it stalls, then selects the oldest ready warp. (Diagram: W1 stalls and the other warps issue during its stall cycles, again split into hiding and non-hiding instructions.)

  18. Modeling Greedy-Then-Oldest Policy.
CPI = (total_cycles_singlewarp + total_nonhiding_insts / issue_rate) / (#warps × total_warp_insts_singlewarp)
Per interval, #nonhiding_insts = max(issue_insts_in_stall - num_stall_cycles, 0), where issue_insts_in_stall = avg_interval_insts × #issue_warps_in_stall and #issue_warps_in_stall = issue_prob_in_stall × (#warps - 1).
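A minimal sketch of the per-interval non-hiding-instruction estimate for greedy-then-oldest using the definitions above; the clamp at zero and the explicit issue_rate scaling are assumptions of this sketch.

```python
def nonhiding_insts_gto(avg_interval_insts, stall_cycles, issue_prob_in_stall,
                        n_warps, issue_rate=1.0):
    """Per-interval non-hiding instructions under greedy-then-oldest."""
    issue_warps_in_stall = issue_prob_in_stall * (n_warps - 1)
    issue_insts_in_stall = avg_interval_insts * issue_warps_in_stall
    # Instructions that fit inside the stall window are hidden; only the
    # excess spills over into extra issue cycles.
    return max(issue_insts_in_stall - stall_cycles * issue_rate, 0.0)
```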

  19. GPUMech pipeline: Performance Modeling, continued. (This part focuses on modeling resource contention, i.e., queuing delays.)

  20. Modeling Resource Contention. Model resource contention to obtain the additional stall cycles caused by queuing delays. In GPGPU workloads, memory divergence is the main source of contention in the memory system: a single instruction can access up to 32 cache blocks, yet a GPU has a very limited number of MSHR entries (NVIDIA Fermi, for example, has 64 MSHR entries shared by 48 warps [Nugteren 14]). Memory divergence can therefore congest the MSHR entries and incur additional stall cycles.

  21. Example of MSHR Contention: coalesced access. Each warp's access coalesces into one memory request, so four warps need only four MSHR entries. In a typical GPU design the number of MSHR entries is sufficient when accesses are well coalesced, and the delay per miss is the average miss latency (queuing_delay = avg_miss_latency). (Diagram: warps 1 through 4 each placing one request, Req1, into the MSHRs during their stall cycles.)

  22. Example of MSHR Contention: uncoalesced access. Each warp now issues 2 memory requests, so the MSHR entries fill up and W3 and W4 incur additional stall cycles while waiting for free entries, on the order of the average miss latency (queuing_delay = avg_miss_latency). The question the model must answer: how much queuing delay does a warp incur? (Diagram: requests Req1 and Req2 from warps 1 through 4 contending for MSHR entries; bar chart of stall cycles per warp ID.)

  23. Modeling of MSHR Contention.
exp_queuing_delay_i = avg_miss_latency × (exp_occupancy_i - 1)
exp_occupancy_i = (Σ_{j=1..i} #core_reqs_j) / #MSHR
#core_reqs_i = #warp_mem_req_i × #warps
Here i is the probabilistic position of a memory request. If the expected occupancy is less than 1, there is no queuing delay; as the request count increases, the occupancy and hence the queuing delay grow. The detailed DRAM queuing-delay model is in the paper. (Chart: stall cycles per warp ID.)
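A minimal sketch of the expected MSHR queuing-delay computation above; the input layout (core requests per memory-request position) and the clamping of sub-unit occupancy to zero delay are assumptions of this illustration.

```python
def mshr_queuing_delays(core_reqs_by_position, n_mshr, avg_miss_latency):
    """Expected queuing delay for each memory-request position i.

    core_reqs_by_position[i] is #core_reqs_i = #warp_mem_req_i * #warps from
    the slide; the running sum over positions 1..i gives the expected MSHR
    occupancy, and occupancy below 1 means no queuing delay.
    """
    delays = []
    running_reqs = 0.0
    for core_reqs in core_reqs_by_position:
        running_reqs += core_reqs
        exp_occupancy = running_reqs / n_mshr
        delays.append(avg_miss_latency * max(exp_occupancy - 1.0, 0.0))
    return delays
```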

  24. GPUMech pipeline: from the stall cycles, queuing delays, TLP, and hardware configurations, Performance Modeling produces a CPI estimate for each configuration, which Bottleneck Visualization then breaks down.

  25. Bottleneck Visualization. The CPI stack of the representative warp breaks its cycles into components: issue latency and dependency stalls come from modeling multithreading, while the MSHR queuing delay, DRAM queuing delay, and DRAM latency come from modeling resource contention.
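A minimal sketch of assembling the components named above into a CPI stack; the component names follow the slide, while the function itself and the normalization step are assumptions for illustration.

```python
def build_cpi_stack(issue_latency, dependency, mshr_queue, dram_queue, dram_latency):
    """Return the CPI stack of the representative warp and its normalized shares."""
    stack = {
        "Issue latency": issue_latency,        # from modeling multithreading
        "Dependency": dependency,              # from modeling multithreading
        "MSHR queuing delay": mshr_queue,      # from modeling resource contention
        "DRAM queuing delay": dram_queue,      # from modeling resource contention
        "DRAM latency": dram_latency,          # from modeling resource contention
    }
    total = sum(stack.values())
    shares = {name: cpi / total for name, cpi in stack.items()}
    return stack, shares
```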

  26. Example of Bottleneck Visualization: the compute_flux kernel of cfd. The normalized CPI stack (issue cycles, dependencies, L1 access, L2 access, DRAM access latency, MSHR queuing delay, DRAM queuing delay) is plotted for 8, 16, 32, and 48 warps, i.e., increasing register file size, to show the performance trend and guide hardware design. Cache behavior stays similar across configurations and the caches are ineffective for this kernel, while intervals with high memory divergence make the MSHR and DRAM queuing delays grow with the number of warps. Conclusion for this kernel: increase the MSHR size instead of the register file.

  27. Outline: Motivation, GPUMech, Evaluation, Conclusion.

  28. Evaluation Methodology. Simulate a Fermi-like architecture with the round-robin policy and compare against detailed timing simulation, using 40 kernels ranging from low to high MPKI from the Rodinia, Parboil, and NVIDIA SDK benchmark suites. Evaluated models:
- Markov_Chain: Markov-chain based model [Chen 08]
- Naive_Interval: single-warp IPC × number of warps
- MT: modeling the RR policy
- MT+MSHR: modeling the RR policy + MSHR contention
- GPUMech (MT+MSHR+DRAM): modeling the RR policy + MSHR contention + DRAM bandwidth

  29. Results. Error versus the number of warps (8, 16, 32, 48, i.e., register file size) for each model: GPUMech has low error across all configurations, with an average error of 13.2% (max 35.0%) and a 97x speedup over timing simulation, while the simpler models do well only where contention and the additional cycles it causes are low. Results for varying MSHR size and DRAM bandwidth are in the paper. (Chart: error from 0% to 100% for Markov_Chain, Naive_Interval, MT, MT+MSHR, and GPUMech (MT+MSHR+DRAM).)

  30. Outline: Motivation, GPUMech, Evaluation, Conclusion.

  31. Conclusion. GPUMech is the first performance modeling framework that constructs GPU CPI stacks to identify performance bottlenecks. It achieves an average 97x speedup over timing simulation, with 13.2% error when modeling the round-robin scheduling policy and 14.0% error when modeling the greedy-then-oldest policy. Modeling resource contention for the MSHRs and DRAM is essential to this accuracy, and the model can be extended to other resource components. GPUMech can guide software optimizations as well as GPU hardware design-space exploration.

  32. Thank you!
