Core-Assisted Bottleneck Acceleration in GPUs: Maximizing Resource Utilization


Imbalances in GPU execution leave resources underutilized, prompting the need for a solution like CABA (Core-Assisted Bottleneck Acceleration). This framework enables the efficient use of helper threads in GPUs and addresses the memory bandwidth bottleneck through flexible data compression. By leveraging helper threads for data compression, CABA improves performance by 41.7%, showcasing the potential of better GPU resource utilization across a variety of applications.





Presentation Transcript


  1. A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps. Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu

  2. Executive Summary. Observation: imbalances in execution leave GPU resources underutilized. Our goal: employ underutilized GPU resources to do something useful, namely to accelerate bottlenecks using helper threads. Challenge: how do you efficiently manage and use helper threads in a throughput-oriented architecture? Our solution: CABA (Core-Assisted Bottleneck Acceleration), a new framework that enables helper threading in GPUs. CABA enables flexible data compression to alleviate the memory bandwidth bottleneck and supports a wide set of use cases (e.g., prefetching, memoization). Key results: using CABA to implement data compression in memory improves performance by 41.7%.

  3. GPUs today are used for a wide range of applications: scientific simulation, medical imaging, computer vision, and data analytics.

  4. Challenges in GPU Efficiency. [Diagram: a GPU streaming multiprocessor with threads, cores, a register file, and the memory hierarchy; some resources sit idle while others are full.] Thread limits lead to an underutilized register file, and the memory bandwidth bottleneck leads to idle cores.

  5. Motivation: Unutilized On-chip Memory. [Chart: percentage of unallocated registers per application, 0% to 100%.] 24% of the register file is unallocated on average; similar trends hold for on-chip scratchpad memory.

  6. Motivation: Idle Pipelines. [Charts: breakdown of active vs. stall cycles per application.] Memory-bound applications (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs) are idle for 67% of cycles on average, while compute-bound applications (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc) are idle for 35% of cycles.

  7. Motivation: Summary. Heterogeneous application requirements lead to bottlenecks in execution and idle resources.

  8. Our Goal. Use idle resources to do something useful: accelerate bottlenecks using helper threads. [Diagram: helper threads occupying the idle cores, register file, and memory hierarchy.] We propose a flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA).

  9. Helper Threads in GPUs. There is a large body of work on helper threading in CPUs [Chappell+ ISCA'99, MICRO'02], [Yang+ USC TR'98], [Dubois+ CF'04], [Zilles+ ISCA'01], [Collins+ ISCA'01, MICRO'01], [Aamodt+ HPCA'04], [Lu+ MICRO'05], [Luk+ ISCA'01], [Moshovos+ ICS'01], [Kamruzzaman+ ASPLOS'11], etc. However, GPUs present new challenges.

  10. Challenge. How do you efficiently manage and use helper threads in a throughput-oriented architecture?

  11. Managing Helper Threads in GPUs. [Diagram: the design space spanning thread, warp, and block granularities, implemented in hardware or in software.] Where do we add helper threads?

  12. Approach #1: Software-only. No hardware changes are needed, but the approach is coarse-grained: synchronization between regular threads and helper threads is difficult, and the helper threads are not aware of runtime program behavior.

  13. Where Do We Add Helper Threads? [Diagram: thread, warp, and block granularities in hardware vs. software.]

  14. Approach #2: Hardware-only. Fine-grained control over synchronization and priority enforcement, but providing contexts efficiently is difficult. [Diagram: hardware contexts (cores, warps, and register files) on a CPU vs. a GPU.]

  15. CABA: An Overview. In software, helper threads are tightly coupled to regular threads, which gives efficient context management and simpler data communication. In hardware, helper threads are managed separately from regular threads, which enables dynamic management of threads and fine-grained synchronization.

  16. CABA: 1. In Software. Helper threads are tightly coupled to regular threads: they are simply instructions injected into the GPU pipelines, and they share the same context (e.g., registers) as the regular threads in a block. This provides efficient context management and simpler data communication.

  17. CABA: 2. In Hardware. Helper threads are decoupled from regular threads and tracked at the granularity of a warp: an assist warp. Each regular (parent) warp can have different assist warps (e.g., parent warp X with assist warps A and B). This enables dynamic management of threads and fine-grained synchronization.

  18. Key Functionalities. Triggering and squashing assist warps; associating events with assist warps; deploying active assist warps; scheduling instructions for execution; and enforcing priorities, both between assist warps and parent warps and between different assist warps.

  19. CABA: Mechanism. [Diagram: the CABA hardware structures attached to the GPU pipeline (fetch, I-cache, decode, instruction buffer, scoreboard, issue, ALUs, memory, writeback).] The Assist Warp Store holds the instructions for the different assist warp routines. The Assist Warp Controller is the central point of control for triggering and squashing assist warps, and it tracks the progress of active assist warps. The Assist Warp Buffer stages instructions from triggered assist warps for execution and helps enforce priorities.
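To make the division of labor concrete, here is a minimal sketch of the bookkeeping state the slide describes, written as plain C++ structs. All sizes, field names, and the routine-table layout are illustrative assumptions, not details from the paper.

```cuda
// Hypothetical sketch (not the paper's design) of the per-SM bookkeeping:
// the Assist Warp Store holds the routines, and the Assist Warp Controller
// tracks which assist warps are live and how far each has progressed.
#include <cstdint>

constexpr int MAX_ROUTINES   = 4;   // e.g., compress, decompress, prefetch (assumed)
constexpr int ROUTINE_LENGTH = 32;  // instructions per routine (assumed)
constexpr int MAX_LIVE_WARPS = 8;   // active assist warps per SM (assumed)

struct AssistWarpStore {
    // Pre-decoded instructions for each assist warp routine.
    uint64_t routine[MAX_ROUTINES][ROUTINE_LENGTH];
};

struct AssistWarpEntry {
    uint8_t routine_id;   // which routine this assist warp runs
    uint8_t parent_warp;  // the parent warp it assists (for priorities/squash)
    uint8_t next_inst;    // progress: next instruction to stage for execution
    bool    active;
};

struct AssistWarpController {
    // Assist warps are tracked at warp granularity; a trigger event
    // allocates an entry, and squashing a parent clears its entries.
    AssistWarpEntry live[MAX_LIVE_WARPS];
};
```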

  20. Other Functionality. In the paper: more details on the hardware structures, data communication and synchronization, and enforcing priorities.

  21. CABA: Applications. Data compression, memoization, and prefetching.

  22. A Case for CABA: Data Compression. Data compression can help alleviate the memory bandwidth bottleneck because it transmits data in a more condensed form. [Diagram: compressed vs. uncompressed data moving through the memory hierarchy while compute pipelines sit idle.] CABA employs the idle compute pipelines to perform compression.

  23. Data Compression with CABA. Use assist warps to compress cache blocks before writing them to memory and to decompress cache blocks before placing them into the cache. CABA flexibly enables various compression algorithms. Example: BDI compression [Pekhimenko+ PACT'12], which is parallelizable across the SIMT width and has low latency. Others: FPC [Alameldeen+ TR'04] and C-Pack [Chen+ VLSI'10].
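To illustrate why BDI maps well onto the SIMT width, here is a minimal CUDA sketch of one BDI configuration (a 4-byte base with 1-byte deltas) for a 128-byte cache block, one word per lane. The function name and the single fixed configuration are simplifying assumptions; real BDI tries several base/delta sizes.

```cuda
// Hypothetical sketch of one BDI configuration: 32 lanes each take one
// 4-byte word of a 128-byte block and test whether it can be encoded as
// a 1-byte delta from a common base. Not the paper's actual routine.
#include <cstdint>

__device__ bool compress_block_bdi(const uint32_t* block,  // 32 words = 128 B
                                   uint32_t* base_out,     // 4 B base
                                   int8_t* deltas_out)     // 32 B of deltas
{
    int lane = threadIdx.x & 31;                        // lane id within the warp
    uint32_t word = block[lane];                        // one word per lane
    uint32_t base = __shfl_sync(0xffffffffu, word, 0);  // lane 0's word is the base

    long long delta = (long long)word - (long long)base;
    bool fits = (delta >= -128 && delta <= 127);        // fits in 1 byte?

    // The block is compressible only if every lane's delta fits:
    // a single warp vote decides.
    if (!__all_sync(0xffffffffu, fits))
        return false;                                   // store uncompressed

    deltas_out[lane] = (int8_t)delta;                   // 128 B -> 36 B
    if (lane == 0) *base_out = base;
    return true;
}
```

Because the test and the encoding amount to a handful of arithmetic instructions plus one warp vote, a routine like this fits comfortably in a short assist warp.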

  24. Walkthrough of Decompression. [Diagram: a load that hits in the L1D is served directly; on a miss, the compressed block returned from L2/memory triggers a decompression assist warp through the Assist Warp Controller, Assist Warp Store, and scheduler before the data is handed to the cores.]
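The decompression side of the same hypothetical sketch is even smaller, which is consistent with putting it on the cache-fill path: each lane reconstructs its word with a single add.

```cuda
// Companion sketch to the compression routine above: each lane adds its
// 1-byte delta back onto the shared base to recover its 4-byte word.
// Illustrative only; not the paper's actual assist warp code.
__device__ void decompress_block_bdi(uint32_t base,
                                     const int8_t* deltas,
                                     uint32_t* block_out)
{
    int lane = threadIdx.x & 31;
    block_out[lane] = base + (uint32_t)(int32_t)deltas[lane];
}
```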

  25. Walkthrough of Compression. [Diagram: on a writeback from the L1D, the Assist Warp Controller triggers a compression assist warp from the Assist Warp Store; the scheduler runs it on the cores, and the compressed block is then written to L2/memory.]

  26. Evaluation

  27. Methodology. Simulators: GPGPU-Sim, GPUWattch. Workloads: Lonestar, Rodinia, MapReduce, CUDA SDK. System parameters: 15 SMs, 32 threads/warp, 48 warps/SM, 32768 registers, and 32KB shared memory per SM; cores at 1.4GHz with a GTO scheduler and 2 schedulers/SM; memory with 177.4GB/s bandwidth, 6 GDDR5 memory controllers, and FR-FCFS scheduling; caches: L1 16KB 4-way associative, L2 768KB 16-way associative. Metrics: performance as instructions per cycle (IPC), and bandwidth consumption as the fraction of cycles the DRAM data bus is busy.

  28. Effect on Performance. [Chart: performance of CABA-BDI and No-Overhead-BDI, normalized to the baseline, per application.] CABA provides a 41.7% performance improvement and achieves performance close to that of a design with no overhead for compression.

  29. Effect on Bandwidth Consumption. [Chart: memory bandwidth consumption of the baseline vs. CABA-BDI, per application.] Data compression with CABA alleviates the memory bandwidth bottleneck.

  30. Different Compression Algorithms. [Chart: normalized performance of CABA-FPC, CABA-BDI, CABA-CPack, and CABA-BestOfAll.] CABA is flexible: it improves performance with different compression algorithms.

  31. Other Results. CABA's performance is similar to that of pure hardware-based BDI compression. CABA reduces overall system energy by 22% by decreasing off-chip memory traffic. Other evaluations: compression ratios, sensitivity to memory bandwidth, capacity compression, and compression at different levels of the hierarchy.

  32. Conclusion. Observation: imbalances in execution leave GPU resources underutilized. Our goal: employ underutilized GPU resources to do something useful, namely to accelerate bottlenecks using helper threads. Challenge: how do you efficiently manage and use helper threads in a throughput-oriented architecture? Our solution: CABA (Core-Assisted Bottleneck Acceleration), a new framework that enables helper threading in GPUs. CABA enables flexible data compression to alleviate the memory bandwidth bottleneck and supports a wide set of use cases (e.g., prefetching, memoization). Key results: using CABA to implement data compression in memory improves performance by 41.7%.

  33. A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps. Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu

  34. Backup Slides

  35. Effect on Energy. [Chart: normalized energy of CABA-BDI, Ideal-BDI, HW-BDI-Mem, and HW-BDI, per application.] CABA reduces overall system energy by decreasing off-chip memory traffic.

  36. Effect on Compression Ratio. [Chart: compression ratios per application.]

  37. Other Uses of CABA. Hardware memoization: the goal is to avoid redundant computation by reusing previous results for the same or similar inputs. The idea is to hash the inputs at predefined points, use the load/store pipelines to save inputs in shared memory, and eliminate redundant computation by loading stored results (see the sketch below). Prefetching: similar to helper-thread prefetching in CPUs.
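A minimal CUDA sketch of that memoization idea, assuming a per-block shared-memory table, a simple multiplicative hash, and an illustrative expensive_fn; none of these names or layouts come from the paper, and the sketch ignores concurrent-update races for simplicity.

```cuda
// Hypothetical sketch of memoization expressed in software: hash the input,
// probe a small shared-memory table, and skip the computation on a hit.
// Illustrative only; not the paper's mechanism.
#include <cstdint>

constexpr int TABLE_SIZE = 256;

struct MemoEntry { uint32_t key; float value; };

__device__ float expensive_fn(float x);  // assumed pure, costly computation

__device__ float memoized(float x, MemoEntry* table)  // table in __shared__
{
    uint32_t bits = __float_as_uint(x);
    uint32_t slot = (bits * 2654435761u) >> 24;  // top 8 bits -> 0..255

    // Assumes the table was pre-filled with a sentinel key no input hashes to.
    if (table[slot].key == bits)
        return table[slot].value;                // hit: reuse the stored result

    float y = expensive_fn(x);                   // miss: compute...
    table[slot].key = bits;                      // ...and remember it
    table[slot].value = y;                       // (update races ignored here)
    return y;
}
```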
