Core-Assisted Bottleneck Acceleration in GPUs: Maximizing Resource Utilization
Imbalances in GPU execution leave resources underutilized, motivating CABA (Core-Assisted Bottleneck Acceleration). This framework enables the efficient use of helper threads in GPUs and addresses the memory bandwidth bottleneck through flexible data compression. With CABA-based data compression, helper threading improves performance by 41.7%, showing the potential of better utilizing GPU resources across a range of applications.
Presentation Transcript
A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps. Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu
Executive Summary
- Observation: Imbalances in execution leave GPU resources underutilized.
- Our Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.
- Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
- Our Solution: CABA (Core-Assisted Bottleneck Acceleration), a new framework to enable helper threading in GPUs. CABA enables flexible data compression to alleviate the memory bandwidth bottleneck and supports a wide set of use cases (e.g., prefetching, memoization).
- Key Results: Using CABA to implement data compression in memory improves performance by 41.7%.
GPUs today are used for a wide range of applications: scientific simulation, medical imaging, computer vision, and data analytics.
Challenges in GPU Efficiency
[Figure: threads, cores, the register file, and the memory hierarchy in a GPU streaming multiprocessor, with some resources full while others sit idle.]
Thread limits lead to an underutilized register file, and the memory bandwidth bottleneck leads to idle cores.
Motivation: Unutilized On-chip Memory
[Figure: percentage of unallocated registers per application.]
24% of the register file is unallocated on average; similar trends hold for on-chip scratchpad memory.
Motivation: Idle Pipelines
[Figure: breakdown of active vs. stall cycles per application.]
Memory-bound applications (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs) are idle for 67% of cycles on average; compute-bound applications (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc) are idle for 35% of cycles.
Motivation: Summary. Heterogeneous application requirements lead to bottlenecks in execution and idle resources.
Our Goal
Use idle resources to do something useful: accelerate bottlenecks using helper threads.
[Figure: helper threads making use of idle cores, register file, and memory hierarchy.]
We propose a flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA).
Helper Threads in GPUs
There is a large body of work on helper threads in CPUs [Chappell+ ISCA 99, MICRO 02], [Yang+ USC TR 98], [Dubois+ CF 04], [Zilles+ ISCA 01], [Collins+ ISCA 01, MICRO 01], [Aamodt+ HPCA 04], [Lu+ MICRO 05], [Luk+ ISCA 01], [Moshovos+ ICS 01], [Kamruzzaman+ ASPLOS 11], etc. However, GPUs pose new challenges.
Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
Managing Helper Threads in GPUs
Helper threads could be managed in hardware or in software, at the granularity of a thread, a warp, or a block. Where do we add helper threads?
Approach #1: Software-only
Pro: no hardware changes.
Cons: coarse grained; synchronization between regular threads and helper threads is difficult; not aware of runtime program behavior.
Approach #2: Hardware-only
Pros: fine-grained control over synchronization and enforcing priorities.
Con: providing contexts efficiently is difficult.
[Figure: CPU cores each provisioning a separate register file per context, contrasted with a GPU core's register file.]
CABA: An Overview
In software: tight coupling of helper threads and regular threads enables efficient context management and simpler data communication.
In hardware: decoupled management of helper threads and regular threads enables dynamic management of threads and fine-grained synchronization.
CABA: 1. In Software
Helper threads are tightly coupled to regular threads: they are simply instructions injected into the GPU pipelines, and they share the same context (e.g., registers) as the regular threads. This enables efficient context management and simpler data communication.
CABA: 2. In Hardware
Helper threads are decoupled from regular threads and tracked at the granularity of a warp, called an assist warp. Each regular (parent) warp can have multiple different assist warps (e.g., parent warp X with assist warps A and B). This enables dynamic management of threads and fine-grained synchronization.
Key Functionalities
- Triggering and squashing assist warps
- Associating events with assist warps
- Deploying active assist warps
- Scheduling instructions for execution
- Enforcing priorities, both between assist warps and parent warps and between different assist warps
CABA: Mechanism
[Figure: the GPU core pipeline (fetch, I-cache, decode, instruction buffer, scoreboard, issue, ALUs, memory, writeback) extended with the CABA structures.]
- Assist Warp Store: holds the instructions for the different assist warp routines.
- Assist Warp Controller: the central point of control, responsible for triggering assist warps, squashing them, and tracking progress for active assist warps.
- Assist Warp Buffer: stages instructions from triggered assist warps for execution and helps enforce priorities.
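As a reading aid only, the control flow these structures implement can be sketched in a few lines of C++-style pseudocode. The struct and function names below are hypothetical; the paper defines hardware, not a software API.

```cuda
// Hypothetical model of the Assist Warp Controller's control flow.
// All names are illustrative, not an interface from the paper.
struct AssistWarpEntry {
    int  routine_id;   // index of a routine in the Assist Warp Store
    int  pc;           // progress within that routine
    int  parent_warp;  // the regular warp this assist warp serves
    bool active;
};

// An event (e.g., an L1 fill of a compressed block) triggers a routine.
void trigger(AssistWarpEntry& aw, int routine_id, int parent_warp) {
    aw = {routine_id, /*pc=*/0, parent_warp, /*active=*/true};
}

// Squash, e.g., when the assist warp's work is no longer needed.
void squash(AssistWarpEntry& aw) { aw.active = false; }

// Each cycle: instructions of active assist warps are staged from the
// Assist Warp Store into the Assist Warp Buffer, and the scheduler
// arbitrates between parent and assist warps according to priority.
```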
Other Functionality
In the paper: more details on the hardware structures, data communication and synchronization, and enforcing priorities.
CABA: Applications. Data compression, memoization, and prefetching.
A Case for CABA: Data Compression
Data compression can help alleviate the memory bandwidth bottleneck: it transmits data in a more condensed form.
[Figure: compressed vs. uncompressed data moving through the memory hierarchy while compute pipelines sit idle.]
CABA employs idle compute pipelines to perform compression.
Data Compression with CABA
Use assist warps to:
- Compress cache blocks before writing to memory
- Decompress cache blocks before placing them into the cache
CABA flexibly enables various compression algorithms. Example: BDI compression [Pekhimenko+ PACT 12], which is parallelizable across the SIMT width and has low latency (see the sketch below). Others: FPC [Alameldeen+ TR 04], C-Pack [Chen+ VLSI 10].
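To make the SIMT-parallel structure concrete, here is a minimal CUDA sketch of one BDI-style pass (a 4-byte base plus 1-byte deltas) over a 128-byte block, one lane per word. Full BDI tries several base/delta sizes; the function name, output layout, and alignment assumptions here are ours, not the paper's.

```cuda
#include <cstdint>

// Minimal sketch of warp-parallel BDI compression of one 128-byte cache
// block (32 x 4-byte words), one lane per word. Real BDI tries several
// base/delta sizes; this shows only the base + 1-byte-delta case.
__device__ bool bdi_compress_block(const uint32_t* block,  // 32 input words
                                   uint8_t* out)           // >= 36 bytes,
                                                           // 4-byte aligned
{
    int lane = threadIdx.x & 31;                 // lane i handles word i
    uint32_t base = block[0];                    // first word is the base
    int64_t delta = (int64_t)block[lane] - (int64_t)base;

    // Warp-wide vote: do all 32 deltas fit in a signed byte?
    bool fits = (delta >= -128 && delta <= 127);
    if (!__all_sync(0xffffffffu, fits))
        return false;                            // block stays uncompressed

    if (lane == 0)
        *(uint32_t*)out = base;                  // 4-byte base
    out[4 + lane] = (uint8_t)(int8_t)delta;      // one 1-byte delta per lane
    return true;                                 // 128 bytes -> 36 bytes
}
```

A successful pass shrinks the 128-byte block to 36 bytes; if any delta overflows a byte, the block is left uncompressed (real BDI would fall back to other encodings).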
Walkthrough of Decompression
[Figure: on an L1D miss, the compressed block returns from L2/memory; the Assist Warp Controller triggers the decompression routine from the Assist Warp Store, the scheduler runs it on the cores, and the decompressed block is placed in the L1D, where subsequent accesses hit.]
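The matching decompression step that an assist warp would run on a fill is simpler still; this sketch uses the same hypothetical 36-byte layout as the compression sketch above.

```cuda
// Warp-parallel decompression for the layout sketched above:
// each lane reconstructs one 4-byte word from the base and its delta.
__device__ void bdi_decompress_block(const uint8_t* in,   // base + deltas
                                     uint32_t* block)     // 32 output words
{
    int lane = threadIdx.x & 31;
    uint32_t base = *(const uint32_t*)in;           // recover the 4-byte base
    int8_t delta = (int8_t)in[4 + lane];            // this lane's 1-byte delta
    block[lane] = base + (uint32_t)(int32_t)delta;  // rebuild one word
}
```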
Walkthrough of Compression
[Figure: on a writeback from the L1D, the Assist Warp Controller triggers the compression routine from the Assist Warp Store; the scheduler runs it on the cores, and the compressed block is written to L2/memory.]
Methodology
Simulator: GPGPU-Sim, GPUWattch
Workloads: Lonestar, Rodinia, MapReduce, CUDA SDK
System parameters:
- 15 SMs, 32 threads/warp
- 48 warps/SM, 32,768 registers, 32KB shared memory
- Core: 1.4GHz, GTO scheduler, 2 schedulers/SM
- Memory: 177.4GB/s bandwidth, 6 GDDR5 memory controllers, FR-FCFS scheduling
- Cache: L1: 16KB, 4-way associative; L2: 768KB, 16-way associative
Metrics:
- Performance: instructions per cycle (IPC)
- Bandwidth consumption: fraction of cycles the DRAM data bus is busy
Effect on Performance
[Figure: normalized performance for CABA-BDI and No-Overhead-BDI.]
CABA provides a 41.7% performance improvement and achieves performance close to that of a design with no overhead for compression.
Effect on Bandwidth Consumption
[Figure: memory bandwidth consumption for the baseline vs. CABA-BDI.]
Data compression with CABA alleviates the memory bandwidth bottleneck.
Different Compression Algorithms
[Figure: normalized performance for CABA-FPC, CABA-BDI, CABA-CPack, and CABA-BestOfAll.]
CABA is flexible: it improves performance with different compression algorithms.
Other Results
CABA's performance is similar to that of pure hardware-based BDI compression. CABA reduces overall system energy by 22% by decreasing off-chip memory traffic. Other evaluations in the paper: compression ratios, sensitivity to memory bandwidth, capacity compression, and compression at different levels of the hierarchy.
Conclusion
- Observation: Imbalances in execution leave GPU resources underutilized.
- Our Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.
- Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
- Our Solution: CABA (Core-Assisted Bottleneck Acceleration), a new framework to enable helper threading in GPUs. CABA enables flexible data compression to alleviate the memory bandwidth bottleneck and supports a wide set of use cases (e.g., prefetching, memoization).
- Key Results: Using CABA to implement data compression in memory improves performance by 41.7%.
Backup Slides
Effect on Energy
[Figure: normalized energy for CABA-BDI, Ideal-BDI, HW-BDI-Mem, and HW-BDI.]
CABA reduces overall system energy by decreasing off-chip memory traffic.
Other Uses of CABA
Hardware memoization:
- Goal: avoid redundant computation by reusing previous results over the same/similar inputs.
- Idea: hash the inputs at predefined points; use the load/store pipelines to save inputs in shared memory; eliminate redundant computation by loading stored results (see the sketch below).
Prefetching: similar to helper-thread prefetching in CPUs.
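To illustrate the memoization idea (this is our illustrative CUDA sketch, not the paper's hardware design): hash the input, probe a shared-memory table through the load/store pipelines, and reuse a stored result on a hit. The hash, table size, `expensive()`, and the absence of synchronization between lanes are all simplifying assumptions.

```cuda
// Illustrative sketch of memoization in CUDA: hash the input at a
// predefined point, keep results in a shared-memory table, and load a
// stored result instead of recomputing. Races between lanes are ignored.
__device__ float expensive(float x);             // computation to memoize

struct MemoEntry { unsigned key; float value; bool valid; };

__device__ float memoized(float x, MemoEntry* table /* shared, 256 slots */)
{
    unsigned key  = __float_as_uint(x);          // exact-input match
    unsigned slot = (key * 2654435761u) & 255u;  // simple multiplicative hash

    MemoEntry e = table[slot];
    if (e.valid && e.key == key)
        return e.value;                          // hit: reuse previous result

    float r = expensive(x);                      // miss: compute once
    table[slot] = {key, r, true};                // store via load/store pipes
    return r;
}
```

A caller would declare the table once per thread block, e.g. `__shared__ MemoEntry table[256];`, and clear the `valid` flags before first use.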