A Case for Core-Assisted Bottleneck Acceleration in GPUs:
Enabling Flexible Data Compression with Assist Warps

Nandita Vijaykumar
Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick,
Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir,
Todd C. Mowry, Onur Mutlu
Executive Summary

Observation: Imbalances in execution leave GPU resources underutilized.
Our Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.
Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
- A new framework to enable helper threading in GPUs
- Enables flexible data compression to alleviate the memory bandwidth bottleneck
- A wide set of use cases (e.g., prefetching, memoization)
Key Results: Using CABA to implement data compression in memory improves performance by 41.7%.
GPUs today are used for a wide range of applications: computer vision, data analytics, scientific simulation, and medical imaging.
Challenges in GPU Efficiency

[Figure: a GPU streaming multiprocessor with threads, cores, a register file, and the memory hierarchy]

Thread limits lead to an underutilized register file.
The memory bandwidth bottleneck leads to idle cores.
Motivation: Unutilized On-chip Memory

24% of the register file is unallocated on average.
Similar trends hold for on-chip scratchpad memory.
Motivation: Idle Pipelines

Memory-bound applications: 67% of cycles idle.
Compute-bound applications: 35% of cycles idle.
Motivation: Summary

Heterogeneous application requirements lead to:
- Bottlenecks in execution
- Idle resources
Our Goal

Use idle resources (cores, register file, memory hierarchy) to do something useful: accelerate bottlenecks using helper threads.
A flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA).
Helper Threads in GPUs

Large body of work in CPUs: [Chappell+ ISCA '99, MICRO '02], [Yang+ USC TR '98], [Dubois+ CF '04], [Zilles+ ISCA '01], [Collins+ ISCA '01, MICRO '01], [Aamodt+ HPCA '04], [Lu+ MICRO '05], [Luk+ ISCA '01], [Moshovos+ ICS '01], [Kamruzzaman+ ASPLOS '11], etc.
However, there are new challenges with GPUs.
 
Challenge

How do you efficiently manage and use helper threads in a throughput-oriented architecture?
Managing Helper Threads in GPUs

Helper threads could be added at the thread, warp, or block granularity, and managed in software or in hardware. Where do we add helper threads?
Approach #1: Software-only

Pro: no hardware changes.
Cons: coarse grained; not aware of runtime program behavior; synchronization is difficult.
Where Do We Add Helper Threads?

Software-only management (Approach #1) vs. hardware management at the thread, warp, or block granularity (Approach #2).
Approach #2: Hardware-only

Pros: fine-grained control; synchronization; enforcing priorities.
Con: providing contexts efficiently is difficult. Unlike a CPU core, which holds a small number of thread contexts in small register files, a GPU core multiplexes its large register file across many warps, so dedicating additional contexts to helper threads is expensive.
CABA: An Overview

In software: "tight coupling" of helper threads and regular threads
- Efficient context management
- Simpler data communication

In hardware: "decoupled management" of helper threads and regular threads
- Dynamic management of threads
- Fine-grained synchronization
CABA: 1. In Software

Helper threads are tightly coupled to regular threads:
- Simply instructions injected into the GPU pipelines
- Share the same context (registers) as the regular threads
This gives efficient context management and simpler data communication.
CABA: 2. In Hardware

Helper threads are decoupled from regular threads:
- Tracked at the granularity of a warp: the Assist Warp
- Each regular (parent) warp can have different assist warps
This gives dynamic management of threads and fine-grained synchronization.
Key Functionalities

Triggering and squashing assist warps: associating events with assist warps.
Deploying active assist warps: scheduling their instructions for execution.
Enforcing priorities: between assist warps and parent warps, and between different assist warps.
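To make these functionalities concrete, here is a minimal simulator-style sketch of the event-association, trigger, and squash interface. It assumes a small table mapping events to routines in the Assist Warp Store; every name here (Event, RoutineId, the AssistWarpController methods) is an illustrative assumption, not the paper's actual interface.

```cpp
// Minimal sketch of associating events with assist warp routines,
// assuming a small trigger table indexed by event type. All names
// here are illustrative, not the actual CABA interface.
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class Event { L1MissCompressed, WritebackToL2, MemoizationPoint };

using RoutineId = uint32_t;   // index into the Assist Warp Store

class AssistWarpController {
public:
    void associate(Event e, RoutineId r) { table_[e] = r; }

    // Called by the pipeline when an event fires for a parent warp:
    // deploy the routine as an active assist warp tied to that parent.
    void trigger(Event e, uint32_t parent_warp) {
        auto it = table_.find(e);
        if (it != table_.end())
            active_.push_back({it->second, parent_warp, /*pc=*/0});
    }

    // Squash all assist warps of a parent (e.g., the parent warp exits).
    void squash(uint32_t parent_warp) {
        std::erase_if(active_, [&](const Active& a) {
            return a.parent == parent_warp;
        });
    }

private:
    struct Active { RoutineId routine; uint32_t parent; uint32_t pc; };
    std::unordered_map<Event, RoutineId> table_;  // event -> routine
    std::vector<Active> active_;                  // progress tracking
};
```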
 
CABA: Mechanism

CABA adds three structures alongside the existing pipeline (fetch, I-cache, decode, instruction buffer, scoreboard, issue, ALUs, memory unit, writeback, scheduler):

Assist Warp Store: holds the instructions for the different assist warp routines.
Assist Warp Controller: the central point of control for triggering assist warps and squashing them; tracks progress for active assist warps.
Assist Warp Buffer: stages instructions from triggered assist warps for execution; helps enforce priorities.
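A rough sketch of how the three structures might be modeled in a cycle-level simulator; the sizes and fields are assumptions for illustration, not the hardware's actual parameters.

```cpp
// Illustrative data layout for the three CABA structures. Sizes and
// field names are assumptions, not the actual hardware parameters.
#include <array>
#include <cstdint>
#include <deque>

struct Instruction { uint64_t bits; };

// Assist Warp Store: read-only storage for assist warp routines.
struct AssistWarpStore {
    static constexpr int kMaxRoutines = 8, kMaxInsts = 32;
    std::array<std::array<Instruction, kMaxInsts>, kMaxRoutines> routine;
};

// Assist Warp Buffer: stages decoded assist-warp instructions so the
// issue stage can arbitrate between them and parent-warp instructions.
struct AssistWarpBuffer {
    struct Entry { Instruction inst; uint32_t parent_warp; uint8_t priority; };
    std::deque<Entry> staged;  // issued only when parents are not starved
};
```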
 
Other Functionality

In the paper:
- More details on the hardware structures
- Data communication and synchronization
- Enforcing priorities
CABA: Applications

Data compression, memoization, prefetching.
A Case for CABA: Data Compression

Data compression can help alleviate the memory bandwidth bottleneck by transmitting data in a more condensed form between the cores and the memory hierarchy.
CABA employs idle compute pipelines to perform compression.
Data Compression with CABA

Use assist warps to:
- Compress cache blocks before writing to memory
- Decompress cache blocks before placing them into the cache

CABA flexibly enables various compression algorithms.
Example: BDI compression [Pekhimenko+ PACT '12] is parallelizable across the SIMT width and has low latency.
Others: FPC [Alameldeen+ TR '04], C-Pack [Chen+ VLSI '10].
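As an illustration of why BDI maps well onto assist warps, here is a minimal sketch of one BDI encoding (8-byte base with 1-byte deltas) for a 64-byte cache block. The real design tries several base and delta sizes in parallel; the single encoding and the names here are simplifying assumptions.

```cpp
// Minimal sketch of one BDI encoding (base8-delta1), viewing a 64-byte
// cache block as eight 64-bit words. Names are illustrative.
#include <array>
#include <cstdint>
#include <optional>

struct Base8Delta1 {             // 8-byte base + eight 1-byte deltas = 16 bytes
    uint64_t base;
    std::array<int8_t, 8> delta;
};

std::optional<Base8Delta1> compress_b8d1(const uint64_t block[8]) {
    Base8Delta1 out{block[0], {}};
    for (int i = 0; i < 8; ++i) {
        int64_t d = static_cast<int64_t>(block[i] - out.base);
        if (d < INT8_MIN || d > INT8_MAX)
            return std::nullopt;         // deltas too large: stay uncompressed
        out.delta[i] = static_cast<int8_t>(d);
    }
    return out;                          // 64 bytes -> 16 bytes (4x ratio)
}

void decompress_b8d1(const Base8Delta1& c, uint64_t block[8]) {
    // Each word is independent, so an assist warp can expand the words
    // of a block in parallel across its SIMT lanes (one lane per word).
    for (int i = 0; i < 8; ++i)
        block[i] = c.base + static_cast<int64_t>(c.delta[i]);
}
```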
Walkthrough of Decompression

On an L1D hit, the block is already uncompressed and is served directly. On an L1D miss, the compressed block is fetched from L2/memory; the Assist Warp Controller triggers a decompression assist warp (its routine comes from the Assist Warp Store, and its instructions are handed to the scheduler), which decompresses the block on the cores before it is placed into the L1D.
Walkthrough of Compression

On a writeback from the L1D toward L2/memory, the Assist Warp Controller triggers a compression assist warp, which compresses the block on the cores before it is written out.
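Both walkthroughs reduce to two trigger points in the cache hierarchy. A minimal sketch, reusing the illustrative AssistWarpController and Event types from the earlier sketch; the cache-interface names are likewise assumed.

```cpp
// Minimal sketch of the two CABA trigger points for compression,
// reusing the illustrative AssistWarpController/Event above. The
// cache interface (CacheBlock, fill/writeback hooks) is assumed.
struct CacheBlock { uint32_t addr; bool compressed; };

// L1D fill path: a miss returns a compressed block from L2/memory;
// trigger a decompression assist warp before the block enters L1D.
void on_l1_fill(AssistWarpController& awc, const CacheBlock& blk,
                uint32_t parent_warp) {
    if (blk.compressed)
        awc.trigger(Event::L1MissCompressed, parent_warp);
}

// Writeback path: compress the block on the idle compute pipelines
// before it is sent toward L2/memory.
void on_l1_writeback(AssistWarpController& awc, uint32_t parent_warp) {
    awc.trigger(Event::WritebackToL2, parent_warp);
}
```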
Evaluation
Methodology

Simulator: GPGPU-Sim, GPUWattch
Workloads: Lonestar, Rodinia, MapReduce, CUDA SDK
System parameters:
- 15 SMs, 32 threads/warp, 48 warps/SM, 32768 registers, 32KB shared memory
- Core: 1.4GHz, GTO scheduler, 2 schedulers/SM
- Memory: 177.4GB/s bandwidth, 6 GDDR5 memory controllers, FR-FCFS scheduling
- Cache: L1 16KB, 4-way associative; L2 768KB, 16-way associative
Metrics:
- Performance: instructions per cycle (IPC)
- Bandwidth consumption: fraction of cycles the DRAM data bus is busy
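For clarity, here is how the two metrics fall out of raw counters; the counter names are assumptions about what a GPGPU-Sim-like simulator exposes, not its actual variable names.

```cpp
// How the two reported metrics derive from raw counters; the counter
// names are assumed, not GPGPU-Sim's actual variables.
#include <cstdint>

struct SimCounters {
    uint64_t instructions, core_cycles;   // per-run totals
    uint64_t dram_bus_busy, dram_cycles;  // DRAM data bus activity
};

double ipc(const SimCounters& c) {
    return double(c.instructions) / double(c.core_cycles);
}

double bandwidth_consumption(const SimCounters& c) {
    // Fraction of cycles the DRAM data bus is busy.
    return double(c.dram_bus_busy) / double(c.dram_cycles);
}
```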
Effect on Performance

CABA-BDI provides a 41.7% performance improvement over the baseline.
CABA achieves performance close to that of designs with no overhead for compression (No-Overhead-BDI).
Effect on Bandwidth Consumption

Data compression with CABA alleviates the memory bandwidth bottleneck.
Different Compression Algorithms

CABA is flexible: it improves performance with different compression algorithms (CABA-FPC, CABA-BDI, CABA-CPack, CABA-BestOfAll).
Other Results

CABA's performance is similar to that of pure hardware-based BDI compression.
CABA reduces the overall system energy (by 22%) by decreasing the off-chip memory traffic.
Other evaluations in the paper:
- Compression ratios
- Sensitivity to memory bandwidth
- Capacity compression
- Compression at different levels of the hierarchy
Conclusion

Observation: Imbalances in execution leave GPU resources underutilized.
Our Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.
Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
- A new framework to enable helper threading in GPUs
- Enables flexible data compression to alleviate the memory bandwidth bottleneck
- A wide set of use cases (e.g., prefetching, memoization)
Key Results: Using CABA to implement data compression in memory improves performance by 41.7%.
Backup Slides
Effect on Energy

CABA reduces the overall system energy by decreasing the off-chip memory traffic.
 
Effect on Compression Ratio

[Figure: compression ratios across workloads]
Other Uses of CABA

Hardware Memoization
Goal: avoid redundant computation by reusing previous results over the same or similar inputs.
Idea (see the sketch after this list):
- Hash the inputs at predefined points
- Use the load/store pipelines to save inputs in shared memory
- Eliminate redundant computation by loading stored results

Prefetching
Similar to CPU helper-thread prefetching.
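A minimal sketch of the memoization idea, assuming a direct-mapped result table (standing in for shared memory) and a simple hash of the inputs; all names and the hash function are illustrative, and the paper's actual scheme may differ.

```cpp
// Minimal sketch of hardware memoization: a direct-mapped result
// table keyed by a hash of the inputs. Names and hash are assumed.
#include <array>
#include <cstdint>
#include <optional>

struct MemoTable {
    struct Entry { uint64_t key = ~0ull; float result = 0.f; };
    std::array<Entry, 256> slots;   // stand-in for shared memory

    static uint64_t hash(uint64_t inputs) {   // cheap mixing function
        inputs ^= inputs >> 33;
        return inputs * 0xff51afd7ed558ccdull;
    }

    std::optional<float> lookup(uint64_t inputs) const {
        const Entry& e = slots[hash(inputs) & 255];
        if (e.key == inputs) return e.result;  // hit: skip the recompute
        return std::nullopt;                   // miss: compute, then insert
    }

    void insert(uint64_t inputs, float result) {
        Entry& e = slots[hash(inputs) & 255];
        e.key = inputs;
        e.result = result;
    }
};
```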
 
Slide Note

In this work, we observe that different bottlenecks and imbalances in execution leave different GPU resources idle.
Our goal is to employ these idle resources to accelerate different bottlenecks in execution.
To do this, we propose to employ lightweight helper threads.
However, there are key challenges that need to be addressed to enable helper threading in GPUs.
We propose a new framework, CABA, that effectively addresses these challenges.
The framework is flexible enough to be applied to a wide range of use cases.
In this work, we apply the framework to enable flexible data compression.
We find that CABA improves the performance of a wide range of the GPGPU applications evaluated.
