Managing DRAM Latency Divergence in Irregular GPGPU Applications

Niladrish Chatterjee
Mike O’Connor
Gabriel H. Loh
Nuwan Jayasena
Rajeev Balasubramonian
 
Irregular GPGPU Applications
 
Conventional GPGPU workloads access vector- or matrix-based data structures
Predictable strides, large data parallelism

Emerging irregular workloads
Pointer-based data structures & data-dependent memory accesses
Memory latency divergence on SIMT platforms

Warp-aware memory scheduling to reduce DRAM latency divergence
 
 
 
 
SIMT Execution Overview

[Figure: SIMT cores connect through an interconnect to memory partitions, each containing a memory controller, an L2 slice, and a GDDR5 channel. Warps of threads execute on each core's SIMD lanes, with a warp scheduler picking ready warps.]

Lockstep execution
Warp stalled on memory access
Memory Latency Divergence
 
Coalescer has limited efficacy in irregular workloads

Partial hits in L1 and L2
1st source of latency divergence

DRAM requests can have varied latencies
Warp stalled for the last request

[Figure: a load instruction from the 32 SIMD lanes passes through the access coalescing unit, L1, L2, and the GDDR5 DRAM; requests from a single warp return with divergent latencies.]

GPU Memory Controller (GMC)

Optimized for high throughput

Harvest channel and bank parallelism
Address mapping to spread cache lines across channels and banks

Achieve high row-buffer hit rate
Deep queuing
Aggressive reordering of requests for row-hit batching

Not cognizant of the need to service requests from a warp together
Interleaves requests from different warps, leading to latency divergence
 
 
 
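The interleaving problem follows from how a throughput-optimized GMC spreads addresses. As a rough illustration of the address-mapping bullet above (a minimal C++ sketch; the 128-byte line size and the modulo bit arithmetic are assumptions for illustration, not the paper's exact mapping, using the 6-channel, 16-banks-per-channel configuration evaluated later):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical GMC-style address mapping: spread consecutive cache
// lines across channels first, then banks, to harvest parallelism.
// Field arithmetic below is an illustrative assumption.
struct DramCoord {
    unsigned channel;  // which GDDR5 channel
    unsigned bank;     // which bank within the channel
    uint64_t row;      // row within the bank (held in the row buffer)
};

DramCoord map_address(uint64_t addr) {
    const uint64_t line = addr >> 7;   // 128-byte cache line (assumed)
    DramCoord c;
    c.channel = line % 6;              // 6 channels (evaluated config)
    c.bank    = (line / 6) % 16;       // 16 banks per channel
    c.row     = line / (6 * 16);       // remaining bits select the row
    return c;
}

int main() {
    // Consecutive cache lines land on different channels and banks.
    for (uint64_t a = 0; a < 8 * 128; a += 128) {
        DramCoord c = map_address(a);
        std::printf("addr %5llu -> ch %u bank %2u row %llu\n",
                    (unsigned long long)a, c.channel, c.bank,
                    (unsigned long long)c.row);
    }
    return 0;
}
```

With this kind of mapping, a warp's coalesced requests scatter across channels and banks, which is exactly why a per-channel scheduler that only chases row hits ends up interleaving different warps' requests.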
Warp-Aware Scheduling

[Figure: two SMs each issue a load (A: LD, B: LD) that generates four DRAM requests. Baseline GMC scheduling interleaves A's and B's requests at the memory controller, so both warps accumulate stall cycles before their A: Use / B: Use instructions. Warp-aware scheduling services each warp's requests back to back, letting one warp resume much earlier.]

Reduced Average Memory Stall Time
Impact of DRAM Latency Divergence

If all requests from a warp were returned in perfect sequence from the DRAM: ~40% improvement.

If there were only 1 request per warp: 5X improvement.
 
Key Idea
 
Form batches of requests from each warp: a warp-group

Schedule all requests from a warp-group together

Scheduling algorithm arbitrates between warp-groups to minimize average stall-time of warps
 
 
 
Controller Design

[Figure: organization of the warp-aware memory controller.]
 
Warp-Group Scheduling: Single Channel
 

[Figure: pending warp-groups feed a warp-group priority table. The transaction scheduler combines the number of requests in each warp-group, the row hit/miss status of those requests, and the queuing delay in the command queues, and picks the warp-group with the lowest runtime.]

 
Each warp-group assigned a priority
Reflects completion time of its last request

Higher priority to
Few requests
High spatial locality
Lightly loaded banks

Priorities updated dynamically

Transaction Scheduler picks warp-group with lowest run-time
Shortest-job-first based on actual service time
 
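In effect the transaction scheduler runs shortest-job-first over warp-groups, estimating actual service time from the three inputs above. A minimal sketch of such an estimator (the cycle costs and the max-over-banks cost model are illustrative assumptions, not the paper's exact hardware):

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Illustrative timing costs in DRAM cycles (assumed, not GDDR5 spec).
constexpr int ROW_HIT_COST  = 4;   // column access to an open row
constexpr int ROW_MISS_COST = 24;  // precharge + activate + column access

struct Request {
    int bank;      // destination bank in this channel
    bool row_hit;  // hits the currently open row?
};

struct WarpGroup {
    int id;
    std::vector<Request> reqs;
};

// Estimate completion time of the warp-group's *last* request:
// requests to the same bank serialize, so track a finish time per
// touched bank, seeded with that bank's existing command-queue delay.
int estimated_runtime(const WarpGroup& wg,
                      const std::vector<int>& bank_queue_delay) {
    std::map<int, int> finish;
    for (const Request& r : wg.reqs) {
        if (!finish.count(r.bank)) finish[r.bank] = bank_queue_delay[r.bank];
        finish[r.bank] += r.row_hit ? ROW_HIT_COST : ROW_MISS_COST;
    }
    int last = 0;
    for (const auto& kv : finish) last = std::max(last, kv.second);
    return last;
}

// Shortest-job-first: pick the warp-group whose last request finishes
// soonest under the estimate above.
const WarpGroup* pick_warp_group(const std::vector<WarpGroup>& pending,
                                 const std::vector<int>& bank_queue_delay) {
    const WarpGroup* best = nullptr;
    int best_rt = 0;
    for (const WarpGroup& wg : pending) {
        int rt = estimated_runtime(wg, bank_queue_delay);
        if (!best || rt < best_rt) { best = &wg; best_rt = rt; }
    }
    return best;
}
```

Note how all three priority criteria fall out of the estimate: fewer requests, more row hits, and lighter bank queues all shrink the predicted completion time of the group's last request.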
 
WG-scheduling

[Figure: design space plotted as latency divergence vs. bandwidth utilization. WG reduces latency divergence relative to the GMC baseline, moving toward the ideal, but gives up some bandwidth utilization.]
 
Multiple Memory Controllers
 
Channel-level parallelism
Warp's requests sent to multiple memory channels
Independent scheduling at each controller

Subset of a warp's requests can be delayed at one or a few memory controllers

Coordinate scheduling between controllers
Prioritize warp-groups that have already been serviced at other controllers
Coordination message broadcast to the other controllers on completion of a warp-group
 
 
 
 
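A sketch of how that coordination might fold into the priority computation (the message handling and the fixed priority boost are assumptions for illustration; the slides only specify that warp-groups already completed at other controllers are prioritized):

```cpp
#include <unordered_map>

// Hypothetical per-controller coordination state. When a controller
// finishes all requests of a warp-group, it broadcasts the warp-group
// id; receivers then favor that warp-group so the warp's remaining
// requests do not straggle at one channel.
class CoordinationTable {
public:
    // Called when a "warp-group done" message arrives from another channel.
    void on_remote_completion(int warp_group_id) {
        remote_done_count_[warp_group_id]++;
    }

    // Priority adjustment folded into the runtime estimate: warp-groups
    // already finished elsewhere get an (assumed) fixed boost per channel,
    // so a lower adjusted runtime means they are scheduled sooner.
    int adjusted_runtime(int warp_group_id, int base_runtime) const {
        auto it = remote_done_count_.find(warp_group_id);
        int boosts = (it == remote_done_count_.end()) ? 0 : it->second;
        return base_runtime - BOOST_PER_CHANNEL * boosts;
    }

private:
    static constexpr int BOOST_PER_CHANNEL = 8;  // illustrative constant
    std::unordered_map<int, int> remote_done_count_;
};
```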
Warp-Group Scheduling: Multi-Channel

[Figure: the single-channel design extended for coordination. The priority table additionally tracks the status of each warp-group in the other channels, updated by periodic messages about completed warp-groups; the transaction scheduler still picks the warp-group with the lowest runtime.]
WG-M Scheduling

[Figure: WG-M moves further toward the ideal on latency divergence than WG, with bandwidth utilization similar to WG.]

 
Bandwidth-Aware Warp-Group Scheduling
 
Warp-group scheduling negatively affects bandwidth utilization
Reduced row-hit rate
 
Conflicting objectives
Issue row-miss request from current warp-group
Issue row-hit requests to maintain bus utilization
 
Activate and Precharge idle cycles
Hidden by row-hits in other banks
 
Delay row-miss request to find the right slot
 
 
 
 
 
 
 
 
 
 
 
 
 
Bandwidth-Aware Warp-Group Scheduling
 
 
 
The minimum number of row-hits needed in other banks to overlap (tRTP + tRP + tRCD)
Determined by GDDR5 timing parameters
Minimum efficient row burst (MERB)

Stored in a ROM looked up by the Transaction Scheduler

More banks with pending row-hits → smaller MERB

Schedule row-miss after MERB row-hits have been issued to the bank
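Worked through with illustrative numbers, the MERB table a small ROM would hold can be precomputed from the timing parameters. In this sketch the GDDR5 timings are assumed values (not taken from the paper), and dividing the required row-hits across the banks that have them pending is one plausible reading of the lookup:

```cpp
#include <cstdio>

// Illustrative GDDR5-like timing parameters in DRAM clock cycles
// (assumed values for the sketch; real parts differ).
constexpr int tRTP   = 6;   // read-to-precharge delay
constexpr int tRP    = 12;  // precharge (row close)
constexpr int tRCD   = 12;  // activate (row open) to column command
constexpr int tBURST = 4;   // data-bus cycles per column read burst

// MERB: minimum number of row-hit bursts, spread across the banks with
// pending row-hits, needed to keep the data bus busy while one bank
// performs precharge + activate for a row-miss.
int merb(int banks_with_row_hits) {
    const int idle = tRTP + tRP + tRCD;            // cycles to hide
    const int hits = (idle + tBURST - 1) / tBURST; // ceil(idle / tBURST)
    // More banks with pending row-hits -> fewer bursts needed from each
    // before the row-miss can be issued without a bus gap.
    return (hits + banks_with_row_hits - 1) / banks_with_row_hits;
}

int main() {
    // The kind of table the transaction scheduler's ROM would hold.
    for (int b = 1; b <= 8; ++b)
        std::printf("banks with row-hits: %d -> MERB: %d\n", b, merb(b));
    return 0;
}
```

With these assumed timings the 30 idle cycles need ceil(30/4) = 8 row-hit bursts in total, so MERB shrinks from 8 with one hit-supplying bank down to 1 with eight.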
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
WG-Bw Scheduling

[Figure: WG-Bw restores bandwidth utilization toward the GMC baseline while keeping the latency-divergence gains of WG and WG-M.]

 
Warp-Aware Write Draining
 
Writes drained in batches, starting at High_Watermark

Can stall small warp-groups

When WQ occupancy reaches a threshold (lower than High_Watermark)
Drain singleton warp-groups only

Reduces write-induced latency
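
A minimal sketch of that two-level drain policy (the watermark values and the single-write definition of a singleton warp-group are assumptions for illustration):

```cpp
#include <cstddef>
#include <deque>

// Hypothetical write-queue drain control. Reads normally have priority;
// writes drain in batches once the queue fills past HIGH_WATERMARK.
// Between LOW_THRESHOLD and HIGH_WATERMARK, only "singleton"
// warp-groups (assumed here to mean a single write) drain, so small
// groups are not stalled behind a long batch drain.
constexpr std::size_t HIGH_WATERMARK = 48;  // full batch drain (assumed)
constexpr std::size_t LOW_THRESHOLD  = 32;  // singleton-only drain (assumed)

struct Write { int warp_group_id; int warp_group_size; };

enum class DrainMode { None, SingletonsOnly, FullBatch };

DrainMode drain_mode(const std::deque<Write>& wq) {
    if (wq.size() >= HIGH_WATERMARK) return DrainMode::FullBatch;
    if (wq.size() >= LOW_THRESHOLD)  return DrainMode::SingletonsOnly;
    return DrainMode::None;
}

bool may_issue(const Write& w, DrainMode mode) {
    switch (mode) {
        case DrainMode::FullBatch:      return true;
        case DrainMode::SingletonsOnly: return w.warp_group_size == 1;
        default:                        return false;
    }
}
```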
 
 
 
 
 
 
 
 
WG-scheduling

[Figure: with WG-W added, the progression GMC baseline → WG → WG-M → WG-Bw → WG-W approaches the ideal on both latency divergence and bandwidth utilization.]

 
Methodology
 
GPGPU-Sim v3.1: cycle-accurate GPGPU simulator

  SM Cores             30
  Max Threads/Core     1024
  Warp Size            32 threads/warp
  L1 / L2              32 KB / 128 KB
  DRAM                 6 Gbps GDDR5
  Channels / Banks     6 channels, 16 banks/channel

USIMM v1.3: cycle-accurate DRAM simulator
Modified to model the GMC baseline & GDDR5 timings

Irregular and regular workloads from Parboil, Rodinia, Lonestar, and MARS
 
Performance Improvement
 
 
[Figure: IPC normalized to the GMC baseline for WG, WG-M, WG-Bw, and WG-W.]

Reduced latency divergence
Restored bandwidth utilization
 
Impact on Regular Workloads
 
 

Effective coalescing
High spatial locality within each warp-group

WG scheduling behaves like the GMC baseline
No performance loss

WG-Bw and WG-W provide minor benefits
 
 
 
 
 
 
 
Energy Impact of Reduced Row Hit-Rate
 
Scheduling row-misses over row-hits
Reduces the row-buffer hit rate by 16%

In GDDR5, power consumption is dominated by I/O

[Figure: GDDR5 energy per bit (pJ/bit) for the baseline vs. WG-Bw, broken down into I/O, row, column, control, DLL, and background components.]

Increase in DRAM power negligible compared to the execution speed-up
Net improvement in system energy
 
Conclusions
 
Irregular applications place new demands on the GPU’s memory system

Memory scheduling can alleviate the issues caused by latency divergence

Carefully orchestrating the scheduling of commands can help regain the bandwidth lost by warp-aware scheduling

Future techniques must also include the cache hierarchy in reducing latency divergence
 
Thanks!
 
 
Backup Slides
 
 
Performance Improvement: IPC

Average Warp Stall Latency

DRAM Latency Divergence

Bandwidth Utilization

Memory Controller Microarchitecture
 
Warp-Group Scheduling
 
Every batch assigned a priority score
Completion time of its longest request

Higher priority to warp-groups with
Few requests
High spatial locality
Lightly loaded banks

Priorities updated after each warp-group is scheduled

Warp-group with lowest service time selected
Shortest-job-first based on actual service time, not number of requests