Managing DRAM Latency Divergence in Irregular GPGPU Applications

Niladrish Chatterjee
Mike O’Connor
Gabriel H. Loh
Nuwan Jayasena
Rajeev Balasubramonian
 
Irregular GPGPU Applications
 
Conventional GPGPU workloads access vector- or matrix-based data structures
Predictable strides, large data parallelism

Emerging irregular workloads
Pointer-based data structures & data-dependent memory accesses
Memory latency divergence on SIMT platforms

Warp-aware memory scheduling to reduce DRAM latency divergence
 
 
 
 
SIMT Execution Overview

[Figure: SIMT cores connect through an interconnect to memory partitions, each containing a memory controller, an L2 slice, and a GDDR5 channel. Warps of threads execute on each core's SIMD lanes, with a warp scheduler picking ready warps.]

Lockstep execution
Warp stalled on memory access
Memory Latency Divergence
 
Coalescer has limited efficacy in irregular workloads

Partial hits in L1 and L2
1st source of latency divergence

DRAM requests can have varied latencies
Warp stalled for the last request

[Figure: a load instruction from the 32 SIMD lanes passes through the access coalescing unit, L1, L2, and the GDDR5 DRAM; requests from a single warp return with divergent latencies.]

GPU Memory Controller (GMC)

Optimized for high throughput

Harvest channel and bank parallelism
Address mapping to spread cache lines across channels and banks

Achieve high row-buffer hit rate
Deep queuing
Aggressive reordering of requests for row-hit batching

Not cognizant of the need to service requests from a warp together
Interleaves requests from different warps, leading to latency divergence
 
 
 
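The interleaving problem follows from how a throughput-optimized GMC spreads addresses. As a rough illustration of the address-mapping bullet above (a minimal C++ sketch; the 128-byte line size and the modulo bit arithmetic are assumptions for illustration, not the paper's exact mapping, using the 6-channel, 16-banks-per-channel configuration evaluated later):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical GMC-style address mapping: spread consecutive cache
// lines across channels first, then banks, to harvest parallelism.
// Field arithmetic below is an illustrative assumption.
struct DramCoord {
    unsigned channel;  // which GDDR5 channel
    unsigned bank;     // which bank within the channel
    uint64_t row;      // row within the bank (held in the row buffer)
};

DramCoord map_address(uint64_t addr) {
    const uint64_t line = addr >> 7;   // 128-byte cache line (assumed)
    DramCoord c;
    c.channel = line % 6;              // 6 channels (evaluated config)
    c.bank    = (line / 6) % 16;       // 16 banks per channel
    c.row     = line / (6 * 16);       // remaining bits select the row
    return c;
}

int main() {
    // Consecutive cache lines land on different channels and banks.
    for (uint64_t a = 0; a < 8 * 128; a += 128) {
        DramCoord c = map_address(a);
        std::printf("addr %5llu -> ch %u bank %2u row %llu\n",
                    (unsigned long long)a, c.channel, c.bank,
                    (unsigned long long)c.row);
    }
    return 0;
}
```

With this kind of mapping, a warp's coalesced requests scatter across channels and banks, which is exactly why a per-channel scheduler that only chases row hits ends up interleaving different warps' requests.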
Warp-Aware Scheduling

[Figure: two SMs each issue a load (A: LD, B: LD) that generates four DRAM requests. Baseline GMC scheduling interleaves A's and B's requests at the memory controller, so both warps accumulate stall cycles before their A: Use / B: Use instructions. Warp-aware scheduling services each warp's requests back to back, letting one warp resume much earlier.]

Reduced Average Memory Stall Time
Impact of DRAM Latency Divergence

If all requests from a warp were returned in perfect sequence from the DRAM: ~40% improvement.

If there were only 1 request per warp: 5X improvement.
 
Key Idea
 
Form batches of requests from each warp: a warp-group

Schedule all requests from a warp-group together

Scheduling algorithm arbitrates between warp-groups to minimize average stall-time of warps
 
 
 
Controller Design

[Figure: organization of the warp-aware memory controller.]
 
Warp-Group Scheduling: Single Channel
 

[Figure: pending warp-groups feed a warp-group priority table. The transaction scheduler combines the number of requests in each warp-group, the row hit/miss status of those requests, and the queuing delay in the command queues, and picks the warp-group with the lowest runtime.]

 
Each warp-group assigned a priority
Reflects completion time of its last request

Higher priority to
Few requests
High spatial locality
Lightly loaded banks

Priorities updated dynamically

Transaction Scheduler picks warp-group with lowest run-time
Shortest-job-first based on actual service time
 
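In effect the transaction scheduler runs shortest-job-first over warp-groups, estimating actual service time from the three inputs above. A minimal sketch of such an estimator (the cycle costs and the max-over-banks cost model are illustrative assumptions, not the paper's exact hardware):

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Illustrative timing costs in DRAM cycles (assumed, not GDDR5 spec).
constexpr int ROW_HIT_COST  = 4;   // column access to an open row
constexpr int ROW_MISS_COST = 24;  // precharge + activate + column access

struct Request {
    int bank;      // destination bank in this channel
    bool row_hit;  // hits the currently open row?
};

struct WarpGroup {
    int id;
    std::vector<Request> reqs;
};

// Estimate completion time of the warp-group's *last* request:
// requests to the same bank serialize, so track a finish time per
// touched bank, seeded with that bank's existing command-queue delay.
int estimated_runtime(const WarpGroup& wg,
                      const std::vector<int>& bank_queue_delay) {
    std::map<int, int> finish;
    for (const Request& r : wg.reqs) {
        if (!finish.count(r.bank)) finish[r.bank] = bank_queue_delay[r.bank];
        finish[r.bank] += r.row_hit ? ROW_HIT_COST : ROW_MISS_COST;
    }
    int last = 0;
    for (const auto& kv : finish) last = std::max(last, kv.second);
    return last;
}

// Shortest-job-first: pick the warp-group whose last request finishes
// soonest under the estimate above.
const WarpGroup* pick_warp_group(const std::vector<WarpGroup>& pending,
                                 const std::vector<int>& bank_queue_delay) {
    const WarpGroup* best = nullptr;
    int best_rt = 0;
    for (const WarpGroup& wg : pending) {
        int rt = estimated_runtime(wg, bank_queue_delay);
        if (!best || rt < best_rt) { best = &wg; best_rt = rt; }
    }
    return best;
}
```

Note how all three priority criteria fall out of the estimate: fewer requests, more row hits, and lighter bank queues all shrink the predicted completion time of the group's last request.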
 
WG-scheduling

[Figure: design space plotted as latency divergence vs. bandwidth utilization. WG reduces latency divergence relative to the GMC baseline, moving toward the ideal, but gives up some bandwidth utilization.]
 
Multiple Memory Controllers
 
Channel-level parallelism
Warp's requests sent to multiple memory channels
Independent scheduling at each controller

Subset of a warp's requests can be delayed at one or a few memory controllers

Coordinate scheduling between controllers
Prioritize warp-groups that have already been serviced at other controllers
Coordination message broadcast to the other controllers on completion of a warp-group
 
 
 
 
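A sketch of how that coordination might fold into the priority computation (the message handling and the fixed priority boost are assumptions for illustration; the slides only specify that warp-groups already completed at other controllers are prioritized):

```cpp
#include <unordered_map>

// Hypothetical per-controller coordination state. When a controller
// finishes all requests of a warp-group, it broadcasts the warp-group
// id; receivers then favor that warp-group so the warp's remaining
// requests do not straggle at one channel.
class CoordinationTable {
public:
    // Called when a "warp-group done" message arrives from another channel.
    void on_remote_completion(int warp_group_id) {
        remote_done_count_[warp_group_id]++;
    }

    // Priority adjustment folded into the runtime estimate: warp-groups
    // already finished elsewhere get an (assumed) fixed boost per channel,
    // so a lower adjusted runtime means they are scheduled sooner.
    int adjusted_runtime(int warp_group_id, int base_runtime) const {
        auto it = remote_done_count_.find(warp_group_id);
        int boosts = (it == remote_done_count_.end()) ? 0 : it->second;
        return base_runtime - BOOST_PER_CHANNEL * boosts;
    }

private:
    static constexpr int BOOST_PER_CHANNEL = 8;  // illustrative constant
    std::unordered_map<int, int> remote_done_count_;
};
```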
Warp-Group Scheduling: Multi-Channel

[Figure: the single-channel design extended for coordination. The priority table additionally tracks the status of each warp-group in the other channels, updated by periodic messages about completed warp-groups; the transaction scheduler still picks the warp-group with the lowest runtime.]
WG-M Scheduling

[Figure: WG-M moves further toward the ideal on latency divergence than WG, with bandwidth utilization similar to WG.]

 
Bandwidth-Aware Warp-Group Scheduling
 
Warp-group scheduling negatively affects bandwidth utilization
Reduced row-hit rate
 
Conflicting objectives
Issue row-miss request from current warp-group
Issue row-hit requests to maintain bus utilization
 
Activate and Precharge idle cycles
Hidden by row-hits in other banks
 
Delay row-miss request to find the right slot
 
 
 
 
 
 
 
 
 
 
 
 
 
Bandwidth-Aware Warp-Group Scheduling
 
 
 
The minimum number of row-hits needed in other banks to overlap (tRTP + tRP + tRCD)
Determined by GDDR5 timing parameters
Minimum efficient row burst (MERB)

Stored in a ROM looked up by the Transaction Scheduler

More banks with pending row-hits → smaller MERB

Schedule row-miss after MERB row-hits have been issued to the bank
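Worked through with illustrative numbers, the MERB table a small ROM would hold can be precomputed from the timing parameters. In this sketch the GDDR5 timings are assumed values (not taken from the paper), and dividing the required row-hits across the banks that have them pending is one plausible reading of the lookup:

```cpp
#include <cstdio>

// Illustrative GDDR5-like timing parameters in DRAM clock cycles
// (assumed values for the sketch; real parts differ).
constexpr int tRTP   = 6;   // read-to-precharge delay
constexpr int tRP    = 12;  // precharge (row close)
constexpr int tRCD   = 12;  // activate (row open) to column command
constexpr int tBURST = 4;   // data-bus cycles per column read burst

// MERB: minimum number of row-hit bursts, spread across the banks with
// pending row-hits, needed to keep the data bus busy while one bank
// performs precharge + activate for a row-miss.
int merb(int banks_with_row_hits) {
    const int idle = tRTP + tRP + tRCD;            // cycles to hide
    const int hits = (idle + tBURST - 1) / tBURST; // ceil(idle / tBURST)
    // More banks with pending row-hits -> fewer bursts needed from each
    // before the row-miss can be issued without a bus gap.
    return (hits + banks_with_row_hits - 1) / banks_with_row_hits;
}

int main() {
    // The kind of table the transaction scheduler's ROM would hold.
    for (int b = 1; b <= 8; ++b)
        std::printf("banks with row-hits: %d -> MERB: %d\n", b, merb(b));
    return 0;
}
```

With these assumed timings the 30 idle cycles need ceil(30/4) = 8 row-hit bursts in total, so MERB shrinks from 8 with one hit-supplying bank down to 1 with eight.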
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
WG-Bw Scheduling

[Figure: WG-Bw restores bandwidth utilization toward the GMC baseline while keeping the latency-divergence gains of WG and WG-M.]

 
Warp-Aware Write Draining
 
Writes drained in batches, starting at High_Watermark

Can stall small warp-groups

When WQ occupancy reaches a threshold (lower than High_Watermark)
Drain singleton warp-groups only

Reduces write-induced latency
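
A minimal sketch of that two-level drain policy (the watermark values and the single-write definition of a singleton warp-group are assumptions for illustration):

```cpp
#include <cstddef>
#include <deque>

// Hypothetical write-queue drain control. Reads normally have priority;
// writes drain in batches once the queue fills past HIGH_WATERMARK.
// Between LOW_THRESHOLD and HIGH_WATERMARK, only "singleton"
// warp-groups (assumed here to mean a single write) drain, so small
// groups are not stalled behind a long batch drain.
constexpr std::size_t HIGH_WATERMARK = 48;  // full batch drain (assumed)
constexpr std::size_t LOW_THRESHOLD  = 32;  // singleton-only drain (assumed)

struct Write { int warp_group_id; int warp_group_size; };

enum class DrainMode { None, SingletonsOnly, FullBatch };

DrainMode drain_mode(const std::deque<Write>& wq) {
    if (wq.size() >= HIGH_WATERMARK) return DrainMode::FullBatch;
    if (wq.size() >= LOW_THRESHOLD)  return DrainMode::SingletonsOnly;
    return DrainMode::None;
}

bool may_issue(const Write& w, DrainMode mode) {
    switch (mode) {
        case DrainMode::FullBatch:      return true;
        case DrainMode::SingletonsOnly: return w.warp_group_size == 1;
        default:                        return false;
    }
}
```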
 
 
 
 
 
 
 
 
WG-scheduling

[Figure: with WG-W added, the progression GMC baseline → WG → WG-M → WG-Bw → WG-W approaches the ideal on both latency divergence and bandwidth utilization.]

 
Methodology
 
GPGPU-Sim v3.1: cycle-accurate GPGPU simulator

  SM Cores             30
  Max Threads/Core     1024
  Warp Size            32 threads/warp
  L1 / L2              32 KB / 128 KB
  DRAM                 6 Gbps GDDR5
  Channels / Banks     6 channels, 16 banks/channel

USIMM v1.3: cycle-accurate DRAM simulator
Modified to model the GMC baseline & GDDR5 timings

Irregular and regular workloads from Parboil, Rodinia, Lonestar, and MARS
 
Performance Improvement
 
 
[Figure: IPC normalized to the GMC baseline for WG, WG-M, WG-Bw, and WG-W.]

Reduced latency divergence
Restored bandwidth utilization
 
Impact on Regular Workloads
 
 

Effective coalescing
High spatial locality within each warp-group

WG scheduling behaves like the GMC baseline
No performance loss

WG-Bw and WG-W provide minor benefits
 
 
 
 
 
 
 
Energy Impact of Reduced Row Hit-Rate
 
Scheduling row-misses over row-hits
Reduces the row-buffer hit rate by 16%

In GDDR5, power consumption is dominated by I/O

[Figure: GDDR5 energy per bit (pJ/bit) for the baseline vs. WG-Bw, broken down into I/O, row, column, control, DLL, and background components.]

Increase in DRAM power negligible compared to the execution speed-up
Net improvement in system energy
 
Conclusions
 
Irregular applications place new demands on the GPU’s memory system

Memory scheduling can alleviate the issues caused by latency divergence

Carefully orchestrating the scheduling of commands can help regain the bandwidth lost by warp-aware scheduling

Future techniques must also include the cache hierarchy in reducing latency divergence
 
Thanks!
 
 
Backup Slides
 
 
Performance Improvement: IPC

Average Warp Stall Latency

DRAM Latency Divergence

Bandwidth Utilization

Memory Controller Microarchitecture
 
Warp-Group Scheduling
 
Every batch assigned a priority score
Completion time of its longest request

Higher priority to warp-groups with
Few requests
High spatial locality
Lightly loaded banks

Priorities updated after each warp-group is scheduled

Warp-group with lowest service time selected
Shortest-job-first based on actual service time, not number of requests