GPU Scheduling Strategies: Maximizing Performance with Cache-Conscious Wavefront Scheduling

 
Warp Scheduling Basics
 
 
Loose Round Robin (LRR)
 
Goes around to every warp and issues if it is ready (R).

If a warp is not ready (W), skip it and issue the next ready warp.

Issue: warps all progress at roughly the same rate, potentially all reaching their memory access phase together and stalling.
[Figure: a pool of all warps (R = ready, W = waiting) feeds a Select stage, which issues one ready warp per cycle to the execution units.]
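A minimal sketch of this selection loop, using a hypothetical Warp struct and ready flag (an illustration, not any real scheduler's code):

    #include <optional>
    #include <vector>

    struct Warp { int id; bool ready; };  // R = ready, W = waiting

    // Loose round robin: rotate the starting point each cycle and issue the
    // first ready warp found; warps that are not ready are simply skipped.
    std::optional<int> lrr_select(const std::vector<Warp>& warps, int last_issued) {
        const int n = static_cast<int>(warps.size());
        for (int step = 1; step <= n; ++step) {
            int idx = (last_issued + step) % n;
            if (warps[idx].ready) return idx;  // issue this warp
        }
        return std::nullopt;                   // nothing ready this cycle
    }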
 
Two-level (TL)
 
Warps are grouped into two sets:
Pending warps (potentially waiting on a long-latency instruction)
Active warps (ready to execute)

Warps move between the Pending and Active groups.
Within the Active group, issue with LRR.

Goal: overlap warps performing computation with warps performing memory access, as in the sketch below.
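A minimal sketch of the two-level grouping, assuming hypothetical ready and long-latency flags on each warp (a simplification of the real policy):

    #include <cstddef>
    #include <deque>

    struct Warp { int id; bool ready; bool waiting_long_latency; };

    struct TwoLevelScheduler {
        std::deque<Warp*> active;    // small set, issued round-robin
        std::deque<Warp*> pending;   // waiting on long-latency instructions
        std::size_t max_active;

        Warp* select() {
            // Demote active warps now waiting on a long-latency instruction.
            for (std::size_t i = 0; i < active.size();) {
                if (active[i]->waiting_long_latency) {
                    pending.push_back(active[i]);
                    active.erase(active.begin() + i);
                } else {
                    ++i;
                }
            }
            // Promote ready pending warps while the active set has room.
            for (std::size_t i = 0; i < pending.size() && active.size() < max_active;) {
                if (pending[i]->ready) {
                    active.push_back(pending[i]);
                    pending.erase(pending.begin() + i);
                } else {
                    ++i;
                }
            }
            // Loose round robin within the active set: rotate and issue the
            // first ready warp found.
            for (std::size_t i = 0; i < active.size(); ++i) {
                Warp* w = active.front();
                active.pop_front();
                active.push_back(w);
                if (w->ready) return w;
            }
            return nullptr;  // no active warp ready this cycle
        }
    };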
 
Greedy-then-oldest (GTO)
 
Schedule from a single warp until it stalls.

Then pick the oldest warp (oldest = earliest time the warp was assigned to the core).

Goal: improve cache locality for the greedy warp. A minimal selection sketch follows.
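A sketch of this selection rule, with a hypothetical Warp struct carrying a ready flag, a stall flag, and the time it was assigned to the core (an illustration only):

    #include <vector>

    struct Warp { int id; bool ready; bool stalled; long assign_time; };

    // Greedy-then-oldest: keep issuing from the current (greedy) warp until
    // it stalls, then fall back to the oldest ready warp.
    int gto_select(const std::vector<Warp>& warps, int& greedy) {
        if (greedy >= 0 && warps[greedy].ready && !warps[greedy].stalled)
            return greedy;                         // stay with the greedy warp
        int oldest = -1;
        for (int i = 0; i < static_cast<int>(warps.size()); ++i)
            if (warps[i].ready && !warps[i].stalled &&
                (oldest < 0 || warps[i].assign_time < warps[oldest].assign_time))
                oldest = i;
        greedy = oldest;                           // becomes the new greedy warp
        return oldest;                             // -1 if nothing is ready
    }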
Cache-Conscious Wavefront Scheduling
Timothy G. Rogers¹, Mike O’Connor², Tor M. Aamodt¹
¹The University of British Columbia   ²AMD Research
Wavefronts and Caches

High-level overview of a GPU:
Tens of thousands of concurrent threads
High-bandwidth memory system
Includes data caches

[Figure: compute units, each containing a wavefront scheduler with wavefronts W1, W2, ..., ALUs, and a memory unit with an L1D cache; the compute units share an L2 cache and connect to DRAM.]
Motivation
 
Improve the performance of highly parallel applications with irregular or data-dependent access patterns on the GPU.

These workloads can be highly cache-sensitive: increasing the 32 kB L1D to 8 MB gives a minimum 3x speedup and a mean speedup of more than 5x.

Benchmarks: Breadth First Search (BFS), K-Means (KMN), Memcached-GPU (MEMC), Parallel Garbage Collection (GC)
Where does the locality come from?
 
Classify two types of locality
 
Intra-wavefront locality
 
Inter-wavefront locality
 
[Figure: intra-wavefront locality: Wave0 loads cache line X and later loads X again, hitting in the data cache. Inter-wavefront locality: Wave0 loads X and Wave1 later loads X, hitting in the data cache.]
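A small sketch of how this classification could be made from an access trace; it treats every reuse of a line as a hit (ignoring cache capacity), which is a simplification, and the struct and field names are hypothetical:

    #include <cstdint>
    #include <unordered_map>

    // Reuse by the same wavefront counts as intra-wavefront locality;
    // reuse by a different wavefront counts as inter-wavefront locality.
    struct LocalityCounter {
        std::unordered_map<uint64_t, int> last_toucher;  // cache line -> wavefront id
        long intra_hits = 0, inter_hits = 0, cold_misses = 0;

        void access(int wavefront_id, uint64_t line_addr) {
            auto it = last_toucher.find(line_addr);
            if (it == last_toucher.end())
                ++cold_misses;                // first touch of this line
            else if (it->second == wavefront_id)
                ++intra_hits;                 // reuse by the same wavefront
            else
                ++inter_hits;                 // reuse by a different wavefront
            last_toucher[line_addr] = wavefront_id;
        }
    };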
 
 
Quantifying intra-/inter-wavefront locality

[Chart: hits and misses per thousand instructions (PKI), averaged over the highly cache-sensitive benchmarks, broken into misses PKI, inter-wavefront hits PKI, and intra-wavefront hits PKI; most of the locality is intra-wavefront.]
Observation
Issue-level scheduler chooses the access stream
[Figure: Wave0 repeatedly loads lines A,B,C,D and Wave1 repeatedly loads Z,Y,X,W. With a round-robin scheduler the memory system sees the two wavefronts' loads interleaved (A,B,C,D then Z,Y,X,W then A,B,C,D again); with a greedy-then-oldest scheduler, Wave0's repeated loads of A,B,C,D reach the memory system back-to-back before Wave1 issues.]
 
Difficult Access Stream

Need a better replacement policy?

The round-robin scheduler produces the stream A,B,C,D (W0), E,F,G,H (W1), I,J,K,L (W2), then the same sequence again as each wavefront reuses its lines. On this stream, even Belady-optimal replacement achieves only 4 hits in a small (four-line) cache.

If the scheduler instead keeps each wavefront's reuse together, producing A,B,C,D (W0), A,B,C,D (W0), E,F,G,H (W1), E,F,G,H (W1), I,J,K,L (W2), I,J,K,L (W2), then plain LRU replacement achieves 12 hits. A small simulation of the LRU case follows.
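A minimal sketch, assuming a four-entry fully associative cache as in the example above; it reproduces the LRU side of the comparison (the 4-hit figure requires Belady-optimal replacement, which this sketch does not model, and under LRU the round-robin ordering gets no hits at all):

    #include <algorithm>
    #include <cstdio>
    #include <list>
    #include <vector>

    // Count hits for an access stream in a small fully associative LRU cache.
    int lru_hits(const std::vector<char>& stream, std::size_t capacity) {
        std::list<char> cache;  // front = most recently used
        int hits = 0;
        for (char line : stream) {
            auto it = std::find(cache.begin(), cache.end(), line);
            if (it != cache.end()) { ++hits; cache.erase(it); }
            else if (cache.size() == capacity) cache.pop_back();  // evict LRU line
            cache.push_front(line);
        }
        return hits;
    }

    int main() {
        // Round-robin ordering: W0, W1, W2 interleaved, then repeated.
        std::vector<char> rr = {'A','B','C','D','E','F','G','H','I','J','K','L',
                                'A','B','C','D','E','F','G','H','I','J','K','L'};
        // Scheduler-shaped ordering: each wavefront's reuse kept back-to-back.
        std::vector<char> shaped = {'A','B','C','D','A','B','C','D',
                                    'E','F','G','H','E','F','G','H',
                                    'I','J','K','L','I','J','K','L'};
        std::printf("RR ordering, LRU:     %d hits\n", lru_hits(rr, 4));      // 0 hits
        std::printf("Shaped ordering, LRU: %d hits\n", lru_hits(shaped, 4));  // 12 hits
    }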
Why miss rate is more sensitive to scheduling than replacement
 
 
1024 threads in flight generate thousands of memory accesses.

The wavefront scheduler's decision therefore picks from thousands of potential accesses, while the replacement policy's decision is limited to one of A possible ways (the cache's associativity, e.g. 8 here).

[Figure: wavefronts W0 through W31, each with outstanding loads (Ld A, Ld B, Ld C, ...), feeding the wavefront scheduler, versus a replacement policy choosing among ways 1, 2, ..., A of a cache set.]
 
Does this ever happen?

Consider two simple schedulers, with LRU and with Belady-optimal (oracle) replacement:
Loose Round Robin with LRU
Loose Round Robin with Belady-optimal replacement
Greedy Then Oldest with LRU

[Chart: MPKI averaged over the highly cache-sensitive benchmarks for these configurations.]
Key Idea
Use the wavefront scheduler to shape the access pattern.
[Figure: access streams produced by the greedy-then-oldest scheduler versus the cache-conscious wavefront scheduler for Wave0 (repeated ld A,B,C,D) and Wave1 (repeated ld Z,Y,X,W); CCWS times each wavefront's issue so that its repeated accesses reach the memory system close together.]
CCWS Components

Locality Scoring System
Balances cache miss rate and overall throughput

Lost Locality Detector
Detects when wavefronts have lost intra-wavefront locality
L1 victim tags organized by wavefront ID

[Figure: per-wavefront locality scores for W0, W1, and W2 plotted over time, and a victim tag array with separate tag entries for each wavefront.]

More details in the paper. A minimal sketch of the detector follows.
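A hypothetical sketch of the lost locality detector: one victim tag set per wavefront, filled on L1 evictions and probed on that wavefront's later misses. A real implementation would use small, bounded victim tag arrays rather than unbounded sets; names here are illustrative:

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    struct LostLocalityDetector {
        std::vector<std::unordered_set<uint64_t>> victim_tags;  // indexed by wavefront id

        explicit LostLocalityDetector(int num_wavefronts)
            : victim_tags(num_wavefronts) {}

        // On an L1 eviction, record the victim line's tag under the evicting
        // wavefront's ID.
        void on_eviction(int wid, uint64_t tag) { victim_tags[wid].insert(tag); }

        // On a miss, a probe hit means this wavefront itself touched the line
        // before it was evicted, i.e. it has lost intra-wavefront locality.
        bool on_miss(int wid, uint64_t tag) const {
            return victim_tags[wid].count(tag) > 0;
        }
    };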
CCWS Implementation

[Figure: the wave scheduler and locality scoring system sit in front of the memory unit, which contains the L1 cache and the per-wavefront victim tag arrays. Worked example: W0 issues ld X; when line X is evicted from the L1, its tag is written into W0's victim tags as (W0, X). Later W0 issues ld X again and misses; the miss probes W0's victim tags, the probe hits, and W0 is detected to have lost locality. Its locality score rises, and while it is elevated the scheduler prevents W2 from issuing loads ("No W2 loads", e.g. W2: ld Y is held back).]

More details in the paper. A sketch of the score-based load throttle follows.
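A hypothetical sketch of the score-based throttle, a simplification of the paper's point system: each wavefront's locality score occupies part of a fixed budget, wavefronts are considered highest-score first, and wavefronts that no longer fit under the budget may not issue loads, shrinking the set of wavefronts competing for the L1. Function and parameter names are illustrative:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    std::vector<bool> loads_allowed(const std::vector<int>& score, int budget) {
        std::vector<int> order(score.size());
        std::iota(order.begin(), order.end(), 0);
        // Consider wavefronts in decreasing order of locality score.
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return score[a] > score[b]; });
        std::vector<bool> allowed(score.size(), false);
        int used = 0;
        for (std::size_t i = 0; i < order.size(); ++i) {
            used += score[order[i]];
            // Highest-score wavefront always keeps load-issue rights; the rest
            // are allowed only while the running total fits under the budget.
            if (i == 0 || used <= budget) allowed[order[i]] = true;
        }
        return allowed;
    }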
Methodology
 
GPGPU-Sim (version 3.1.0)
 
30 Compute Units (1.3 GHz)
32 wavefront contexts (1024 threads total)
32 kB L1D cache per compute unit (8-way, 128 B lines, LRU replacement)
1 MB unified L2 cache
 
Stand Alone GPGPU-Sim Cache Simulator
 
Trace-based cache simulator
Fed GPGPU-Sim traces
Used for oracle replacement
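The stand-alone trace-driven simulator is what makes oracle (Belady) replacement measurable. Below is a minimal sketch of Belady hit counting for a single fully associative cache; it is an illustration under that assumption, not GPGPU-Sim's actual code:

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Belady/MIN: on eviction, discard the resident line whose next use in
    // the trace is furthest in the future (or never used again).
    int belady_hits(const std::vector<uint64_t>& trace, std::size_t capacity) {
        // Precompute, for each access, the index of the next access to the same line.
        std::vector<std::size_t> next_use(trace.size(), SIZE_MAX);
        std::unordered_map<uint64_t, std::size_t> last_seen;
        for (std::size_t i = trace.size(); i-- > 0;) {
            auto it = last_seen.find(trace[i]);
            if (it != last_seen.end()) next_use[i] = it->second;
            last_seen[trace[i]] = i;
        }

        std::unordered_map<uint64_t, std::size_t> resident;  // line -> next use index
        int hits = 0;
        for (std::size_t i = 0; i < trace.size(); ++i) {
            const uint64_t line = trace[i];
            if (resident.count(line)) {
                ++hits;
            } else if (resident.size() == capacity) {
                // Evict the resident line reused furthest in the future.
                auto victim = resident.begin();
                for (auto it = resident.begin(); it != resident.end(); ++it)
                    if (it->second > victim->second) victim = it;
                resident.erase(victim);
            }
            resident[line] = next_use[i];  // insert or refresh the next-use time
        }
        return hits;
    }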
Performance Results
 
Also compared against:
A two-level (2-LVL) scheduler: similar to GTO performance
A profile-based oracle scheduler: application and input data dependent

CCWS captures 86% of the oracle scheduler's performance.
On a variety of cache-insensitive benchmarks there is no performance degradation.

[Chart: harmonic-mean speedup over the highly cache-sensitive benchmarks for LRR, GTO, and CCWS.]
Cache Miss Rate
 
CCWS has fewer cache misses than the other schedulers, even when those schedulers use optimal replacement.

Full sensitivity study in the paper.

[Chart: MPKI over the highly cache-sensitive benchmarks for LRR, GTO, CCWS, and the optimally replaced variants.]
Related Work
 
Wavefront Scheduling
 
Georgia Tech - GPGPU Workshop 2010
UBC - HPCA 2011
UT Austin - MICRO 2011
UT Austin/NVIDIA/UIUC/Virginia - ISCA 2011

OS-Level Scheduling

SFU - ASPLOS 2010
Intel/MIT - ASPLOS 2012
Conclusion
 
Different approach to fine-grained cache management
Good for power and performance
High-level insight not tied to the specifics of a GPU
Any system with many threads sharing a cache can
potentially benefit
Questions?
