GPU Programming Models and Execution Architecture

Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke

Explore the world of GPU programming with insights into GPU architecture, programming models, and execution models. Discover the evolution of GPUs and their importance in graphics engines and high-performance computing, as discussed by experts from the University of Michigan.




Presentation Transcript


  1. Sponge: Portable Stream Programming on Graphics Engines
  Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
  University of Michigan, Electrical Engineering and Computer Science

  2. Why GPUs?
  - Every mobile and desktop system will have one
  - Affordable and high performance
  - Over-provisioned
  - Programmable
  [Images: Sony PlayStation Phone; chart of theoretical GFLOP/s from 2002 to 2011 comparing NVIDIA GPUs (GeForce 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) against Intel CPUs]

  3. GPU Architecture
  [Figure: a CPU launches Kernel 1 and then Kernel 2 over time onto a GPU with streaming multiprocessors SM 0 through SM 29; each SM contains eight scalar processors (0-7), registers, and shared memory, and the SMs reach global (device) memory through an interconnection network]

  4. GPU Programming Model
  - Threads are grouped into blocks, and blocks into a grid; all the threads run one kernel
  - Registers are private to each thread; registers spill to per-thread local memory
  - Shared memory is shared between the threads of a block
  - Global memory is shared between all blocks
  [Figure: CUDA variable scopes: int RegisterVar (per-thread register), int LocalVarArray[10] (per-thread local memory), __shared__ int SharedVar (per-block shared memory), and a per-application device global variable; grids (Grid 0, Grid 1) execute in sequence]
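To make these scopes concrete, here is a minimal CUDA sketch (not taken from the slides; the kernel and variable names are illustrative) that touches each level of the hierarchy: a per-thread register, per-block shared memory, and per-application global memory.

    __device__ int launchCount;                   // per-application device (global) memory
                                                  // (declared only to illustrate this scope)

    // Multiply n floats by `factor`. Launch with up to 256 threads per block, e.g.
    //   scaleKernel<<<numBlocks, 256>>>(d_in, d_out, n, 2.0f);
    __global__ void scaleKernel(const float *in, float *out, int n, float factor)
    {
        __shared__ float tile[256];               // per-block shared memory
        int tid = threadIdx.x;                    // thread index within its block
        int gid = blockIdx.x * blockDim.x + tid;  // global index within the grid

        if (gid < n)
            tile[tid] = in[gid];                  // stage the input through shared memory
        __syncthreads();                          // every thread of the block sees the tile

        if (gid < n) {
            float r = tile[tid] * factor;         // r lives in a per-thread register
            out[gid] = r;                         // the result goes back to global memory
        }
    }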

  5. GPU Execution Model
  [Figure: Grid 1 is distributed across the streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, ...), each with its own scalar processors, registers, and shared memory]

  6. GPU Execution Model
  [Figure: blocks 0-3 are scheduled onto SM 0; the 64 threads of a block (ThreadId 0-63) are grouped into warps of 32 (Warp 0: threads 0-31, Warp 1: threads 32-63) and share the SM's scalar processors, registers, and shared memory]

  7. GPU Programming Challenges
  - Restructuring data efficiently for the complex memory hierarchy: global memory, shared memory, registers
  - Partitioning work between the CPU and the GPU
  - Lack of portability between different generations of GPU: registers, active warps, size of global memory, size of shared memory
  - The variation will only grow: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources
  [Chart: execution time (ms) versus number of registers per thread (8-64) for code optimized for the GeForce GTX 285 and for the GeForce 8400 GS, spanning high-performance, desktop, and mobile parts]
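As a rough worked example of why the register budget alone breaks portability (the register-file sizes below are the standard figures for these GPU generations, not numbers from the slides): an SM on a GeForce GTX 285 (compute capability 1.3) has a 16,384-entry register file, so a kernel that uses 32 registers per thread can keep 16384 / 32 = 512 threads resident per SM, while an SM on a GeForce 8400 GS (compute capability 1.1) has only 8,192 registers, so the same kernel fits 8192 / 32 = 256 threads; a thread count tuned for one card therefore over- or under-subscribes the other.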

  8. Nonlinear Optimization Space
  - SAD optimization space: 908 configurations [Ryoo, CGO '08]
  - We need a higher level of abstraction!

  9. Goals
  - Write-once parallel software
  - Free the programmer from low-level details
  [Figure: a single parallel specification retargeted to shared-memory processors (C + Pthreads), SIMD engines (C + intrinsics), FPGAs (Verilog/VHDL), and GPUs (CUDA/OpenCL)]

  10. Streaming
  - Higher level of abstraction
  - Decouples computation and memory accesses
  - Coarse-grained exposed parallelism, exposed communication
  - Programmers can focus on the algorithms instead of low-level details
  - Streaming actors use buffers to communicate
  - A lot of recent work extends the portability of streaming applications
  [Figure: an example stream graph in which Actor 1 feeds a splitter, Actors 2-5 run in parallel, and a joiner merges their outputs into Actor 6]

  11. Sponge
  - Generates optimized CUDA for a wide variety of GPU targets
  - Performs an array of optimizations on stream graphs: reorganization and classification, shared/global memory selection, memory layout, helper threads, bank-conflict resolution, graph restructuring, software prefetching, register optimization, and loop unrolling
  - Optimizes and ports across different GPU generations
  - Utilizes the memory hierarchy (registers, shared memory, coalescing)
  - Efficiently utilizes the streaming cores

  12. GPU Performance Model
  - Memory-bound kernels: total time ≈ memory time
  - Computation-bound kernels: total time ≈ computation time
  [Figure: execution timelines interleaving memory instructions (M0-M7) and computation instructions (C0-C7) for the two cases]

  13. Actor Classification
  - High-Traffic actors (HiT): a large number of memory accesses per actor; fewer threads fit once shared memory is used, so using shared memory underutilizes the processors
  - Low-Traffic actors (LoT): fewer memory accesses per actor; more threads fit, so using shared memory increases performance

  14. Global Memory Accesses
  - Notation: A[i,j] is an actor A with i pops and j pushes
  - Global memory has a large access latency
  - Mapping A[4,4] directly onto threads does not access the words in sequence, so there is no coalescing
  [Figure: threads 0-3 each read the four consecutive elements of their own input segment straight from global memory, producing strided, uncoalesced accesses]
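A minimal CUDA sketch of that naive mapping (illustrative identifiers, not Sponge's generated code): one thread runs one firing of A[4,4] directly out of global memory, so in each loop iteration the threads of a warp read words that are four elements apart and the accesses cannot be coalesced.

    // Naive mapping of A[4,4]: thread t consumes elements 4t..4t+3 of the input
    // stream straight from global memory and produces elements 4t..4t+3 of the
    // output stream. Consecutive threads touch addresses four words apart.
    __global__ void actorA_naive(const float *in, float *out)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per filter firing
        float pop[4];

        for (int i = 0; i < 4; ++i)
            pop[i] = in[4 * t + i];                      // strided across the warp: uncoalesced

        for (int i = 0; i < 4; ++i)
            out[4 * t + i] = 2.0f * pop[i];              // the actor's work; uncoalesced writes
    }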

  15. Shared Memory
  - First bring the data into shared memory with coalescing: each filter brings data for the other filters, which satisfies the coalescing constraints
  - After the data is in shared memory, each filter accesses its own portion
  - Improves bandwidth and performance
  [Figure: four threads cooperatively copy elements 0-15 from global to shared memory (Global to Shared), each A[4,4] instance then works on its own slice, and the results are copied back with coalesced writes (Shared to Global)]
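A sketch of the staged version (again illustrative rather than Sponge's actual output): the block first copies its whole input segment into shared memory with consecutive threads touching consecutive words, each thread then works on its own slice of shared memory, and the outputs are written back the same coalesced way.

    #define POP  4   // inputs consumed per firing
    #define PUSH 4   // outputs produced per firing

    // Launch with: actorA_shared<<<blocks, threads, POP * threads * sizeof(float)>>>(in, out);
    __global__ void actorA_shared(const float *in, float *out)
    {
        extern __shared__ float buf[];                 // POP * blockDim.x floats
        int tid  = threadIdx.x;
        int base = blockIdx.x * blockDim.x * POP;      // this block's segment of the stream

        // Global to Shared: consecutive threads read consecutive words (coalesced);
        // each thread fetches data that other threads will consume.
        for (int i = 0; i < POP; ++i)
            buf[i * blockDim.x + tid] = in[base + i * blockDim.x + tid];
        __syncthreads();

        // Work: each filter instance reads its own POP elements from shared memory.
        // (The tid * POP stride can cause bank conflicts; see slides 27-28.)
        float result[PUSH];
        for (int i = 0; i < POP; ++i)
            result[i] = 2.0f * buf[tid * POP + i];
        __syncthreads();

        // Each thread writes its results back into its own slice of the buffer...
        for (int i = 0; i < PUSH; ++i)
            buf[tid * PUSH + i] = result[i];
        __syncthreads();

        // ...Shared to Global: and the block stores the buffer with coalesced writes.
        for (int i = 0; i < PUSH; ++i)
            out[base + i * blockDim.x + tid] = buf[i * blockDim.x + tid];
    }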

  16. Using Shared Memory
  - Shared memory is 100x faster than global memory
  - Coalesce all global memory accesses
  - The number of threads is limited by the size of the shared memory
  [Pseudocode, before and after: the baseline kernel loops over iterations and does its Work directly on global memory; the transformed kernel copies the pops from global to shared memory, synchronizes, does the Work, synchronizes again, and copies the pushes from shared memory back to global memory]
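The transformed kernel in the right-hand pseudocode box corresponds roughly to the CUDA sketch below (POP, PUSH, and ITERS are illustrative constants, not values from the paper): each thread runs several firings of the actor, and every firing stages its pops and pushes through shared memory with barriers in between.

    #define POP   4
    #define PUSH  4
    #define ITERS 8   // firings of the actor per thread (illustrative)

    __global__ void actor_shared_iters(const float *in, float *out)
    {
        __shared__ float buf[POP * 256];                 // assumes blockDim.x <= 256
        int tid = threadIdx.x;

        for (int it = 0; it < ITERS; ++it) {             // "For number of iterations"
            int base = (blockIdx.x * ITERS + it) * blockDim.x * POP;

            for (int i = 0; i < POP; ++i)                // "For number of pops: Shared <- Global"
                buf[i * blockDim.x + tid] = in[base + i * blockDim.x + tid];
            __syncthreads();

            for (int i = 0; i < PUSH; ++i)               // "Work": a simple scaling actor here,
                buf[tid * PUSH + i] = 2.0f * buf[tid * POP + i];   // in place since POP == PUSH
            __syncthreads();

            for (int i = 0; i < PUSH; ++i)               // "For number of pushes: Global <- Shared"
                out[base + i * blockDim.x + tid] = buf[i * blockDim.x + tid];
            __syncthreads();                             // buf is reused in the next iteration
        }
    }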

  17. Helper Threads
  - Shared memory limits the number of threads
  - Underutilized processors can fetch data
  - All the helper threads are in one warp, so there is no control-flow divergence
  [Pseudocode: the kernel is launched with Threads + Helper threads; the helper threads perform the global-to-shared and shared-to-global copies while the worker threads perform the Work, separated by syncthreads]
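A rough CUDA sketch of this structure (WORKERS and HELPERS are assumed values for illustration, not numbers from the paper): the block is launched with WORKERS + HELPERS threads, the final warp only moves data between global and shared memory, and the other warps only compute.

    #define POP     4
    #define PUSH    4
    #define WORKERS 224          // threads that run the actor (illustrative)
    #define HELPERS 32           // one extra warp that only moves data (illustrative)

    // Launch with WORKERS + HELPERS (= 256) threads per block.
    __global__ void actor_with_helpers(const float *in, float *out, int iters)
    {
        __shared__ float buf[POP * WORKERS];
        int tid     = threadIdx.x;
        bool helper = (tid >= WORKERS);                  // the last warp is the helper warp
        int hid     = tid - WORKERS;                     // helper index 0..HELPERS-1

        for (int it = 0; it < iters; ++it) {
            int base = (blockIdx.x * iters + it) * WORKERS * POP;

            if (helper)                                  // "If helper threads: Shared <- Global"
                for (int k = hid; k < POP * WORKERS; k += HELPERS)
                    buf[k] = in[base + k];
            __syncthreads();

            if (!helper)                                 // "If worker threads: Work"
                for (int i = 0; i < PUSH; ++i)
                    buf[tid * PUSH + i] = 2.0f * buf[tid * POP + i];
            __syncthreads();

            if (helper)                                  // "If helper threads: Global <- Shared"
                for (int k = hid; k < PUSH * WORKERS; k += HELPERS)
                    out[base + k] = buf[k];
            __syncthreads();
        }
    }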

  18. Data Prefetch
  - Better register utilization
  - Data for iteration i+1 is moved into registers while data for iteration i is moved from registers to shared memory
  - Allows the GPU to overlap instructions
  [Pseudocode: the first iteration's pops are loaded from global memory into registers before the loop; inside the loop the registers are spilled to shared memory, the next iteration's pops are fetched into registers (unless it is the last iteration), the Work runs, and the pushes go back to global memory]
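A sketch of the prefetching pattern (illustrative identifiers and constants): the pops of iteration i+1 are fetched from global memory into registers while the data for iteration i, already in registers, is deposited into shared memory and consumed.

    #define POP  4
    #define PUSH 4

    __global__ void actor_prefetch(const float *in, float *out, int iters)
    {
        __shared__ float buf[POP * 256];                  // assumes blockDim.x <= 256
        int tid  = threadIdx.x;
        int base = blockIdx.x * iters * blockDim.x * POP; // this block's input region
        float regs[POP];                                  // prefetch buffer held in registers

        // Prologue: fetch the pops of iteration 0 into registers.
        for (int i = 0; i < POP; ++i)
            regs[i] = in[base + i * blockDim.x + tid];

        for (int it = 0; it < iters; ++it) {
            for (int i = 0; i < POP; ++i)                // registers -> shared memory
                buf[i * blockDim.x + tid] = regs[i];
            __syncthreads();

            if (it + 1 < iters) {                        // prefetch iteration it + 1;
                int next = base + (it + 1) * blockDim.x * POP;   // these loads can overlap
                for (int i = 0; i < POP; ++i)                    // with the work below
                    regs[i] = in[next + i * blockDim.x + tid];
            }

            for (int i = 0; i < PUSH; ++i)               // work, in place (POP == PUSH)
                buf[tid * PUSH + i] = 2.0f * buf[tid * POP + i];
            __syncthreads();

            int obase = base + it * blockDim.x * PUSH;   // coalesced push back to global
            for (int i = 0; i < PUSH; ++i)
                out[obase + i * blockDim.x + tid] = buf[i * blockDim.x + tid];
            __syncthreads();                             // buf is overwritten next iteration
        }
    }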

  19. Loop Unrolling
  - Similar to traditional unrolling
  - Allows the GPU to overlap instructions
  - Better register utilization
  - Less loop-control overhead
  - Can also be applied to the memory-transfer loops
  [Pseudocode: the iteration loop runs iterations/2 times with two copies of the pop-work-push body per trip]
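A short sketch of the unrolled loop (illustrative; assumes an even iteration count): the iteration loop advances by two and contains two copies of the pop-work-push body, and #pragma unroll can likewise be applied to the small transfer loops.

    #define POP  4
    #define PUSH 4

    // One firing: stage POP inputs through shared memory, scale them, write PUSH outputs.
    __device__ void fire(const float *in, float *out, float *buf, int base, int tid)
    {
        #pragma unroll
        for (int i = 0; i < POP; ++i)                    // Shared <- Global (coalesced)
            buf[i * blockDim.x + tid] = in[base + i * blockDim.x + tid];
        __syncthreads();
        #pragma unroll
        for (int i = 0; i < PUSH; ++i)                   // work, in place (POP == PUSH)
            buf[tid * PUSH + i] = 2.0f * buf[tid * POP + i];
        __syncthreads();
        #pragma unroll
        for (int i = 0; i < PUSH; ++i)                   // Global <- Shared (coalesced)
            out[base + i * blockDim.x + tid] = buf[i * blockDim.x + tid];
        __syncthreads();
    }

    __global__ void actor_unrolled(const float *in, float *out, int iters)
    {
        __shared__ float buf[POP * 256];                 // assumes blockDim.x <= 256
        int tid    = threadIdx.x;
        int stride = blockDim.x * POP;
        int base   = blockIdx.x * iters * stride;

        for (int it = 0; it < iters; it += 2) {          // "For number of iterations / 2"
            fire(in, out, buf, base + it * stride,       tid);   // first copy of the body
            fire(in, out, buf, base + (it + 1) * stride, tid);   // second copy of the body
        }
    }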

  20. Methodology
  - Set of benchmarks from the StreamIt suite
  - 3 GHz Intel Core 2 Duo CPU with 6 GB RAM
  - NVIDIA GeForce GTX 285:
    Stream processors: 240
    Processor clock: 1476 MHz
    Memory configuration: 2 GB DDR3
    Memory bandwidth: 159.0 GB/s

  21. Result (Baseline CPU)
  [Chart: speedup (x) over the CPU baseline for each benchmark, measured with and without CPU-GPU transfer time; the highlighted speedups are 10x with transfer and 24x without transfer]

  22. Result (Baseline GPU)
  [Chart: speedup (x) over the baseline GPU code, broken down by optimization (Shared/Global, Prefetch/Unrolling, Helper Threads, Graph Restructuring); the annotated contributions are 64%, 3%, 16%, and 16%]

  23. Conclusion
  - Future systems will be heterogeneous, and GPUs are an important part of such systems
  - Programming complexity is a significant challenge
  - Sponge automatically creates optimized CUDA code for a wide variety of GPU targets
  - It provides portability by performing an array of optimizations on stream graphs

  24. Questions

  25. Spatial Intermediate Representation
  - StreamIt
  - Main constructs:
    Filter: encapsulates computation
    Pipeline: expresses pipeline parallelism
    Splitjoin: expresses task-level parallelism
    (other constructs are not relevant here)
  - Exposes different types of parallelism
  - Composable, hierarchical
  - Stateful and stateless filters
  [Figure: example filter, pipeline, and splitjoin graph shapes]

  26. Nonlinear Optimization Space
  - SAD optimization space: 908 configurations [Ryoo, CGO '08]

  27. Bank Conflict
  - Each thread reads shared memory with a stride: data = buffer[BaseAddress + s * ThreadId]
  - When the stride s shares a factor with the number of banks, threads running A[8,8] hit the same bank and their accesses are serialized
  [Figure: threads 0-2 access shared memory with a conflicting stride and several of their requests land in the same bank]

  28. Removing Bank Conflict
  - If GCD(number of banks, s) is 1, there will be no bank conflict
  - Since the number of banks is a power of two, s must be odd
  - data = buffer[BaseAddress + s * ThreadId]
  [Figure: the same threads access shared memory with an odd stride s and each request maps to a different bank]
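A small CUDA sketch of the access pattern on these two slides (illustrative; assumes the 16 shared-memory banks of this GPU generation): an even stride such as s = 8 sends several threads of a half-warp to the same bank, while any odd stride maps the 16 threads of a half-warp to 16 distinct banks.

    // data = buffer[BaseAddress + s * ThreadId]
    // On this generation a 32-bit word lives in bank (word index) % 16, and the 16
    // threads of a half-warp access shared memory together.
    __global__ void strided_read(const float *gin, float *gout, int s)
    {
        __shared__ float buffer[1024];
        int tid = threadIdx.x;

        // Fill the buffer cooperatively (stride-1: coalesced and conflict-free).
        for (int k = tid; k < 1024; k += blockDim.x)
            buffer[k] = gin[k];                          // gin holds at least 1024 floats
        __syncthreads();

        // Strided read. With s = 8, threads 0 and 2 both hit bank 0, threads 1 and 3
        // both hit bank 8, and so on, so the accesses serialize. With any odd s,
        // gcd(16, s) == 1 and the 16 threads hit 16 different banks.
        float data = buffer[0 /* BaseAddress */ + s * tid];   // keep s * (blockDim.x - 1) < 1024

        gout[blockIdx.x * blockDim.x + tid] = data;      // store so the read is not optimized away
    }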
