Understanding GPU Programming Models and Execution Architecture
An overview of GPU architecture, programming models, and execution models, and of Sponge, a compiler from the University of Michigan that generates optimized, portable CUDA code from streaming programs.
Presentation Transcript
Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
Why GPUs?
Every mobile and desktop system will have one, from the Sony PlayStation Phone to high-end desktops. GPUs are affordable, high performance, over-provisioned, and programmable.
[Figure: theoretical GFLOP/s of NVIDIA GPUs (GeForce 6800 Ultra through GeForce GTX 480) versus Intel CPUs, 2002-2011.]
GPU Architecture
[Figure: the GPU consists of streaming multiprocessors (SM 0 through SM 29), each with its own registers and shared memory, connected through an interconnection network to global (device) memory; the CPU launches Kernel 1 and Kernel 2 onto the GPU over time.]
GPU Programming Model
All the threads of a grid run one kernel; threads are grouped into blocks, and blocks into grids (Grid 0, Grid 1 form the grid sequence of an application).
Registers are private to each thread (int RegisterVar) and spill to per-thread local memory (int LocalVarArray[10]).
Shared memory is shared between the threads of a block (__shared__ int SharedVar).
Global memory is shared between all blocks of the application (__device__ int GlobalVar).
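To make the mapping concrete, here is a minimal CUDA sketch (not from the slides; the kernel and output array are illustrative) that declares one variable in each memory space:

```cuda
// Minimal sketch of the CUDA memory spaces described above; variable names mirror the slide.
__device__ int GlobalVar;                  // global memory: visible to all blocks, persists across kernels

__global__ void memory_spaces_demo(int *out)
{
    int RegisterVar = threadIdx.x;         // register: private to this thread
    int LocalVarArray[10];                 // local memory: per-thread array that may spill out of registers
    __shared__ int SharedVar;              // shared memory: one copy per thread block

    if (threadIdx.x == 0)
        SharedVar = blockIdx.x;            // one thread of the block initializes the shared value
    __syncthreads();                       // make it visible to every thread in the block

    LocalVarArray[threadIdx.x % 10] = RegisterVar + SharedVar;
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        LocalVarArray[threadIdx.x % 10] + GlobalVar;
}
```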
GPU Execution Model
[Figure: the thread blocks of Grid 1 are distributed across the streaming multiprocessors SM 0 through SM 30, each of which has its own registers and shared memory.]
GPU Execution Model
[Figure: within an SM, the threads of a block are divided into warps of 32: in Block 0, running on SM 0, threads 0-31 form Warp 0 and threads 32-63 form Warp 1; Blocks 1-3 wait their turn, and all blocks on the SM share its registers and shared memory.]
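The index arithmetic below is a small sketch (the kernel name and output array are assumptions) of how a thread locates its block, warp, and global position under this execution model:

```cuda
// Illustrative sketch of the indexing implied by the execution model (warp size 32 assumed).
__global__ void where_am_i(int *warp_of_thread)
{
    int tid_in_block = threadIdx.x;                             // position within the block
    int warp_id      = tid_in_block / 32;                       // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
    int global_tid   = blockIdx.x * blockDim.x + tid_in_block;  // unique position across the whole grid

    warp_of_thread[global_tid] = warp_id;                       // record this thread's warp within its block
}
```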
GPU Programming Challenges
Data must be restructured to use the complex memory hierarchy (global memory, shared memory, registers) efficiently, and work must be partitioned between the CPU and the GPU. There is also a lack of portability between different generations of GPUs: the number of registers, the number of active warps, and the sizes of global and shared memory all differ, and they will vary even more with newer high-performance cards (e.g., NVIDIA's Fermi) and with mobile GPUs that have fewer resources.
[Figure: execution time (ms, 0-400) versus number of registers per thread (8 to 64) for code optimized for the GeForce 8400 GS versus code optimized for the GeForce GTX 285, comparing high-performance desktop and mobile parts.]
Nonlinear Optimization Space
The optimization space is nonlinear: the SAD kernel alone has 908 configurations [Ryoo, CGO '08]. We need a higher level of abstraction!
Goals
Write-once parallel software: free the programmer from low-level details. From a single parallel specification, generate code for shared-memory processors (C + Pthreads), SIMD engines (C + intrinsics), FPGAs (Verilog/VHDL), and GPUs (CUDA/OpenCL).
Streaming
Streaming provides a higher level of abstraction: it decouples computation from memory accesses and exposes coarse-grained parallelism and communication explicitly, so programmers can focus on the algorithm instead of low-level details. Streaming actors communicate through buffers; in the example graph, Actor 1 feeds a splitter, Actors 2-5 run in parallel, and a joiner feeds Actor 6. There has been a lot of recent work on extending the portability of streaming applications.
Sponge
Sponge generates optimized CUDA for a wide variety of GPU targets by performing an array of optimizations on stream graphs: reorganization and classification (graph restructuring, shared/global memory selection), memory layout optimizations (coalescing, bank conflict resolution), helper threads, software prefetching, register optimization, and loop unrolling. Together these optimizations utilize the memory hierarchy (registers, shared memory, coalescing), efficiently utilize the streaming cores, and handle optimizing and porting across different GPU generations.
GPU Performance Model
Kernels fall into two classes. In memory-bound kernels, the memory instructions (M) dominate: total time is determined by the memory accesses, and the computation (C) hides underneath them. In computation-bound kernels, the computation instructions dominate: memory accesses overlap with computation and total time is determined by the compute work.
[Figure: timelines interleaving memory instructions M0-M7 and computation instructions C0-C7 for the memory-bound and computation-bound cases.]
Actor Classification
High-traffic actors (HiT): a large number of memory accesses per actor, so fewer threads fit when shared memory is used; using shared memory underutilizes the processors.
Low-traffic actors (LoT): fewer memory accesses per actor, so more threads can run; using shared memory increases performance.
Global Memory Accesses
Global memory has a large access latency. An actor A[i,j] pops i elements and pushes j elements. If each thread walks its own chunk of the buffer, the threads of a warp do not access words in sequence and there is no coalescing.
[Figure: four threads each executing A[4,4]; thread 0 reads words 0-3, thread 1 reads 4-7, and so on, so a warp's simultaneous accesses are scattered across global memory and cannot be coalesced.]
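The sketch below contrasts the two access patterns (this is illustrative CUDA, not Sponge-generated code; POPS and the kernel names are assumptions for the A[4,4] example):

```cuda
#define POPS 4

// Uncoalesced: each thread walks its own contiguous chunk, so at any instant the threads
// of a warp touch addresses POPS words apart, which cannot be combined into one transaction.
__global__ void pops_uncoalesced(const float *in, float *out)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < POPS; ++k)
        acc += in[tid * POPS + k];
    out[tid] = acc;
}

// Coalesced: on every iteration consecutive threads read consecutive words,
// so each warp access covers one contiguous segment of global memory.
__global__ void pops_coalesced(const float *in, float *out)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc  = 0.0f;
    for (int k = 0; k < POPS; ++k)
        acc += in[k * stride + tid];
    out[tid] = acc;
}
```

The coalesced version assumes the buffer is laid out so that the k-th pop of every thread is stored contiguously; the next slides achieve the same effect without changing the buffer layout by staging data through shared memory.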
Shared Memory
[Figure: the threads of a block first cooperatively copy words 0-15 from global memory into shared memory using coalesced accesses (each thread copies every fourth word, so consecutive threads touch consecutive addresses); each thread then executes A[4,4] out of shared memory, and the results are cooperatively copied back from shared to global memory, again with coalesced accesses.]
Using Shared Memory
Shared memory is roughly 100x faster than global memory. Instead of each thread reading and writing global memory directly inside its work loop, the generated kernel stages data through shared memory:

  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      For number of pops: Global -> Shared
      syncthreads
      Work
      syncthreads
      For number of pushes: Shared -> Global
  End Kernel

This coalesces all global memory accesses, but the number of threads is limited by the size of the shared memory.
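A minimal CUDA sketch of this staging pattern follows (POPS, the buffer names, and the reduction-style "work" are assumptions, not Sponge's generated code):

```cuda
#define POPS 4

__global__ void actor_shared(const float *in, float *out, int iters)
{
    extern __shared__ float buf[];                     // POPS * blockDim.x floats per block
    int tid = threadIdx.x;

    for (int it = 0; it < iters; ++it) {
        int base = (it * gridDim.x + blockIdx.x) * blockDim.x * POPS;

        // Pops: cooperative, coalesced copy from global into shared memory.
        for (int k = tid; k < POPS * blockDim.x; k += blockDim.x)
            buf[k] = in[base + k];
        __syncthreads();

        // Work: each thread consumes its own POPS contiguous words out of shared memory.
        float acc = 0.0f;
        for (int k = 0; k < POPS; ++k)
            acc += buf[tid * POPS + k];
        __syncthreads();                               // all reads done before buf is reused

        // Push: one result per thread; consecutive threads write consecutive words (coalesced).
        out[(it * gridDim.x + blockIdx.x) * blockDim.x + tid] = acc;
    }
}
// Launch: actor_shared<<<blocks, threads, POPS * threads * sizeof(float)>>>(in, out, iters);
```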
Helper Threads
Shared memory limits the number of worker threads, leaving processors underutilized; the idle capacity can run extra helper threads that fetch data. The kernel is launched with <<<Blocks, Threads + Helpers>>>, and on each iteration the helper threads copy the pops from global to shared memory, all threads synchronize, the worker threads do the work, they synchronize again, and the helper threads copy the pushes back from shared to global memory. All the helper threads are placed in one warp, so there is no control-flow divergence.
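An illustrative CUDA sketch of this pattern follows (WORKERS, HELPERS, POPS, and the kernel itself are assumptions, not Sponge's generated code):

```cuda
#define POPS    4
#define WORKERS 128                                   // worker threads per block
#define HELPERS 32                                    // one extra warp of helpers per block

__global__ void actor_helpers(const float *in, float *out, int iters)
{
    __shared__ float pops[POPS * WORKERS];            // staged input
    __shared__ float pushes[WORKERS];                 // staged output
    int  tid       = threadIdx.x;
    bool is_helper = (tid >= WORKERS);                // the last warp acts as helpers

    for (int it = 0; it < iters; ++it) {
        int base = (it * gridDim.x + blockIdx.x) * WORKERS * POPS;

        if (is_helper)                                // helpers fetch the workers' input
            for (int k = tid - WORKERS; k < POPS * WORKERS; k += HELPERS)
                pops[k] = in[base + k];
        __syncthreads();

        if (!is_helper) {                             // workers compute out of shared memory
            float acc = 0.0f;
            for (int k = 0; k < POPS; ++k)
                acc += pops[tid * POPS + k];
            pushes[tid] = acc;
        }
        __syncthreads();

        if (is_helper)                                // helpers drain the results to global memory
            for (int k = tid - WORKERS; k < WORKERS; k += HELPERS)
                out[(it * gridDim.x + blockIdx.x) * WORKERS + k] = pushes[k];
        __syncthreads();                              // pushes drained before the next iteration
    }
}
// Launch: actor_helpers<<<blocks, WORKERS + HELPERS>>>(in, out, iters);
```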
Data Prefetch
Prefetching improves register utilization and allows the GPU to overlap instructions. Before the main loop, the pops for the first iteration are loaded from global memory into registers. In each iteration, the data for iteration i is moved from registers into shared memory and, if it is not the last iteration, the data for iteration i+1 is loaded from global memory into registers; the threads then synchronize, do the work, and write the pushes back from shared to global memory.
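A sketch of the software-prefetch pattern in CUDA (POPS, the buffer names, and the kernel are assumptions):

```cuda
#define POPS 4

__global__ void actor_prefetch(const float *in, float *out, int iters)
{
    extern __shared__ float buf[];                     // POPS * blockDim.x floats
    int   tid = threadIdx.x;
    float regs[POPS];                                  // prefetch buffer kept in registers

    // Prime the registers with iteration 0's pops (coalesced loads).
    for (int k = 0; k < POPS; ++k)
        regs[k] = in[blockIdx.x * blockDim.x * POPS + k * blockDim.x + tid];

    for (int it = 0; it < iters; ++it) {
        // Move iteration i's data from registers into shared memory ...
        for (int k = 0; k < POPS; ++k)
            buf[k * blockDim.x + tid] = regs[k];

        // ... and, if there is a next iteration, start loading its pops into registers,
        // overlapping these global loads with the work below.
        if (it + 1 < iters)
            for (int k = 0; k < POPS; ++k)
                regs[k] = in[((it + 1) * gridDim.x + blockIdx.x) * blockDim.x * POPS
                             + k * blockDim.x + tid];
        __syncthreads();

        // Work out of shared memory, then push the result back to global memory.
        float acc = 0.0f;
        for (int k = 0; k < POPS; ++k)
            acc += buf[tid * POPS + k];
        __syncthreads();
        out[(it * gridDim.x + blockIdx.x) * blockDim.x + tid] = acc;
    }
}
```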
Loop Unrolling
Similar to traditional loop unrolling, the iteration loop is unrolled (for example by a factor of two), so each trip through the loop performs the pop, work, and push phases twice. This allows the GPU to overlap instructions, improves register utilization, and reduces loop-control overhead; the same unrolling can also be applied to the memory-transfer loops.
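A short CUDA sketch of unrolling by two (the shared-memory staging from the earlier slides is elided for brevity; the kernel name and POPS are assumptions, and iters is assumed to be even):

```cuda
#define POPS 4

__global__ void actor_unrolled(const float *in, float *out, int iters)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int it = 0; it < iters; it += 2) {            // two actor firings per loop trip
        float acc0 = 0.0f, acc1 = 0.0f;
        #pragma unroll                                  // the transfer/work loops can be unrolled too
        for (int k = 0; k < POPS; ++k) {
            acc0 += in[(it * stride + tid) * POPS + k];         // pops for firing it
            acc1 += in[((it + 1) * stride + tid) * POPS + k];   // pops for firing it+1
        }
        out[it * stride + tid]       = acc0;            // push for firing it
        out[(it + 1) * stride + tid] = acc1;            // push for firing it+1
    }
}
```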
Methodology
Benchmarks are taken from the StreamIt suite. The CPU baseline is a 3 GHz Intel Core 2 Duo with 6 GB of RAM. The GPU is an NVIDIA GeForce GTX 285: 240 stream processors, 1476 MHz processor clock, 2 GB DDR3 memory, 159.0 GB/s memory bandwidth.
Results (baseline: CPU)
[Figure: speedup over the CPU baseline for each benchmark, measured with and without CPU-GPU transfer time; the y-axis runs from 0 to 50x, and the chart labels average speedups of 10x and 24x.]
Results (baseline: GPU)
[Figure: speedup over unoptimized GPU code as the Sponge optimizations are applied (shared/global memory selection, prefetch/unrolling, helper threads, graph restructuring); the y-axis runs from 0 to 7x, with contribution labels of 64%, 16%, 16%, and 3% on the chart.]
Conclusion
Future systems will be heterogeneous, and GPUs are an important part of such systems, but programming complexity is a significant challenge. Sponge automatically creates optimized CUDA code for a wide variety of GPU targets and provides portability by performing an array of optimizations on stream graphs.
Questions
Spatial Intermediate Representation
Sponge's input language is StreamIt. Its main constructs are the filter, which encapsulates computation; the pipeline, which expresses pipeline parallelism; and the splitjoin, which expresses task-level parallelism (other constructs are not relevant here). The representation exposes different types of parallelism, is composable and hierarchical, and supports both stateful and stateless filters.
Bank Conflict
Threads read shared memory with data = buffer[BaseAddress + s * ThreadId]. When the stride s maps several threads of a warp onto the same shared-memory bank, their accesses conflict and are serialized.
[Figure: three threads each executing A[8,8] reading from shared memory with this pattern; with the chosen stride, multiple threads hit the same bank and conflict.]
Removing Bank Conflict
If GCD(number of banks, s) is 1, there is no bank conflict; since the number of banks is a power of two, s must be odd.
[Figure: the same three A[8,8] threads accessing data = buffer[BaseAddress + s * ThreadId] with an odd stride s hit distinct banks, so no conflict occurs.]
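The sketch below illustrates the stride rule (it assumes 16 shared-memory banks, as on GTX 285-class hardware; the kernel and buffer sizes are assumptions, not from the slides):

```cuda
#define THREADS 16

__global__ void bank_conflict_demo(float *out)
{
    __shared__ float buffer[THREADS * 9 + 1];          // large enough for stride-9 accesses
    int tid = threadIdx.x;

    for (int k = tid; k < THREADS * 9 + 1; k += blockDim.x)
        buffer[k] = (float)k;                          // fill the buffer cooperatively
    __syncthreads();

    // Stride s = 8: GCD(16, 8) = 8, so e.g. threads 0 and 2 both hit bank 0
    // and the warp's accesses are serialized.
    float conflicted = buffer[8 * tid];

    // Stride s = 9 (odd): GCD(16, 9) = 1, so all 16 threads hit distinct banks.
    float conflict_free = buffer[9 * tid];

    out[tid] = conflicted + conflict_free;
}
// Launch: bank_conflict_demo<<<1, THREADS>>>(out);
```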