Collaborative Speculative Loop Execution on GPU and CPU


A research study conducted at the University of Michigan explores the challenges of general-purpose computing on GPUs (GPGPU), the limitations of massive data-parallelism, and ways to improve GPU utilization. The study covers Amdahl's Law, the motivation for more generalization, and Paragon's speculative execution with conflict resolution. By handling non-linear array accesses, indirect array accesses, and loops whose carried dependencies cannot be verified at compile time, the aim is to make GPUs more versatile and efficient across a wider range of computing tasks.

  • Research
  • GPU
  • CPU
  • University of Michigan
  • Parallelism




Presentation Transcript


  1. Paragon: Collaborative Speculative Loop Execution on GPU and CPU. Mehrzad Samadi¹, Amir Hormati², Janghaeng Lee¹, and Scott Mahlke¹. ¹University of Michigan, Ann Arbor; ²Microsoft Research. University of Michigan, Electrical Engineering and Computer Science.

  2. Amdahl's Law. GPGPU may deliver a 100x speedup, but only on the GPU-executable portion of a program. If the other 50% of execution time has no GPU utilization, even a 1000x speedup on the GPU-executable half does not bring more than 2x in overall execution time.
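
The slide's arithmetic can be checked directly with Amdahl's Law; this minimal sketch uses the slide's own numbers (a 50% GPU-executable fraction and 100x/1000x acceleration factors):

```c
/* Amdahl's Law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor s and the remaining (1 - p) is unchanged. */
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

/* With half the program GPU-executable (p = 0.5):
 *   amdahl(0.5, 100.0)  is about 1.98
 *   amdahl(0.5, 1000.0) is about 1.998, still under 2x overall */
```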

  3. General-Purpose Computing on GPU. Limitations of massive data-parallelism: only linear array access is supported; no indirect array access, no pointers. This leaves GPUs underutilized, because GPUs are not that general. How can GPUs be more GENERAL?

  4. Motivation: more generalization to reduce sequential sections. Non-linear array access, indirect array access, and array access through pointers all result in no GPU utilization today, and their loop-carried dependencies are difficult for programmers to verify.

     Non-linear array access:

       for (x = 0; x < nx; x++)
         for (y = 0; y < ny; y++) {
           xr = x % squaresize[XUP];
           yr = y % squaresize[YUP];
           i = xr + yr;
           lattice[i].x = x;
           lattice[i].y = y;
         }

     Indirect array access:

       for (i = 1; i < m; i++)
         for (j = iaL[i]; j < iaL[i+1]-1; j++)
           x[i] = x[i] - aL[j] * x[jaL[j]];

     Array access through pointers:

       for (int i = 0; i < n; i++) {
         *c = *a + *b;
         a++; b++; c++;
       }

  5. Motivation: more generalization to reduce sequential sections (continued). Non-linear array access, indirect array access, and array access through pointers mean no GPU utilization, and loop-carried dependencies are difficult for programmers to verify.

  6. Paragon Execution. Loops classified as sequential (Loop 1, Loop 3) run on the CPU; DO-ALL loops run on the GPU. A possibly-parallel loop (Loop 2) runs speculatively on the GPU while the CPU executes it sequentially at the same time; when the GPU finishes, a conflict check validates the speculation before execution proceeds to Loop 3.
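
The scheme can be sketched in plain C. This is an illustrative stand-in, not Paragon's implementation: the real system launches a CUDA kernel and runs the sequential CPU version concurrently via pThreads, whereas this single-threaded sketch runs the two versions one after the other; `paragon_run`, `speculative_loop`, and `has_conflict` are invented names.

```c
#include <string.h>

#define N 8

/* Possibly-parallel loop: C[I[i]] = A[I[i]] + B[I[i]].
 * Whether iterations conflict depends on the runtime contents of I. */
void sequential_loop(const int *I, const int *A, const int *B, int *C) {
    for (int i = 0; i < N; i++) {
        int idx = I[i];
        C[idx] = A[idx] + B[idx];
    }
}

/* "GPU" version: writes into a shadow copy and logs every store.
 * In Paragon this is a CUDA kernel that uses AtomicInc on the log. */
void speculative_loop(const int *I, const int *A, const int *B,
                      int *C_shadow, int *wr_log) {
    for (int i = 0; i < N; i++) {
        int idx = I[i];
        C_shadow[idx] = A[idx] + B[idx];
        wr_log[idx]++;
    }
}

/* Conflict if any address was written more than once. */
int has_conflict(const int *wr_log) {
    for (int i = 0; i < N; i++)
        if (wr_log[i] > 1)
            return 1;
    return 0;
}

/* Driver: speculate, check, then commit or fall back. The real runtime
 * overlaps the sequential CPU run with the GPU run, so mis-speculation
 * costs no more than sequential execution. */
void paragon_run(const int *I, const int *A, const int *B, int *C) {
    int C_shadow[N] = {0};
    int wr_log[N] = {0};
    speculative_loop(I, A, B, C_shadow, wr_log);
    if (has_conflict(wr_log))
        sequential_loop(I, A, B, C);            /* discard GPU result */
    else
        memcpy(C, C_shadow, N * sizeof(int));   /* commit GPU result  */
}
```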

  7. Paragon Execution with Conflict. If the conflict check fails, the speculative GPU results for Loop 2 are discarded. Since the CPU has been executing Loop 2 sequentially in parallel, it simply continues, so mis-speculation costs no more than plain sequential execution, and Loop 3 proceeds from the CPU's result.

  8. Paragon Process Flow. Input: sequential code. Offline compilation performs loop classification and instrumentation. Runtime kernel management performs profiling, execution without profiling, and conflict detection via the conflict management unit. The system is implemented with CUDA + pThread.

  9. Offline Compilation: Loop Classification. Sequential loops, whose dependences are determined at compile time, are assigned to the CPU statically. DO-ALL loops are assigned to the GPU statically. Possible DO-ALL loops, whose dependences can only be determined at RUNTIME, are left for the runtime system.

  10. Runtime Profiling. The runtime spawns two threads on the CPU: a sequential-execution thread and a monitoring thread that keeps track of the memory footprint. A loop is marked sequential if there are many conflicts, or parallelizable, and assigned to both CPU and GPU, if there are no or few conflicts.

  11. Conflict Detection: Logging. Paragon uses lazy conflict detection: when executing a kernel, it allocates a write-set for stores and a read-set for loads.

     Original sequential loop:

       for (i = 0; i < N; i++) {
         idx = I[i];
         C[idx] = A[idx] + B[idx];
       }

     Instrumented GPU kernel:

       int  C_wr_log[sizeof_C];
       bool C_rd_log[sizeof_C];

       for (i = tid; i < N; i += ThreadCnt) {
         idx = I[i];
         C[idx] = A[idx] + B[idx];
         AtomicInc(C_wr_log[idx]);
       }
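
The slide's kernel only exercises the write-set, since C is never loaded. A loop that both reads and writes C through runtime index arrays shows both logs in action; this is an illustrative sketch, where `logged_loop`, `R`, and `W` are invented names and a plain increment stands in for the kernel's AtomicInc because the sketch is single-threaded:

```c
#include <stdbool.h>

/* One "GPU thread"'s share of an instrumented loop
 *     C[W[i]] = C[R[i]] + 1;
 * R and W are runtime index arrays, so the dependences are unknown
 * at compile time. */
void logged_loop(int tid, int thread_cnt, int n,
                 const int *R, const int *W, int *C,
                 bool *rd_log, int *wr_log) {
    for (int i = tid; i < n; i += thread_cnt) {
        rd_log[R[i]] = true;        /* read-set:  one bool per address  */
        C[W[i]] = C[R[i]] + 1;
        wr_log[W[i]]++;             /* write-set: one count per address */
    }
}
```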

  12. Conflict Detection: Checking. Checking is done in parallel following the kernel, with threads 1..N scanning C_rd_log and C_wr_log. An address conflicts if it was written more than once, or if it was both read and written at least once. [Slide figure: log entries with a single write and no read are marked OK; entries with a write count above one, or a read flag together with a write, are marked Conflict.]
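
The stated rule can be sketched as a scan over the logs. On the GPU each log entry would be checked by its own thread; `find_conflict` is an illustrative name for a sequential stand-in:

```c
#include <stdbool.h>

/* Post-kernel scan over the per-address logs for array C.
 * An address conflicts if it was written more than once, or if it
 * was both read and written at least once. */
bool find_conflict(const bool *rd_log, const int *wr_log, int n) {
    for (int i = 0; i < n; i++) {
        if (wr_log[i] > 1)
            return true;                  /* multiple writes  */
        if (rd_log[i] && wr_log[i] >= 1)
            return true;                  /* read and written */
    }
    return false;                         /* speculation can commit */
}
```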

  13. Experimental Setup. CPU: Intel Core i7. GPU: NVIDIA GTX 560 with 2GB GDDR5. Benchmarks: loops with pointers (FDTD, Seidel, Jacobi2D, GEMM, TMV) and loops with indirect/non-linear access (Saxpy, House, Ipvec, Ger, Gemver, SOR, FWD).

  14. Results for Pointer Access. [Chart: speedup over sequential for GPU-only, Paragon, 4-core CPU, and 2-core CPU on Fdtd, Seidel, Jacobi2d, Jacobi1d, Gemm, Tmv, and Average; y-axis up to 100x.] Paragon averages 36x, with a peak of 140x.

  15. Results for Indirect Access. [Chart: speedup over sequential for GPU-only, Paragon, 4-core CPU, and 2-core CPU on Saxpy, House, Ipvec, Ger, Gemver, SOR, FWD, and Average; y-axis up to 16x.] Paragon averages 3.8x.

  16. Conclusion. Paragon improves performance through greater GPU utilization: it speculatively runs possibly-parallel loops on the GPU while letting the CPU run them sequentially at the same time, so there is no performance penalty on mis-speculation, and conflict checking is done on the GPU.

  17. Q & A

  18. Overhead Breakdown. [Chart: Paragon overhead breakdown (%) split into checking kernel, write-log, and read-log components.]

  19. Overhead Breakdown (per benchmark). [Chart: Paragon overhead breakdown (%) for Fdtd, Seidel, Jacobi2d, Jacobi1d, Gemm, Tmv, and Average, split into checking kernel, write-log, and read-log components.]
