Understanding Parallelism in GPU Computing by Martin Kruliš

This presentation covers the different types of parallelism in GPU computing (task, data, and pipeline parallelism), identifies problems that are ill-suited to GPUs, and presents remedies such as iterative kernel execution, mapping irregular structures to regular grids, persistent threads, and CUDA dynamic parallelism. It also discusses CUDA graphs, the core CUDA libraries (libcu++, CUB, Thrust, cuBLAS, cuFFT, and others), and graphics interoperability with OpenGL and Direct3D.



Presentation Transcript


  1. Martin Kruliš (v1.4), 28. 11. 2023

  2. Different Types of Parallelism
     - Task parallelism: the problem is divided into tasks, which are processed independently.
     - Data parallelism: the same operation is performed over many data items.
     - Pipeline parallelism: data flow through a sequence (or oriented graph) of stages, which operate concurrently.
     - Other types of parallelism: event-driven, …

  3. Parallelism in GPUs
     - Data parallelism: the same kernel is executed by many threads; each thread processes one data item.
     - Limited task parallelism: multiple kernels can execute simultaneously (since Fermi), at most as many kernels as there are SMPs.
     - But we do not have:
       - any guarantee that two blocks/kernels will actually run concurrently,
       - efficient means of synchronization outside the block (yes, there are cooperative groups, but the synchronization cost skyrockets when crossing the block-wise boundary).

  4. Unsuitable Problems for GPUs
     - Processing irregular data structures (trees, graphs, …)
     - Regular structures with irregular processing workload (difficult simulations, iterative approximations)
     - Iterative tasks with explicit synchronization
     - Pipeline-oriented tasks with many simple stages

  5. Solutions
     - Iterative/consecutive kernel execution
       - Usually applicable only when there are no (or few) dependencies between subsequent kernels
       - The state (or most of it) is kept on the GPU
     - Mapping irregular structures to regular grids
       - May be too fine/coarse grained
       - Not always possible
     - 2-phase algorithms (see the sketch below)
       - The first phase determines the amount of work (items, …)
       - The second phase processes the tasks mapped by the first phase
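
     A minimal sketch of the 2-phase approach (not from the slides; the kernel names, the data-dependent count, and the per-item work are illustrative placeholders). The first kernel counts how many output items each input element produces; an exclusive prefix sum (e.g., with CUB or Thrust) turns the counts into offsets; the second kernel then processes the items at those offsets.

         #include <cuda_runtime.h>

         // Phase 1: each thread counts how many work items its input element generates.
         __global__ void countItems(const int *input, int *counts, int n) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n)
                 counts[i] = input[i] % 4;       // placeholder for the real, data-dependent count
         }

         // Phase 2: using offsets obtained by an exclusive prefix sum over counts,
         // each thread emits its items into a dense output buffer.
         __global__ void processItems(const int *input, const int *offsets,
                                      const int *counts, int *output, int n) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n) {
                 int base = offsets[i];
                 for (int k = 0; k < counts[i]; ++k)
                     output[base + k] = input[i] + k;   // placeholder for the real per-item work
             }
         }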

  6. Persistent Threads
     - A single kernel is launched with as many thread blocks as there are SMPs
     - Workload is generated and processed on the GPU
       - Using a global queue implemented with atomic operations (a sketch follows)
     - Possible problems
       - The kernel may be quite complex
       - There are no guarantees that two blocks will ever run concurrently
         - Possibility of creating a deadlock
       - The synchronization overhead is significant
         - But it might be lower than host-based synchronization
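
     A minimal sketch of the consumer side of such a global queue (not from the slides; the Task type, processTask, and the pre-filled queue are illustrative assumptions). As the slide warns, any waiting between blocks must be designed carefully, since concurrent execution of blocks is not guaranteed.

         #include <cuda_runtime.h>

         struct Task { int payload; };                       // hypothetical work descriptor
         __device__ void processTask(const Task &t) { /* ... */ }

         // Persistent kernel: launched with exactly as many blocks as SMPs; every thread
         // keeps claiming tasks from a global queue until the queue is exhausted.
         __global__ void persistentKernel(const Task *queue, int *head, int queueSize) {
             while (true) {
                 int idx = atomicAdd(head, 1);               // claim the next task index
                 if (idx >= queueSize)
                     break;                                  // no more work
                 processTask(queue[idx]);
             }
         }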

  7. (figure slide)

  8. CUDA Dynamic Parallelism
     - Introduced in CC 3.5 (Kepler)
     - GPU threads are allowed to launch new grids

  9. Dynamic Parallelism – Purpose
     - The device does not need to synchronize with the host to issue new work to the device
     - Irregular parallelism may be expressed more easily

  10. How It Works
     - Portions of the CUDA runtime are ported to the device
       - Kernel execution, device synchronization
       - Streams, events, and async. memory operations
     - Kernel launches are asynchronous
       - No guarantee the child kernel starts immediately
     - Synchronization points may cause a context switch
       - The context of an entire block has to be switched off an SMP
     - Block-wise locality of resources
       - Streams and events are valid within a thread block

  11. Example

     __global__ void child_launch(int *data) {
         data[threadIdx.x] = data[threadIdx.x] + 1;
     }

     __global__ void parent_launch(int *data) {
         data[threadIdx.x] = threadIdx.x;
         __syncthreads();
         if (threadIdx.x == 0) {
             child_launch<<< 1, 256 >>>(data);   // thread 0 invokes a grid of child threads
             cudaDeviceSynchronize();            // synchronization does not have to be invoked by all threads
         }
         __syncthreads();  // device synchronization does not synchronize threads in the block
     }

     void host_launch(int *data) {
         parent_launch<<< 1, 256 >>>(data);
     }

  12. Memory Consistency (figure slide)

  13. Depth Limits
     - Nesting depth (depth of the recursion)
     - Synchronization depth (deepest level where cudaDeviceSynchronize() is invoked)
     - Configured via cudaDeviceSetLimit() (a sketch follows):
       - cudaLimitDevRuntimeSyncDepth
       - cudaLimitDevRuntimePendingLaunchCount (number of pending grids)
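
     A minimal host-side sketch of setting these limits (not from the slides; the particular values are illustrative):

         #include <cuda_runtime.h>

         int main() {
             // Allow cudaDeviceSynchronize() to be called up to 4 levels deep
             // in the dynamic-parallelism launch tree.
             cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);

             // Reserve room for up to 4096 pending (not yet started) child grids.
             cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

             // ... launch kernels that use dynamic parallelism ...
             return 0;
         }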

  14. Compilation Issues
     - CC 3.5 or higher
     - Relocatable device code: -rdc=true
     - Device (dynamic) linking, performed before host linking, against cudadevrt
     (an example build follows)
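
     For illustration, a separate-compilation build might look like this (file names and paths are placeholders; a single-step "nvcc -rdc=true kernels.cu -o app -lcudadevrt" also works):

         nvcc -arch=sm_35 -rdc=true -c kernels.cu -o kernels.o
         nvcc -arch=sm_35 -dlink kernels.o -o device_link.o -lcudadevrt
         g++ kernels.o device_link.o -o app -L/usr/local/cuda/lib64 -lcudadevrt -lcudart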

  15. Very Short Kernels
     - The launching overhead becomes more significant
       - Especially if the kernel execution time is measured in the order of microseconds
     - If many small kernels are executed consecutively, we need to overlap kernel launching and execution
     - CUDA Graphs allow us to pre-record the sequence (graph) of kernel launches (and their synchronizations) and execute them all at once

  16. Example
     Assuming the kernels take microseconds to execute and there are many epochs (e.g., thousands):

     constexpr int steps = 20;
     for (int epoch = 0; epoch < epochs; ++epoch) {
         for (int i = 0; i < steps; ++i) {
             stencilKernel<<<..., stream>>>(...);
         }
         smoothKernel<<<..., stream>>>(...);
         cudaStreamSynchronize(stream);
         ...
     }

  17. The graph holds the structure; the instance is ready to be executed.

     cudaGraph_t graph;
     cudaGraphExec_t instance;
     bool graphCreated = false;
     for (int epoch = 0; epoch < epochs; ++epoch) {
         if (!graphCreated) {
             cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
             for (int i = 0; i < steps; ++i) {
                 stencilKernel<<<..., stream>>>(...);   // not actually executed, only captured for the graph
             }
             smoothKernel<<<..., stream>>>(...);
             cudaStreamEndCapture(stream, &graph);
             cudaGraphInstantiate(&instance, graph, 0);
             graphCreated = true;
         }
         cudaGraphLaunch(instance, stream);   // all captured kernels launched (in order) at once
         cudaStreamSynchronize(stream);
     }

  18. CUDA Graphs
     - Graphs may be more complex and non-linear
     - As an alternative to stream capture, the graph can be constructed manually using API calls (see the sketch below)
     - Graph node types:
       - Kernel execution
       - CPU function call
       - Memory copy, memset
       - Recording an event, waiting for an event
       - Signaling/waiting on an external semaphore
       - Child (nested) graph
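
     A minimal sketch of manual graph construction with a single kernel node (not from the slides; myKernel, its launch configuration, and buildGraph are illustrative placeholders):

         #include <cuda_runtime.h>

         __global__ void myKernel(int *data) { /* ... */ }

         void buildGraph(int *d_data, cudaStream_t stream) {
             cudaGraph_t graph;
             cudaGraphCreate(&graph, 0);

             // Describe one kernel-execution node.
             void *args[] = { &d_data };
             cudaKernelNodeParams params = {};
             params.func = (void *)myKernel;
             params.gridDim = dim3(64);
             params.blockDim = dim3(256);
             params.sharedMemBytes = 0;
             params.kernelParams = args;

             cudaGraphNode_t node;
             cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params);  // no dependencies

             // Instantiate once, then the executable graph can be launched repeatedly.
             cudaGraphExec_t instance;
             cudaGraphInstantiate(&instance, graph, 0);
             cudaGraphLaunch(instance, stream);
         }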

  19. (figure slide)

  20. Martin Kruliš

  21. Wide Variety Available
     - Core NVIDIA libraries: libcu++, CUB, Thrust
     - Math libraries: cuBLAS, cuFFT, cuSolver
     - AI libraries: cuDNN
     - Other specialized libraries: PhysX, …
     - Libraries with higher abstraction (and possibly portability)

  22. Standard C++ Library for CUDA
     - Header-only library for heterogeneous support of the C++ STL
     - Extended API with more elaborate synchronization primitives and other constructs
     - Part of the CUDA Toolkit and the NVIDIA HPC SDK; also in the CUDA Core C++ Libraries
       https://github.com/NVIDIA/cccl

     // Std. C++ (host only)
     #include <atomic>
     std::atomic<int> x;

     // CUDA C++ (host and device, conforming to the standard)
     #include <cuda/std/atomic>
     cuda::std::atomic<int> x;

     // CUDA C++ (host and device, extension)
     #include <cuda/atomic>
     cuda::atomic<int, cuda::thread_scope_block> x;

  23. Standard API
     - Time: <cuda/std/chrono>
     - Numerics: <cuda/std/complex>, <cuda/std/ratio>, <cuda/std/cfloat>, <cuda/std/climits>, <cuda/std/cstdint>
     - Utility: <cuda/std/type_traits>, <cuda/std/tuple>, <cuda/std/functional>, <cuda/std/utility>, <cuda/std/version>

  24. Synchronization
     - Atomics (cuda::atomic, cuda::atomic_ref)
     - Latches (cuda::latch)
     - Barriers (cuda::barrier) – a usage sketch follows
     - Semaphores (cuda::counting_semaphore, cuda::binary_semaphore)
     - Pipelines (cuda::pipeline)
       - Coordination mechanism for sequencing async. operations (in stages)
       - Acquire, commit operations, wait, release
       - Can be used for individual threads as well as thread groups
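
     A minimal sketch of a block-scoped cuda::barrier (not from the slides; the two "phases" of work are placeholders):

         #include <cuda/barrier>

         __global__ void barrierExample(float *data, int n) {
             // One barrier shared by all threads of the block.
             __shared__ cuda::barrier<cuda::thread_scope_block> bar;
             if (threadIdx.x == 0)
                 init(&bar, blockDim.x);      // initialize with the number of participating threads
             __syncthreads();                 // make the initialized barrier visible to all threads

             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n)
                 data[i] *= 2.0f;             // placeholder work for phase 1

             bar.arrive_and_wait();           // whole block finishes phase 1 before phase 2 starts

             if (i < n)
                 data[i] += 1.0f;             // placeholder work for phase 2
         }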

  25. Async Operations
     - cuda::memcpy_async (a usage sketch follows)
     - Thread groups and memory model abstraction
       - struct ThreadGroup, cuda::thread_scope
     - Memory access properties
       - cuda::annotated_ptr
       - cuda::access_property::normal, shared, global, persisting, streaming
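
     A minimal sketch of cuda::memcpy_async paired with a block-scoped barrier (not from the slides; the tile size, the assumption that blockDim.x == TILE, and the follow-up computation are placeholders):

         #include <cuda/barrier>

         constexpr int TILE = 256;

         __global__ void asyncCopyExample(const float *global_in, float *global_out) {
             __shared__ float tile[TILE];
             __shared__ cuda::barrier<cuda::thread_scope_block> bar;
             if (threadIdx.x == 0)
                 init(&bar, blockDim.x);
             __syncthreads();

             // Stage one element per thread into shared memory asynchronously;
             // completion is signalled through the barrier.
             cuda::memcpy_async(&tile[threadIdx.x],
                                &global_in[blockIdx.x * TILE + threadIdx.x],
                                sizeof(float), bar);
             bar.arrive_and_wait();           // wait until the copies have completed

             // Placeholder computation on the staged tile.
             global_out[blockIdx.x * TILE + threadIdx.x] = tile[threadIdx.x] * 2.0f;
         }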

  26. CUB – Reusable Components
     - Also in the CUDA Core C++ Libraries
     - Basic parallel collective primitives
       - Warp-/block-/device-wide
       - Prefix scan, sort, reduction, cooperative I/O, histogram
       - Optimized for the target architecture, supports dynamic parallelism, …
     - Utilities
       - Iterators
       - Thread and block I/O (with cache management)
       - PTX intrinsics
       - Device, execution, and storage management
     (a block-wide reduction sketch follows)
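
     A minimal sketch of a block-wide reduction with cub::BlockReduce (not from the slides; the block size and the one-element-per-thread pattern are illustrative):

         #include <cub/cub.cuh>

         constexpr int BLOCK_THREADS = 128;

         __global__ void blockSum(const int *input, int *blockSums, int n) {
             using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
             __shared__ typename BlockReduce::TempStorage temp;

             int i = blockIdx.x * BLOCK_THREADS + threadIdx.x;
             int value = (i < n) ? input[i] : 0;

             // Collective reduction across the thread block; the result is valid in thread 0.
             int sum = BlockReduce(temp).Sum(value);
             if (threadIdx.x == 0)
                 blockSums[blockIdx.x] = sum;
         }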

  27. CUDA Thrust
     - Also in the CUDA Core C++ Libraries
     - C++ template library based on the STL API
     - The basic idea is to develop C++ parallel applications with minimal overhead (with higher-level abstraction)
     - STL-like vectors (for host and device) and vector algorithms:

     thrust::host_vector<double> h_vec(N);
     thrust::generate(h_vec.begin(), h_vec.end(), [&] { return ... });
     thrust::device_vector<double> d_vec = h_vec;

     Yes, this is all we need to allocate the device buffer and copy the data.
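
     For illustration, a couple of algorithm calls that could follow on the device vector above (a hedged sketch; the wrapper function and includes are assumptions):

         #include <thrust/device_vector.h>
         #include <thrust/sort.h>
         #include <thrust/reduce.h>

         void postprocess(thrust::device_vector<double> &d_vec) {
             // Sort the data on the device.
             thrust::sort(d_vec.begin(), d_vec.end());

             // Reduce it to a single sum (executed on the device, result returned to the host).
             double sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0);

             (void)sum;   // placeholder use of the result
         }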

  28. Algorithms
     - Copying (scatter, gather), merging, prefix sums, reductions, reordering, searching, sorting, set operations, generators, transformations, …
     - Governed by execution policies (thrust::host, thrust::device)
     - Containers, memory management
       - thrust::device_vector, device_allocator, memory resources
     - Iterators: counting, permutation, zip, …
     - Others: basic types, system tools, numerics, random numbers, …

  29. CUDA Basic Linear Algebra Subroutines (cuBLAS)
     - CUDA implementation of the standard BLAS library
     - Complete support of all 152 functions on vectors/matrices
       - copy, move, rotate, swap
       - maximum, minimum, multiply by scalar
       - sums, dot products, Euclidean norms
       - matrix multiplications, inverses, linear combinations
     - Some operations have batched versions
     - Supports floats, doubles, and complex numbers
     (a matrix-multiplication sketch follows)
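
     A minimal single-precision GEMM sketch (not from the slides; the dimensions and device pointers are placeholders, and the matrices are assumed to be column-major, as cuBLAS expects):

         #include <cublas_v2.h>

         // Computes C = alpha * A * B + beta * C for column-major m x k, k x n, and m x n
         // matrices already resident in device memory (d_A, d_B, d_C).
         void gemm(const float *d_A, const float *d_B, float *d_C, int m, int n, int k) {
             cublasHandle_t handle;
             cublasCreate(&handle);

             const float alpha = 1.0f, beta = 0.0f;
             cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         m, n, k,
                         &alpha, d_A, m,    // lda = m
                         d_B, k,            // ldb = k
                         &beta, d_C, m);    // ldc = m

             cublasDestroy(handle);
         }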

  30. CUDA Sparse Linear Algebra
     - Open-source C++ library for sparse linear structures (matrices, linear systems, …)
     - Key features
       - Sparse matrix operations (addition, subtraction, maximal independent set, polynomial relaxation, …)
       - Supports various matrix formats: COO, CSR, DIA, ELL, and HYB
     - Requires CUDA CC 2.0 or higher

  31. CUDA Fast Fourier Transform (cuFFT)
     - Decomposes a signal into its frequency spectrum
     - 1–3D transforms (up to 128M elements)
     - Many variations (precision, complex/real types, …)
     - API similar to the FFTW library
       - Create a plan (cufftHandle), which holds the configuration
       - Associate/allocate workspace (buffers)
       - cufftExecC2C() (or R2C, C2R) starts execution
     - An FFT plan can be associated with a CUDA stream
       - For synchronization and overlapping
     (a sketch follows)
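
     A minimal 1D complex-to-complex transform sketch (not from the slides; the signal length and the in-place transform are illustrative):

         #include <cufft.h>

         // Performs an in-place forward FFT on N complex samples already in device memory.
         void forwardFFT(cufftComplex *d_signal, int N, cudaStream_t stream) {
             cufftHandle plan;
             cufftPlan1d(&plan, N, CUFFT_C2C, 1);        // 1 batch of length N
             cufftSetStream(plan, stream);               // associate the plan with a CUDA stream

             cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

             cufftDestroy(plan);
         }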

  32. GPU AI for Board Games
     - Specific AI library designed for games with a large but well-defined configuration space
     - Requires CUDA CC 2.0
     - Currently supports
       - Game Tree Split alpha/beta pruning
       - Single and multiple recursion (with large depths)
       - Zero-sum games (3D Tic-Tac-Toe, Reversi, …)
       - Sudoku backtracking generator and solver
       - Statistical simulations (Monte Carlo for Go)

  33. PhysX
     - Real-time physics engine
       - Originally developed by Ageia for the PPU card
       - NVIDIA bought it and re-implemented it for CUDA
     - Most important features
       - Simulation of rigid bodies (collisions, destruction)
       - Cloth and fluid particle systems
     - APEX
       - A framework built on top of PhysX
       - Designed for easy usage (artists, games, …)

  34. (figure slide)

  35. Martin Kruliš

  36. Interoperability
     - Allows CUDA code to read/write graphics buffers
     - Works with the OpenGL and Direct3D libraries
     - Motivation
       - Direct visualization of complex simulations
       - Augmenting 3D rendering with visualization routines that are difficult to implement in shaders
     - How it works
       - The graphics resource is registered and represented by struct cudaGraphicsResource
       - The resource may be mapped into the CUDA memory space: cudaGraphicsMapResources(), …

  37. Initialization
     - The device must be selected by cudaGLSetGLDevice()
     - Resources
       - cudaGraphicsGLRegisterBuffer() for buffers
         - The mapped buffers can be accessed in the same way as CUDA-allocated memory
       - cudaGraphicsGLRegisterImage() for images and render buffers
         - The image buffers can also be accessed through the texture and surface mechanisms
     (a mapping sketch follows)
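
     A minimal sketch of registering and mapping an OpenGL vertex buffer so a kernel can write into it (not from the slides; the VBO handle, the kernel, and per-call registration are illustrative — in real code the buffer is typically registered once at startup):

         #include <cuda_gl_interop.h>

         __global__ void fillVertices(float4 *vertices, int count) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < count)
                 vertices[i] = make_float4(i * 0.01f, 0.0f, 0.0f, 1.0f);   // placeholder vertex data
         }

         void updateVBO(unsigned int vbo, int count) {
             // Register the OpenGL buffer object with CUDA.
             cudaGraphicsResource *resource;
             cudaGraphicsGLRegisterBuffer(&resource, vbo, cudaGraphicsMapFlagsWriteDiscard);

             // Map it so CUDA can obtain a device pointer to its storage.
             cudaGraphicsMapResources(1, &resource, 0);
             float4 *d_vertices;
             size_t size;
             cudaGraphicsResourceGetMappedPointer((void **)&d_vertices, &size, resource);

             fillVertices<<<(count + 255) / 256, 256>>>(d_vertices, count);

             // Unmap before OpenGL uses the buffer again.
             cudaGraphicsUnmapResources(1, &resource, 0);
             cudaGraphicsUnregisterResource(resource);
         }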

  38. Direct3D Support
     - Versions 9, 10, and 11 are supported; each version has its own API
     - A CUDA context may operate with one Direct3D device at a time
       - A special HW mode must be set on the device
     - Initialization is similar to OpenGL
       - cudaD3D[9|10|11]SetDirect3DDevice()
     - Available Direct3D resources
       - Buffers, textures, and surfaces
       - All using cudaGraphicsD3DXXRegisterResource()

  39. (figure slide)
