Parallelism in GPU Computing
Martin Kruliš
28. 11. 2023 (v1.4)
 
Different Types of Parallelism
Task parallelism
The problem is divided into tasks, which are processed independently
Data parallelism
The same operation is performed over many data items
Pipeline parallelism
Data flow through a sequence (or directed graph) of stages that operate concurrently
Other types of parallelism
Event-driven, …
 
Parallelism in GPU
Data parallelism
The same kernel is executed by many threads
Each thread processes one data item
Limited task parallelism
Multiple kernels executed simultaneously (since Fermi)
At most as many kernels as SMPs
But we do not have…
Any guarantees that two blocks/kernels will actually run concurrently
Efficient means of synchronization outside the block
Yes, there are cooperative groups, but the synchronization cost skyrockets when crossing the block boundary
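
For illustration, this is what a grid-wide synchronization with cooperative groups looks like; a minimal sketch with illustrative names, assuming the kernel is launched with cudaLaunchCooperativeKernel and all blocks fit on the GPU at once:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Phase 2 must not start in any block before phase 1 has finished in all blocks.
__global__ void twoPhaseKernel(const float *in, float *tmp, float *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = (int)grid.thread_rank();
    if (i < n) tmp[i] = 2.0f * in[i];               // phase 1
    grid.sync();                                    // grid-wide barrier (the costly cross-block sync)
    if (i < n) out[i] = tmp[i] + tmp[(i + 1) % n];  // phase 2 reads results of other blocks
}

// Host side (sketch): a cooperative launch is required for grid.sync()
//   void *args[] = { &d_in, &d_tmp, &d_out, &n };
//   cudaLaunchCooperativeKernel((void *)twoPhaseKernel, gridDim, blockDim, args);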
 
Unsuitable Problems for GPUs
Processing irregular data structures
Trees, graphs, …
Regular structures with irregular processing workload
Difficult simulations, iterative approximations
Iterative tasks with explicit synchronization
Pipeline-oriented tasks with many simple stages
 
Solutions
Iterative/consecutive kernel execution
Usually applicable only when there are no or few dependencies between subsequent kernels
The state (or most of it) is kept on the GPU
Mapping irregular structures to regular grids
May be too fine/coarse grained
Not always possible
2-phase Algorithms
First phase determines the amount of work (items, …)
Second phase processes the tasks mapped by the first phase
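
As a concrete illustration of the 2-phase pattern, here is a minimal stream-compaction sketch (illustrative names, not from the slides): the first kernel counts outputs per item, a prefix sum turns the counts into offsets, and the second kernel writes to the computed positions.

#include <cub/cub.cuh>

__global__ void countPhase(const int *in, int n, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counts[i] = (in[i] > 0) ? 1 : 0;         // one output per positive item
}

__global__ void processPhase(const int *in, int n, const int *offsets, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) out[offsets[i]] = in[i];    // write to the precomputed position
}

void twoPhaseCompact(const int *d_in, int n, int *d_counts, int *d_offsets, int *d_out) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    countPhase<<<blocks, threads>>>(d_in, n, d_counts);

    // Exclusive prefix sum over the counts (device-wide scan from CUB)
    void *d_temp = nullptr; size_t tempBytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_counts, d_offsets, n);
    cudaMalloc(&d_temp, tempBytes);
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_counts, d_offsets, n);

    processPhase<<<blocks, threads>>>(d_in, n, d_offsets, d_out);
    cudaFree(d_temp);
}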
 
Persistent Threads
A single kernel is executed with as many thread blocks as there are SMPs
Workload is generated and processed on the GPU
Using global queue implemented by atomic operations
Possible problems
The kernel may be quite complex
There are no guarantees that two blocks will ever run concurrently
Possibility of creating deadlock
The synchronization overhead is significant
But it might be lower than host-based synchronization
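
A minimal sketch of the persistent-threads pattern (illustrative names, not from the slides), where each block repeatedly pops items from a global queue with an atomic counter:

__device__ int g_queueHead = 0;              // index of the next item to take
__device__ int g_queueSize;                  // total number of items in the queue

__device__ void processItem(int item);       // assumed to be defined elsewhere

__global__ void persistentKernel() {         // launched with one block per SMP
    __shared__ int item;
    while (true) {
        if (threadIdx.x == 0)
            item = atomicAdd(&g_queueHead, 1);   // pop one work item per block
        __syncthreads();
        if (item >= g_queueSize) break;          // queue exhausted, the block terminates
        processItem(item);                       // the whole block cooperates on the item
        __syncthreads();                         // make sure all threads finished the item
    }
}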
 
CUDA Dynamic Parallelism
Introduced in CC 3.5 (Kepler)
GPU threads are allowed to launch new grids
 
Dynamic Parallelism Purpose
The device does not need to synchronize with the host to issue new work to the device
Irregular parallelism may be expressed more easily
 
How It Works
Portions of CUDA runtime are ported to the device
Kernel execution
Device synchronization
Streams, events, and async. memory operations
Kernel launches are asynchronous
No guarantee the child kernel starts immediately
Synchronization points may cause a context switch
Context of an entire block has to be switched off an SMP
Block-wise locality of resources
Streams and events are valid within a thread block
 
Example
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();
    }
    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}

Thread 0 invokes a grid of child threads
Synchronization does not have to be invoked by all threads
Device synchronization does not synchronize threads in the block
 
Memory Consistency
Global memory writes made by the parent before the child launch are visible to the child grid, and the child's writes become visible to the parent after it synchronizes on the child; pointers to shared or local memory must not be passed to a child grid
 
Depth Limits
Nesting depth (depth of the recursion)
Synchronization depth (deepest level where cudaDeviceSynchronize() is invoked)
Both limits are configured via cudaDeviceSetLimit()
cudaLimitDevRuntimeSyncDepth
cudaLimitDevRuntimePendingLaunchCount (number of pending grids)
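
A minimal sketch of raising these limits from the host before launching a kernel that uses dynamic parallelism (the values are illustrative):

#include <cuda_runtime.h>

void configureDeviceRuntime() {
    // Allow cudaDeviceSynchronize() to be called up to 4 levels deep
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    // Reserve room for up to 4096 pending (not yet started) child grids
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);
}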
 
Compilation Issues
CC 3.5 or higher
Relocatable device code: -rdc=true
Device linking (performed before host linking) against the cudadevrt library
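
For illustration, a typical build could look like this (file names are placeholders; the flags -rdc=true and -lcudadevrt are the essential parts):

nvcc -arch=sm_35 -rdc=true -c kernels.cu -o kernels.o
nvcc -arch=sm_35 kernels.o -o app -lcudadevrt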
 
Very Short Kernels
The launching overhead becomes more significant
Especially if the kernel execution time is measured in the order of microseconds
If many small kernels are executed consecutively, we need to overlap kernel launching and execution
CUDA Graphs allow us to pre-record the sequence (graph) of kernel launches (and their synchronizations) and execute them all at once
 
Example
constexpr int steps = 20;
for (int epoch = 0; epoch < epochs; ++epoch) {
  for (int i = 0; i < steps; ++i) {
    stencilKernel<<<..., stream>>>(...);
  }
  smoothKernel<<<..., stream>>>(...);
  cudaStreamSynchronize(stream);
  ...
}

Assuming the kernels take microseconds to execute and there are many epochs (e.g., thousands).

With CUDA Graphs, the same sequence is captured once and then replayed:

cudaGraph_t graph;
cudaGraphExec_t instance;
bool graphCreated = false;
for (int epoch = 0; epoch < epochs; ++epoch) {
  if (!graphCreated) {
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < steps; ++i) {
      stencilKernel<<<..., stream>>>(...);   // not actually executed, only captured for the graph
    }
    smoothKernel<<<..., stream>>>(...);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, 0);
    graphCreated = true;
  }
  cudaGraphLaunch(instance, stream);         // all captured kernels launched (in order) at once
  cudaStreamSynchronize(stream);
}

The graph holds the structure; the instance is ready to be executed.
 
CUDA Graphs
May be more complex and non-linear
As an alternative to stream capturing, the graph can be constructed manually using API calls (see the sketch after this list)
Graph node types
Kernel execution
CPU function call
Memory copy, memset
Recording an event, waiting for an event
Signaling/waiting on an external semaphore
Child (nested) graph
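
A minimal sketch of the manual construction path with a single kernel node (the kernel name and its arguments are illustrative):

#include <cuda_runtime.h>

__global__ void myKernel(int *data);    // assumed to be defined elsewhere

void buildAndRunGraph(int *d_data, cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    void *kernelArgs[] = { &d_data };
    cudaKernelNodeParams params = {};
    params.func = (void *)myKernel;
    params.gridDim = dim3(64);
    params.blockDim = dim3(256);
    params.sharedMemBytes = 0;
    params.kernelParams = kernelArgs;
    params.extra = nullptr;

    cudaGraphNode_t kernelNode;
    cudaGraphAddKernelNode(&kernelNode, graph, nullptr, 0, &params);   // no dependencies

    cudaGraphExec_t instance;
    cudaGraphInstantiate(&instance, graph, 0);
    cudaGraphLaunch(instance, stream);
}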
 
Wide variety of GPU libraries available
Core NVIDIA libraries
Libcu++, CUB, Thrust
Math libraries
cuBLAS, cuFFT, cuSolver
AI libraries
cuDNN
Other specialized libraries
PhysX, …
Libraries with higher abstraction (and possibly portability)
 
libcu++: Standard C++ Library for CUDA
Header-only lib for heterogeneous support of C++ STL
Extended API with more elaborate sync. primitives and other constructs
Part of CUDA Toolkit and NVIDIA HPC SDK
also in CUDA Core C++ libraries
 
https://github.com/NVIDIA/cccl
 
// Std. C++ (host only)
#include <atomic>
std::atomic<int> x;

// CUDA C++ (host and device, conforming to the standard)
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++ (host and device, extension)
#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;
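
A brief usage sketch (illustrative, not from the slides; assumes a recent libcu++): updating a device-scoped counter from a kernel via cuda::atomic_ref.

#include <cuda/atomic>

// Count elements greater than zero using a device-scoped atomic reference
__global__ void countPositive(const int *data, int n, int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) {
        cuda::atomic_ref<int, cuda::thread_scope_device> cnt(*counter);
        cnt.fetch_add(1, cuda::std::memory_order_relaxed);
    }
}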
 
Standard API
Time
<cuda/std/chrono>
Numerics
<cuda/std/complex>
<cuda/std/ratio>, <cuda/std/cfloat>
<cuda/std/climits>, <cuda/std/cstdint>
Utility
<cuda/std/type_traits>
<cuda/std/tuple>
<cuda/std/functional>
<cuda/std/utility>
<cuda/std/version>
 
Synchronization
Atomics (cuda::atomic, cuda::atomic_ref)
Latches (cuda::latch)
Barriers (cuda::barrier)
Semaphores (cuda::counting_semaphore, cuda::binary_semaphore)
Pipelines (cuda::pipeline)
Coordination mechanism for sequencing async. operations (in stages)
Acquire, commit operations, wait, release
Can be used for individual threads as well as thread groups
 
Async operations
cuda::memcpy_async
 
Thread groups and memory model abstraction
struct ThreadGroup
cuda::thread_scope
 
Memory access properties
cuda::annotated_ptr
cuda::access_property::normal, shared, global, persisting, streaming
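
A rough sketch of cuda::memcpy_async together with a block-scoped cuda::barrier (illustrative names, tile size fixed to the block size of 256; newer architectures accelerate the copy in hardware, older ones fall back to a synchronous path):

#include <cooperative_groups.h>
#include <cuda/barrier>

// Each block stages one 256-element tile into shared memory asynchronously,
// waits on the barrier, and then processes the tile.
__global__ void scaleTile(const float *src, float *dst) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
        init(&bar, block.size());              // initialize the barrier once per block
    block.sync();

    size_t base = (size_t)blockIdx.x * blockDim.x;
    cuda::memcpy_async(block, tile, src + base, sizeof(float) * blockDim.x, bar);
    bar.arrive_and_wait();                     // the tile is now in shared memory

    dst[base + threadIdx.x] = 2.0f * tile[threadIdx.x];
}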
 
CUB reusable components
Also in CUDA Core C++ libraries
Basic parallel collective primitives
Warp/block/device-wide
Prefix-scan, sort, reduction, cooperative I/O, histogram…
Optimized for the target architecture, supports dynamic parallelism, …
Utilities
Iterators
Thread and block I/O (with cache management)
PTX intrinsics
Device, execution, and storage management
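
A minimal sketch of a block-wide collective (illustrative names, not from the slides): each 128-thread block reduces its 128 values with cub::BlockReduce and writes one partial sum.

#include <cub/cub.cuh>

__global__ void blockSums(const int *in, int *partialSums) {
    using BlockReduce = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduce::TempStorage temp;    // collective scratch space

    int value = in[blockIdx.x * blockDim.x + threadIdx.x];
    int sum = BlockReduce(temp).Sum(value);               // block-wide reduction
    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = sum;                    // one result per block
}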
 
CUDA Thrust
Also in CUDA Core C++ libraries
C++ template library based on STL API
The basic idea is to develop C++ parallel applications with minimal overhead (with higher-level abstraction)
STL-like vectors (for host and device) and vector algorithms
 
thrust::host_vector<double> h_vec(N);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return ... });
 
thrust::device_vector<double> d_vec = h_vec;
 
Yes, this is all we need to allocate the device buffer and copy the data…
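
For completeness, a small self-contained sketch (not from the slides) that fills a host vector, copies it to the device, and runs device-side algorithms on it:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h_vec(1 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), std::rand);          // fill on the host
    thrust::device_vector<int> d_vec = h_vec;                         // allocate on the GPU and copy
    thrust::sort(d_vec.begin(), d_vec.end());                         // parallel sort on the device
    long long sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0LL);  // parallel reduction
    return sum < 0 ? 1 : 0;                                           // keep the result observable
}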
 
Algorithms
Copying (scatter, gather), merging, prefix-sums, reductions, reordering, searching, sorting, setters, generators, transformations, …
Governed by execution policies (thrust::host, thrust::device)
Containers, memory management
thrust::device_vector, device_allocator, memory resources
Iterators
Counting, permutation, zip, …
Others
Basic types, system tools, numerics, random numbers, …
 
CUDA Basic Linear Algebra Subroutines
CUDA implementation of standard BLAS library
Complete support of all 152 functions on vectors/matrices
copy, move, rotate, swap
maximum, minimum, multiply by scalar
sum, dot products, Euclidean norms
matrix multiplications, inverses, linear combinations
Some operations have batch versions
Supports floats, doubles, and complex numbers
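
A minimal sketch of a cuBLAS call (error handling omitted; d_x and d_y are assumed to be device buffers of length n): y = alpha * x + y via SAXPY.

#include <cublas_v2.h>

void saxpy(int n, float alpha, const float *d_x, float *d_y) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // runs asynchronously on the GPU
    cublasDestroy(handle);
}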
 
CUDA Sparse Linear Algebra
Open source C++ library for sparse linear structures (matrices, linear systems, …)
Key features
Sparse matrix operations (add, subtraction, max independent set, polynomial relaxation, …)
Supports various matrix formats
COO, CSR, DIA, ELL, and HYB
Requires CUDA CC 2.0 or higher
 
CUDA Fast Fourier Transform
Decomposes a signal into its frequency spectrum
1-3D transforms (up to 128M elements)
Many variations (precision, complex/real types, …)
API similar to the FFTW library
Create a plan (cufftHandle) which holds the configuration
Associate/allocate workspace (buffers)
cufftExecC2C() (or R2C, C2R) starts execution
FFT plan can be associated with the CUDA stream
For synchronization and overlapping
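
A minimal sketch of the plan/exec workflow (illustrative; d_signal is assumed to be a device buffer of N cufftComplex values): an in-place 1D complex-to-complex transform associated with a stream.

#include <cufft.h>
#include <cuda_runtime.h>

void forwardFFT(cufftComplex *d_signal, int N, cudaStream_t stream) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // the plan holds the configuration (1 batch)
    cufftSetStream(plan, stream);                           // associate the plan with a CUDA stream
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // enqueue the transform (asynchronous)
    cudaStreamSynchronize(stream);                          // wait before destroying the plan
    cufftDestroy(plan);
}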
 
GPU AI for Board Games
Specific AI library designed for games with large, but well-defined configuration space
Requires CUDA CC 2.0
Currently supports
Game Tree Split – alpha/beta pruning
Single and multiple recursion (with large depths)
Zero-sum games (3D Tic-Tac-Toe, Reversi, …)
Sudoku backtracking generator and solver
Statistical simulations (Monte Carlo for Go)
 
PhysX
Real-time physics engine
Originally developed by Ageia for PPU card
NVIDIA bought it and re-implemented it for CUDA
Most important features
Simulation of rigid bodies (collisions, destruction)
Cloth and fluid particle systems
 
APEX
A framework built on top of PhysX
Designed for easy usage (artists, games, …)
 
 
Interoperability
Allows CUDA code to read/write graphical buffers
Works with OpenGL and Direct3D libraries
Motivation
Direct visualization of complex simulations
Augmenting 3D rendering with visualization routines that are difficult to implement in shaders
How it works
The graphics resource is registered and represented by struct cudaGraphicsResource
The resource may be mapped to CUDA memory space
cudaGraphicsMapResources(), …
 
Initialization
Device must be selected by cudaGLSetGLDevice()
Resources
cudaGraphicsGLRegisterBuffer() for buffers
The mapped buffers can be accessed in the same way as CUDA-allocated memory
cudaGraphicsGLRegisterImage() for images and render buffers
The image buffers can also be accessed through texture and surface mechanisms
Examples
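
A rough end-to-end sketch of the OpenGL buffer path (illustrative names; updateVertices is a hypothetical kernel, and the GL buffer object vbo is assumed to be created elsewhere):

#include <cuda_gl_interop.h>
#include <cuda_runtime.h>

__global__ void updateVertices(float4 *verts, int n);    // hypothetical kernel defined elsewhere

cudaGraphicsResource_t resource;

void registerVBO(unsigned int vbo) {
    // Register the GL buffer once; CUDA will overwrite its contents every frame
    cudaGraphicsGLRegisterBuffer(&resource, vbo, cudaGraphicsRegisterFlagsWriteDiscard);
}

void updateFrame(int n) {
    float4 *d_ptr;
    size_t size;
    cudaGraphicsMapResources(1, &resource, 0);                             // make it visible to CUDA
    cudaGraphicsResourceGetMappedPointer((void **)&d_ptr, &size, resource);
    updateVertices<<<(n + 255) / 256, 256>>>(d_ptr, n);                    // write vertex data on the GPU
    cudaGraphicsUnmapResources(1, &resource, 0);                           // hand the buffer back to OpenGL
}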
 
Direct3D Support
Versions 9, 10, and 11 are supported
Each version has its own API
CUDA context may operate with one Direct3D device at a time
And special HW mode must be set on the device
Initialization is similar to OpenGL
cudaD3D[9|10|11]SetDirect3DDevice()
Available Direct3D resources
Buffers, textures, and surfaces
All using cudaGraphicsD3D[9|10|11]RegisterResource()
 