Parallelism in GPU Computing
Martin Kruliš
28. 11. 2023 (v1.4)
 
Different Types of Parallelism
Task parallelism
The problem is divided into tasks, which are processed independently
Data parallelism
The same operation is performed over many data items
Pipeline parallelism
Data flow through a sequence (or directed graph) of stages that operate concurrently
Other types of parallelism
Event-driven, …
 
Parallelism in GPU
Data parallelism
The same kernel is executed by many threads
Each thread processes one data item
Limited task parallelism
Multiple kernels executed simultaneously (since Fermi)
At most as many kernels as SMPs
But we do not have…
Any guarantees that two blocks/kernels will actually run concurrently
Efficient means of synchronization outside the block
Yes, there are cooperative groups, but the synchronization cost skyrockets when crossing the block boundary
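
For illustration, this is what a grid-wide synchronization with cooperative groups looks like; a minimal sketch with illustrative names, assuming the kernel is launched with cudaLaunchCooperativeKernel and all blocks fit on the GPU at once:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Phase 2 must not start in any block before phase 1 has finished in all blocks.
__global__ void twoPhaseKernel(const float *in, float *tmp, float *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = (int)grid.thread_rank();
    if (i < n) tmp[i] = 2.0f * in[i];               // phase 1
    grid.sync();                                    // grid-wide barrier (the costly cross-block sync)
    if (i < n) out[i] = tmp[i] + tmp[(i + 1) % n];  // phase 2 reads results of other blocks
}

// Host side (sketch): a cooperative launch is required for grid.sync()
//   void *args[] = { &d_in, &d_tmp, &d_out, &n };
//   cudaLaunchCooperativeKernel((void *)twoPhaseKernel, gridDim, blockDim, args);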
 
Unsuitable Problems for GPUs
Processing irregular data structures
Trees, graphs, …
Regular structures with irregular processing workload
Difficult simulations, iterative approximations
Iterative tasks with explicit synchronization
Pipeline-oriented tasks with many simple stages
 
Solutions
Iterative/consecutive kernel execution
Usually applicable only when there are no or few dependencies between subsequent kernels
The state (or most of it) is kept on the GPU
Mapping irregular structures to regular grids
May be too fine/coarse grained
Not always possible
2-phase Algorithms
First phase determines the amount of work (items, …)
Second phase processes the tasks mapped by the first phase
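
As a concrete illustration of the 2-phase pattern, here is a minimal stream-compaction sketch (illustrative names, not from the slides): the first kernel counts outputs per item, a prefix sum turns the counts into offsets, and the second kernel writes to the computed positions.

#include <cub/cub.cuh>

__global__ void countPhase(const int *in, int n, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counts[i] = (in[i] > 0) ? 1 : 0;         // one output per positive item
}

__global__ void processPhase(const int *in, int n, const int *offsets, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) out[offsets[i]] = in[i];    // write to the precomputed position
}

void twoPhaseCompact(const int *d_in, int n, int *d_counts, int *d_offsets, int *d_out) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    countPhase<<<blocks, threads>>>(d_in, n, d_counts);

    // Exclusive prefix sum over the counts (device-wide scan from CUB)
    void *d_temp = nullptr; size_t tempBytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_counts, d_offsets, n);
    cudaMalloc(&d_temp, tempBytes);
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_counts, d_offsets, n);

    processPhase<<<blocks, threads>>>(d_in, n, d_offsets, d_out);
    cudaFree(d_temp);
}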
 
Persistent Threads
A single kernel is executed with as many thread blocks as there are SMPs
Workload is generated and processed on the GPU
Using global queue implemented by atomic operations
Possible problems
The kernel may be quite complex
There are no guarantees that two blocks will ever run concurrently
Possibility of creating deadlock
The synchronization overhead is significant
But it might be lower than host-based synchronization
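
A minimal sketch of the persistent-threads pattern (illustrative names, not from the slides), where each block repeatedly pops items from a global queue with an atomic counter:

__device__ int g_queueHead = 0;              // index of the next item to take
__device__ int g_queueSize;                  // total number of items in the queue

__device__ void processItem(int item);       // assumed to be defined elsewhere

__global__ void persistentKernel() {         // launched with one block per SMP
    __shared__ int item;
    while (true) {
        if (threadIdx.x == 0)
            item = atomicAdd(&g_queueHead, 1);   // pop one work item per block
        __syncthreads();
        if (item >= g_queueSize) break;          // queue exhausted, the block terminates
        processItem(item);                       // the whole block cooperates on the item
        __syncthreads();                         // make sure all threads finished the item
    }
}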
 
CUDA Dynamic Parallelism
Introduced in CC 3.5 (Kepler)
GPU threads are allowed to launch new grids
 
Dynamic Parallelism Purpose
The device does not need to synchronize with the host to issue new work to the device
Irregular parallelism may be expressed more easily
 
How It Works
Portions of CUDA runtime are ported to the device
Kernel execution
Device synchronization
Streams, events, and async. memory operations
Kernel launches are asynchronous
No guarantee the child kernel starts immediately
Synchronization points may cause a context switch
Context of an entire block has to be switched off an SMP
Block-wise locality of resources
Streams and events are valid within a thread block
 
Example
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();
    }
    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}

Thread 0 invokes a grid of child threads
Synchronization does not have to be invoked by all threads
Device synchronization does not synchronize threads in the block
 
Memory Consistency
Global memory writes made by the parent before the child launch are visible to the child grid, and the child's writes become visible to the parent after it synchronizes on the child; pointers to shared or local memory must not be passed to a child grid
 
Depth Limits
Nesting depth (depth of the recursion)
Synchronization depth (deepest level where cudaDeviceSynchronize() is invoked)
Both limits are configured via cudaDeviceSetLimit()
cudaLimitDevRuntimeSyncDepth
cudaLimitDevRuntimePendingLaunchCount (number of pending grids)
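
A minimal sketch of raising these limits from the host before launching a kernel that uses dynamic parallelism (the values are illustrative):

#include <cuda_runtime.h>

void configureDeviceRuntime() {
    // Allow cudaDeviceSynchronize() to be called up to 4 levels deep
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    // Reserve room for up to 4096 pending (not yet started) child grids
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);
}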
 
Compilation Issues
CC 3.5 or higher
Relocatable device code: -rdc=true
Device linking (performed before host linking) against the cudadevrt library
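
For illustration, a typical build could look like this (file names are placeholders; the flags -rdc=true and -lcudadevrt are the essential parts):

nvcc -arch=sm_35 -rdc=true -c kernels.cu -o kernels.o
nvcc -arch=sm_35 kernels.o -o app -lcudadevrt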
 
Very Short Kernels
The launching overhead becomes more significant
Especially if the kernel execution time is measured in the order of microseconds
If many small kernels are executed consecutively, we need to overlap kernel launching and execution
CUDA Graphs allow us to pre-record the sequence (graph) of kernel launches (and their synchronizations) and execute them all at once
 
Example
constexpr int steps = 20;
for (int epoch = 0; epoch < epochs; ++epoch) {
  for (int i = 0; i < steps; ++i) {
    stencilKernel<<<..., stream>>>(...);
  }
  smoothKernel<<<..., stream>>>(...);
  cudaStreamSynchronize(stream);
  ...
}

Assuming the kernels take microseconds to execute and there are many epochs (e.g., thousands).

With CUDA Graphs, the same sequence is captured once and then replayed:

cudaGraph_t graph;
cudaGraphExec_t instance;
bool graphCreated = false;
for (int epoch = 0; epoch < epochs; ++epoch) {
  if (!graphCreated) {
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < steps; ++i) {
      stencilKernel<<<..., stream>>>(...);   // not actually executed, only captured for the graph
    }
    smoothKernel<<<..., stream>>>(...);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, 0);
    graphCreated = true;
  }
  cudaGraphLaunch(instance, stream);         // all captured kernels launched (in order) at once
  cudaStreamSynchronize(stream);
}

The graph holds the structure; the instance is ready to be executed.
 
CUDA Graphs
May be more complex and non-linear
As an alternative to stream capturing, the graph can be constructed manually using API calls (see the sketch after this list)
Graph node types
Kernel execution
CPU function call
Memory copy, memset
Recording an event, waiting for an event
Signaling/waiting on an external semaphore
Child (nested) graph
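
A minimal sketch of the manual construction path with a single kernel node (the kernel name and its arguments are illustrative):

#include <cuda_runtime.h>

__global__ void myKernel(int *data);    // assumed to be defined elsewhere

void buildAndRunGraph(int *d_data, cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    void *kernelArgs[] = { &d_data };
    cudaKernelNodeParams params = {};
    params.func = (void *)myKernel;
    params.gridDim = dim3(64);
    params.blockDim = dim3(256);
    params.sharedMemBytes = 0;
    params.kernelParams = kernelArgs;
    params.extra = nullptr;

    cudaGraphNode_t kernelNode;
    cudaGraphAddKernelNode(&kernelNode, graph, nullptr, 0, &params);   // no dependencies

    cudaGraphExec_t instance;
    cudaGraphInstantiate(&instance, graph, 0);
    cudaGraphLaunch(instance, stream);
}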
 
Wide variety of GPU libraries available
Core NVIDIA libraries
Libcu++, CUB, Thrust
Math libraries
cuBLAS, cuFFT, cuSolver
AI libraries
cuDNN
Other specialized libraries
PhysX, …
Libraries with higher abstraction (and possibly portability)
 
libcu++: Standard C++ Library for CUDA
Header-only lib for heterogeneous support of C++ STL
Extended API with more elaborate sync. primitives and other constructs
Part of CUDA Toolkit and NVIDIA HPC SDK
also in CUDA Core C++ libraries
 
https://github.com/NVIDIA/cccl
 
// Std. C++ (host only)
#include <atomic>
std::atomic<int> x;

// CUDA C++ (host and device, conforming to the standard)
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++ (host and device, extension)
#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;
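
A brief usage sketch (illustrative, not from the slides; assumes a recent libcu++): updating a device-scoped counter from a kernel via cuda::atomic_ref.

#include <cuda/atomic>

// Count elements greater than zero using a device-scoped atomic reference
__global__ void countPositive(const int *data, int n, int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) {
        cuda::atomic_ref<int, cuda::thread_scope_device> cnt(*counter);
        cnt.fetch_add(1, cuda::std::memory_order_relaxed);
    }
}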
 
Standard API
Time
<cuda/std/chrono>
Numerics
<cuda/std/complex>
<cuda/std/ratio>, <cuda/std/cfloat>
<cuda/std/climits>, <cuda/std/cstdint>
Utility
<cuda/std/type_traits>
<cuda/std/tuple>
<cuda/std/functional>
<cuda/std/utility>
<cuda/std/version>
 
Synchronization
Atomics (cuda::atomic, cuda::atomic_ref)
Latches (cuda::latch)
Barriers (cuda::barrier)
Semaphores (cuda::counting_semaphore, cuda::binary_semaphore)
Pipelines (cuda::pipeline)
Coordination mechanism for sequencing async. operations (in stages)
Acquire, commit operations, wait, release
Can be used for individual threads as well as thread groups
 
Async operations
cuda::memcpy_async
 
Thread groups and memory model abstraction
struct ThreadGroup
cuda::thread_scope
 
Memory access properties
cuda::annotated_ptr
cuda::access_property::normal, shared, global, persisting, streaming
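
A rough sketch of cuda::memcpy_async together with a block-scoped cuda::barrier (illustrative names, tile size fixed to the block size of 256; newer architectures accelerate the copy in hardware, older ones fall back to a synchronous path):

#include <cooperative_groups.h>
#include <cuda/barrier>

// Each block stages one 256-element tile into shared memory asynchronously,
// waits on the barrier, and then processes the tile.
__global__ void scaleTile(const float *src, float *dst) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
        init(&bar, block.size());              // initialize the barrier once per block
    block.sync();

    size_t base = (size_t)blockIdx.x * blockDim.x;
    cuda::memcpy_async(block, tile, src + base, sizeof(float) * blockDim.x, bar);
    bar.arrive_and_wait();                     // the tile is now in shared memory

    dst[base + threadIdx.x] = 2.0f * tile[threadIdx.x];
}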
 
CUB reusable components
Also in CUDA Core C++ libraries
Basic parallel collective primitives
Warp/block/device-wide
Prefix-scan, sort, reduction, cooperative I/O, histogram…
Optimized for the target architecture, supports dynamic parallelism, …
Utilities
Iterators
Thread and block I/O (with cache management)
PTX intrinsics
Device, execution, and storage management
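
A minimal sketch of a block-wide collective (illustrative names, not from the slides): each 128-thread block reduces its 128 values with cub::BlockReduce and writes one partial sum.

#include <cub/cub.cuh>

__global__ void blockSums(const int *in, int *partialSums) {
    using BlockReduce = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduce::TempStorage temp;    // collective scratch space

    int value = in[blockIdx.x * blockDim.x + threadIdx.x];
    int sum = BlockReduce(temp).Sum(value);               // block-wide reduction
    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = sum;                    // one result per block
}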
 
CUDA Thrust
Also in CUDA Core C++ libraries
C++ template library based on STL API
The basic idea is to develop C++ parallel applications with minimal overhead (with higher-level abstraction)
STL-like vectors (for host and device) and vector algorithms
 
thrust::host_vector<double> h_vec(N);
thrust::generate(h_vec.begin(), h_vec.end(), [&] { return ... });
 
thrust::device_vector<double> d_vec = h_vec;
 
Yes, this is all we need to allocate the device buffer and copy the data…
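
For completeness, a small self-contained sketch (not from the slides) that fills a host vector, copies it to the device, and runs device-side algorithms on it:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h_vec(1 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), std::rand);          // fill on the host
    thrust::device_vector<int> d_vec = h_vec;                         // allocate on the GPU and copy
    thrust::sort(d_vec.begin(), d_vec.end());                         // parallel sort on the device
    long long sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0LL);  // parallel reduction
    return sum < 0 ? 1 : 0;                                           // keep the result observable
}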
 
Algorithms
Copying (scatter, gather), merging, prefix-sums, reductions, reordering, searching, sorting, setters, generators, transformations, …
Governed by execution policies (thrust::host, thrust::device)
Containers, memory management
thrust::device_vector, device_allocator, memory resources
Iterators
Counting, permutation, zip, …
Others
Basic types, system tools, numerics, random numbers, …
 
CUDA Basic Linear Algebra Subroutines
CUDA implementation of standard BLAS library
Complete support of all 152 functions on vectors/matrices
copy, move, rotate, swap
maximum, minimum, multiply by scalar
sum, dot products, Euclidean norms
matrix multiplications, inverses, linear combinations
Some operations have batch versions
Supports floats, doubles, and complex numbers
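
A minimal sketch of a cuBLAS call (error handling omitted; d_x and d_y are assumed to be device buffers of length n): y = alpha * x + y via SAXPY.

#include <cublas_v2.h>

void saxpy(int n, float alpha, const float *d_x, float *d_y) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // runs asynchronously on the GPU
    cublasDestroy(handle);
}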
 
CUDA Sparse Linear Algebra
Open source C++ library for sparse linear structures (matrices, linear systems, …)
Key features
Sparse matrix operations (add, subtraction, max independent set, polynomial relaxation, …)
Supports various matrix formats
COO, CSR, DIA, ELL, and HYB
Requires CUDA CC 2.0 or higher
 
CUDA Fast Fourier Transform
Decomposes a signal into its frequency spectrum
1-3D transforms (up to 128M elements)
Many variations (precision, complex/real types, …)
API similar to the FFTW library
Create a plan (cufftHandle) which holds the configuration
Associate/allocate workspace (buffers)
cufftExecC2C() (or R2C, C2R) starts execution
FFT plan can be associated with the CUDA stream
For synchronization and overlapping
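
A minimal sketch of the plan/exec workflow (illustrative; d_signal is assumed to be a device buffer of N cufftComplex values): an in-place 1D complex-to-complex transform associated with a stream.

#include <cufft.h>
#include <cuda_runtime.h>

void forwardFFT(cufftComplex *d_signal, int N, cudaStream_t stream) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // the plan holds the configuration (1 batch)
    cufftSetStream(plan, stream);                           // associate the plan with a CUDA stream
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // enqueue the transform (asynchronous)
    cudaStreamSynchronize(stream);                          // wait before destroying the plan
    cufftDestroy(plan);
}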
 
GPU AI for Board Games
Specific AI library designed for games with large, but well-defined configuration space
Requires CUDA CC 2.0
Currently supports
Game Tree Split – alpha/beta pruning
Single and multiple recursion (with large depths)
Zero-sum games (3D Tic-Tac-Toe, Reversi, …)
Sudoku backtracking generator and solver
Statistical simulations (Monte Carlo for Go)
 
PhysX
Real-time physics engine
Originally developed by Ageia for PPU card
NVIDIA bought it and re-implemented it for CUDA
Most important features
Simulation of rigid bodies (collisions, destruction)
Cloth and fluid particle systems
 
APEX
A framework built on top of PhysX
Designed for easy usage (artists, games, …)
 
 
Interoperability
Allows CUDA code to read/write graphical buffers
Works with OpenGL and Direct3D libraries
Motivation
Direct visualization of complex simulations
Augmenting 3D rendering with visualization routines that are difficult to implement in shaders
How it works
The graphics resource is registered and represented by struct cudaGraphicsResource
The resource may be mapped to CUDA memory space
cudaGraphicsMapResources(), …
 
Initialization
Device must be selected by cudaGLSetGLDevice()
Resources
cudaGraphicsGLRegisterBuffer() for buffers
The mapped buffers can be accessed in the same way as CUDA-allocated memory
cudaGraphicsGLRegisterImage() for images and render buffers
The image buffers can also be accessed through texture and surface mechanisms
Examples
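
A rough end-to-end sketch of the OpenGL buffer path (illustrative names; updateVertices is a hypothetical kernel, and the GL buffer object vbo is assumed to be created elsewhere):

#include <cuda_gl_interop.h>
#include <cuda_runtime.h>

__global__ void updateVertices(float4 *verts, int n);    // hypothetical kernel defined elsewhere

cudaGraphicsResource_t resource;

void registerVBO(unsigned int vbo) {
    // Register the GL buffer once; CUDA will overwrite its contents every frame
    cudaGraphicsGLRegisterBuffer(&resource, vbo, cudaGraphicsRegisterFlagsWriteDiscard);
}

void updateFrame(int n) {
    float4 *d_ptr;
    size_t size;
    cudaGraphicsMapResources(1, &resource, 0);                             // make it visible to CUDA
    cudaGraphicsResourceGetMappedPointer((void **)&d_ptr, &size, resource);
    updateVertices<<<(n + 255) / 256, 256>>>(d_ptr, n);                    // write vertex data on the GPU
    cudaGraphicsUnmapResources(1, &resource, 0);                           // hand the buffer back to OpenGL
}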
 
Direct3D Support
Versions 9, 10, and 11 are supported
Each version has its own API
CUDA context may operate with one Direct3D device at a time
And special HW mode must be set on the device
Initialization is similar to OpenGL
cudaD3D[9|10|11]SetDirect3DDevice()
Available Direct3D resources
Buffers, textures, and surfaces
All using cudaGraphicsD3D[9|10|11]RegisterResource()
 