GPU Acceleration in ITK v4 Overview
This presentation by Won-Ki Jeong from Harvard University at the ITK v4 winter meeting in 2011 discusses the implementation and advantages of GPU acceleration in ITK v4. Topics covered include the use of GPUs as co-processors for massively parallel processing, memory and process management, new GPU classes introduced, and the setup in CMake for GPU code. It also outlines the current status, future work, and platforms supported for GPU acceleration in ITK v4.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
GPU Acceleration in ITK v4 ITK v4 winter meeting Feb 2nd2011 Won-Ki Jeong, Harvard University (wkjeong@seas.harvard.edu)
Overview Introduction Current status in GPU ITK v4 GPU managers GPU image GPU image filters Examples Future work 2
GPU Acceleration GPU as a fast co-processor Massively parallel Huge speed up for certain types of problem Physically independent system Problems Memory management Process management Implementation 3
Quick Summary Is GPU Draft Implementation done? Yes: http://review.source.kitware.com/#change,800 What do we have now? Basic GPU computing framework GPU image and filter class Pipeline and Object Factory supports Basic CMake setup for GPU code 5
CMake Setup ITK_USE_GPU OFF by default Self-contained in Code/GPU except a few minor modification of existing files OpenCL source file location binary_dir/Code/GPU binary_dir is written into pathToOpenCLSourceCode.h 6
Platforms: NVIDIA, ATI, Intel Devices Context Programs Command Queues Kernels Kernels GPU Images 7
New GPU Classes GPUContextManager GPUDataManager GPUImageDataManager GPUKernelManager Basic GPU Objects GPUImage GPUImageToImageFilter GPUMeanImageFilter ITK Object Extension 8
GPU Context Manager Global GPU resource manager One instance per process All GPU objects should have a pointer to it Resources Platforms Devices Contexts Command queues GetCommandQueue(), GetNumCommandQueue() 9
GPU Data Manager Base class to manage GPU memory GPU data container Synchronization between CPU & GPU memory Synchronization APIs: Data Container APIs: SetCPUDirtyFlag() SetGPUDirtyFlag() SetCPUBufferDirty() SetGPUBufferDirty() MakeCPUBufferUpToDate() MakeGPUBufferUpToDate() MakeUpToDate() SetBufferSize() SetCPUBufferPointer() Allocate() protected: GetGPUBufferPointer() 10
Synchronization Dirty flags Lightweight Pixel access functions in GPU images Time stamp Better to use sparingly Pipeline 11
Data Manager Example unsigned int arraySize = 100; // create CPU memory float *a = new float[arraySize]; // create GPU memory GPUDataManager::Pointer b = GPUDataManager::New(); b->SetBufferSize(arraySize*sizeof(float)); b->SetCPUBufferPointer(a); b->Allocate(); // change values in CPU memory a[10] = 8; // mark GPU as dirty and synchronize CPU -> GPU b->SetGPUBufferDirty(); b->MakeUpToDate(); 12
Data Manager Example (contd) // change values in GPU memory ... (run GPU kernel) // mark CPU as dirty and synchronize GPU -> CPU b->SetCPUBufferDirty(); b->MakeUpToDate(); 13
Create Your Own GPU Data Manager GPUDataManager GPUImageDataManager GPUMeshDataManager GPUVideoDataManager SetImagePointer() MakeCPUBufferUpToDate( ) MakeGPUBufferUpToDate( ) .... .... 14
GPU Image Derived from itk::Image Compatible to existing ITK filters GPUImageDataManager as a member Separate GPU implementation from Image class Implicit(automatic) synchronization Override CPU buffer access functions to properly set the dirty buffer flag Provide a single view of CPU/GPU memory 15
GPU Image itk::Image::GPUImage itk::Image FillBuffer() GetPixel() SetPixel() GetBufferPointer() GetPixelAccessor() GetNeighborhoodAccessor() ... FillBuffer() GetPixel() SetPixel() GetBufferPointer() GetPixelAccessor() GetNeighborhoodAccessor() ... GPUImageDataManager CPU Memory SetGPUBufferDirty() MakeUpToDate() 16
GPU Kernel Manager Load and compile GPU source code LoadProgramFromFile() Create GPU kernels CreateKernel() Execute GPU kernels SetKernelArg() SetKernelArgWithImage() LaunchKernel() 17
Kernel Manager Example // create GPU images itk::GPUImage<float,2>::Pointer srcA, srcB, dest; srcA = itk::GPUImage<float,2>::New(); ... // create GPU kernel manager GPUKernelManager::Pointer kernelManager = GPUKernelManager::New(); // load program and compile kernelManager->LoadProgramFromFile( ImageOps.cl , ); "#define PIXELTYPE float\n" // create ADD kernel int kernel_add = kernelManager->CreateKernel("ImageAdd"); 18
Kernel Manager Example(contd) unsigned int nElem = 256*256; // set kernel arguments kernelManager->SetKernelArgWithImage(kernel_add, 0, kernelManager->SetKernelArgWithImage(kernel_add, 1, kernelManager->SetKernelArgWithImage(kernel_add, 2, kernelManager->SetKernelArg(kernel_add, 3, sizeof(unsigned int), &nElem); srcA->GetGPUDataManager()); srcB->GetGPUDataManager()); dest->GetGPUDataManager()); // launch kernel kernelManager->LaunchKernel2D(kernel_add, 256, 256, 16, 16); 19
OpenCL Source Code Example // // pixel by pixel addition of 2D images // __kernel void ImageAdd(__global const PIXELTYPE* a, __global const PIXELTYPE* b, __global PIXELTYPE* c, unsigned int nElem) { unsigned int width = get_global_size(0); unsigned int gix = get_global_id(0); unsigned int giy = get_global_id(1); unsigned int gidx = giy*width + gix; // bound check if (gidx < nElem) { c[gidx] = a[gidx] + b[gidx]; } } 20
GPUImageToImageFilter Base class for GPU image filter Extend existing itk filters using CRTP template< class TInputImage, class TOutputImage, class TParentImageFilter > class ITK_EXPORT GPUImageToImageFilter: public TParentImageFilter { ... } Turn on/off GPU filter IsGPUEnabled(bool) GPU filter implementation GPUGenerateData() 21
Create Your Own GPU Image Filter Step 1: Derive your filter from GPUImageToImageFilter using an existing itk image filter Step 2: Load and compile GPU source code and create kernels in the constructor Step 3: Implement filter by calling GPU kernels in GPUGenerateData() 22
Example: GPUMeanImageFilter Step 1: Class declaration template< class TInputImage, class TOutputImage > class ITK_EXPORT GPUMeanImageFilter : public GPUImageToImageFilter< TInputImage, TOutputImage, MeanImageFilter< TInputImage, TOutputImage > > { ... } 23
Example: GPUMeanImageFilter Step 2: Constructor template< class TInputImage, class TOutputImage > GPUMeanImageFilter< TInputImage, TOutputImage>::GPUMeanImageFilter() { char buf[100]; // OpenCL source path char oclSrcPath[100]; sprintf(oclSrcPath, "%s/Code/GPU/GPUMeanImageFilter.cl", itk_root_path); // load and build OpenCL program m_KernelManager->LoadProgramFromFile( oclSrcPath, buf ); // create GPU kernel m_KernelHandle = m_KernelManager->CreateKernel("MeanFilter"); } Defined in pathToOpenCLSourceCode.h 24
Example: GPUMeanImageFilter Step 3: GPUGenerateData() template< class TInputImage, class TOutputImage > void GPUMeanImageFilter< TInputImage, TOutputImage >::GPUGenerateData() { typedef itk::GPUTraits< TInputImage >::Type GPUInputImage; typedef itk::GPUTraits< TOutputImage >::Type GPUOutputImage; // get input & output image pointer GPUInputImage::Pointer inPtr = dynamic_cast< GPUInputImage * >( this->ProcessObject::GetInput(0) ); GPUOutputImage::Pointer otPtr = dynamic_cast< GPUOutputImage * >( this->ProcessObject::GetOutput(0) ); GPUOutputImage::SizeType outSize = otPtr->GetLargestPossibleRegion().GetSize(); int radius[3], imgSize[3]; for(int i=0; i<(int)TInputImage::ImageDimension; i++) { radius[i] = (this->GetRadius())[i]; imgSize[i] = outSize[i]; } 25
(Continued..) size_t localSize[2], globalSize[2]; localSize[0] = localSize[1] = 16; globalSize[0] = localSize[0]*(unsigned int)ceil((float)outSize[0]/(float)localSize[0]); globalSize[1] = localSize[1]*(unsigned int)ceil((float)outSize[1]/(float)localSize[1]); // kernel arguments set up int argidx = 0; m_KernelManager->SetKernelArgWithImage(m_KernelHandle, argidx++, inPtr->GetGPUDataManager()); m_KernelManager->SetKernelArgWithImage(m_KernelHandle, argidx++, otPtr->GetGPUDataManager()); for(int i=0; i<(int)TInputImage::ImageDimension; i++) m_KernelManager->SetKernelArg(m_KernelHandle, argidx++, sizeof(int), &(radius[i])); for(int i=0; i<(int)TInputImage::ImageDimension; i++) m_KernelManager->SetKernelArg(m_KernelHandle, argidx++, sizeof(int), &(imgSize[i])); // launch kernel m_KernelManager->LaunchKernel(m_KernelHandle, (int)TInputImage::ImageDimension, globalSize, localSize); } 26
Pipeline Support Allow combining CPU and GPU filters Efficient CPU/GPU synchronization Currently ImageToImageFilter is supported ReaderType::Pointer reader = ReaderType::New(); WriterType::Pointer writer = WriterType::New(); GPUMeanFilterType::Pointer filter1 = GPUMeanFilterType::New(); GPUMeanFilterType::Pointer filter2 = GPUMeanFilterType::New(); ThresholdFilterType::Pointer filter3 = ThresholdFilterType::New(); (GPU) (GPU) (CPU) Reader Filter1 Filter2 Filter3 (CPU) Writer (CPU) filter1->SetInput( reader->GetOutput() ); // copy CPU->GPU implicitly filter2->SetInput( filter1->GetOutput() ); filter3->SetInput( filter2->GetOutput() ); writer->SetInput( filter3->GetOutput() ); // copy GPU->CPU implicitly Synchronize Synchronize writer->Update(); 28
Complicated Filter Design Multiple kernel launches Single kernel, multiple calls Multiple kernels Design choices Each kernel is a filter, pipelining Reusable, memory overhead Put multiple kernels in a single filter Not reusable, less memory overhead 29
Object Factory Support Create GPU object when possible No need to explicitly define GPU objects // register object factory for GPU image and filter objects ObjectFactoryBase::RegisterFactory(GPUImageFactory::New()); ObjectFactoryBase::RegisterFactory(GPUMeanImageFilterFactory::New()); typedef itk::Image< InputPixelType, 2 > InputImageType; typedef itk::Image< OutputPixelType, 2 > OutputImageType; typedef itk::MeanImageFilter< InputImageType, OutputImageType > MeanFilterType::Pointer filter = MeanFilterType::New(); MeanFilterType; 30
Type Casting Image must be casted to GPUImage for auto-synchronization for non-pipelined workflow with object factory Use GPUTraits template <class T> class GPUTraits { public: typedef T Type; }; template <class T, unsigned int D> class GPUTraits< Image< T, D > > { public: typedef GPUImage<T,D> Type; }; InputImageType::Pointer img; typedef itk::GPUTraits< InputImageType >::Type GPUImageType; GPUImageType::Pointer otPtr = dynamic_cast< GPUImageType* >( img ); 31
Examples test_itkGPUImage.cxx Simple image algebra Multiple kernels and command queues test_itkGPUImageFilter.cxx GPUMeanImageFilter Pipeline and object factory ctest R gpuImageFilterTest -V 32
ToDo List Multi-GPU support GPUThreadedGenerateData() InPlace filter base class Grafting for GPUImage GPUImage internal types Buffer, image (texture) Basic filters Level set, registration, etc 33
Useful Links Current code in gerrit http://review.source.kitware.com/#change,800 OpenCL http://www.khronos.org/opencl/ http://www.nvidia.com/object/cuda_opencl_new.html http://developer.amd.com/zones/OpenCLZone/pages/de fault.aspx http://software.intel.com/en-us/articles/intel-opencl-sdk/ 34