OpenACC Compiler for CUDA: A Source-to-Source Implementation


An open-source OpenACC compiler for NVIDIA GPUs that uses a source-to-source approach, leaving detailed machine-specific optimization to the mature CUDA compiler. The compiler targets C, translates directives into CUDA API calls, and produces an executable file.





Presentation Transcript


  1. A Source-to-Source OpenACC Compiler for CUDA. Akihiro Tabuchi(1), Masahiro Nakao(2), Mitsuhisa Sato(1). (1) Graduate School of Systems and Information Engineering, University of Tsukuba; (2) Center for Computational Sciences, University of Tsukuba

  2. Outline
  - Background
  - OpenACC
  - Compiler Implementation
  - Performance Evaluation
  - Conclusion & Future Work

  3. Background
  - Accelerator programming models: CUDA (for NVIDIA GPUs), OpenCL (for various accelerators)
  - Accelerator programming is complex: memory management and kernel functions lead to low productivity and low portability
  - OpenACC was proposed to solve these problems

  4. OpenACC
  - A directive-based programming model for accelerators; supports C, C++, and Fortran
  - Offloading model: offloads parts of the code to an accelerator
  - High productivity: only requires adding directives
  - High portability: runs on any accelerator as long as the compiler supports it

  5. Example of OpenACC

    int main(){
      int i;
      int a[N], b[N], c[N];
      /* initialize array a and b */
      #pragma acc parallel loop copyin(a,b) copyout(c)
      for(i = 0; i < N; i++){
        c[i] = a[i] + b[i];
      }
    }

  This directive specifies data transfers, loop offloading, and parallelization.
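The slide's example can be made self-contained; the sketch below assumes a value for N and simple initialization (both mine, not from the slide). A plain C compiler ignores the pragma and runs the loop serially, while an OpenACC compiler offloads it, so the program is checkable either way.

```c
#include <assert.h>
#define N 1024

/* Self-contained version of the slide's vector addition.
   Returns 1 if every element of c holds a[i] + b[i]. */
int vecadd_check(void) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    /* With an OpenACC compiler this loop is offloaded; otherwise
       the pragma is ignored and the loop runs on the CPU. */
    #pragma acc parallel loop copyin(a,b) copyout(c)
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    for (int i = 0; i < N; i++)
        if (c[i] != 3 * i) return 0;
    return 1;
}
```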

  6. Purpose of Research
  - Design and implement an open-source OpenACC compiler
    - Target language: C
    - Target accelerator: NVIDIA GPU
  - Source-to-source approach: C + OpenACC is translated to C + CUDA API
    - This approach leaves detailed machine-specific code optimization to NVIDIA's mature CUDA compiler
  - The result of compilation is an executable file

  7. Related Work
  - Commercial compilers: PGI Accelerator compiler, CAPS HMPP, Cray compiler
  - Open-source compiler: accULL, developed at the University of La Laguna in Spain
    - Source-to-source translation; backends are CUDA and OpenCL
    - Output is source code and a Makefile

  8. OpenACC directives (OpenACC specification 1.0)
  - parallel, kernels, loop, data, host_data, update, wait, cache, declare
  - combined constructs: parallel loop, kernels loop

  9. data construct

    int a[4];
    #pragma acc data copy(a)
    {
      /* some code using a on the device */
    }

  - Manages data on the accelerator (host memory vs. device memory)
  - If an array is specified in a copy clause:
    1. Device memory allocation (at the beginning of the region)
    2. Data transfer from host to device (at the beginning of the region)
    3. Data transfer from device to host (at the end of the region)
    4. Device memory release (at the end of the region)

  10. Translation of data construct

  Input:

    int a[4];
    #pragma acc data copy(a)
    {
      /* some code using a */
    }

  Output:

    int a[4];
    {
      void *_ACC_DEVICE_ADDR_a, *_ACC_HOST_DESC_a;
      /* allocate a on the GPU (records host address, device address, size) */
      _ACC_gpu_init_data(&_ACC_HOST_DESC_a, &_ACC_DEVICE_ADDR_a, a, 4*sizeof(int));
      /* copy a from host to GPU */
      _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 400);
      {
        /* some code using a */
      }
      /* copy a from GPU to host */
      _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 401);
      /* free a on the GPU */
      _ACC_gpu_finalize_data(_ACC_HOST_DESC_a);
    }

  11. parallel construct

    #pragma acc parallel num_gangs(1) vector_length(128)
    {
      /* code in parallel region */
    }

  - Code in the parallel region is executed on the device
  - Three levels of parallelism: gang, worker, vector
  - The number of gangs or workers, or the vector length, can be specified by clauses
  - Mapping to CUDA: gang = thread block, worker = (warp), vector = thread

  12. Translation of parallel construct

    #pragma acc parallel num_gangs(1) vector_length(128)
    {
      /* code in parallel region */
    }

  is translated into a GPU kernel function and a kernel launch function:

    /* GPU kernel function */
    __global__ static void _ACC_GPU_FUNC_0_DEVICE( ... )
    {
      /* code in parallel region */
    }

    /* kernel launch function */
    extern "C" void _ACC_GPU_FUNC_0( ... )
    {
      dim3 _ACC_block(1, 1, 1), _ACC_thread(128, 1, 1);
      _ACC_GPU_FUNC_0_DEVICE<<<_ACC_block, _ACC_thread>>>( ... );
      _ACC_GPU_M_BARRIER_KERNEL();
    }

  13. loop construct

    /* inside a parallel region */
    #pragma acc loop vector
    for(i = 0; i < 256; i++){
      a[i]++;
    }

  - The loop construct describes the parallelism of a loop
  - Loop iterations are distributed among gangs, workers, or vectors
  - Two or more levels of parallelism can be specified for a single loop
  - Loops without a loop directive in a parallel region are basically executed serially

  14. Translation of loop construct (1/3)

    /* inside a parallel region */
    #pragma acc loop vector
    for(i = 0; i < N; i++){
      a[i]++;
    }

  1. A virtual index with the same length as the loop iteration count is prepared
  2. The virtual index is divided and distributed among blocks and/or threads
  3. Each thread calculates the value of the loop variable from the virtual index and executes the loop body

  15. Translation of loop construct (2/3)

    /* inside a parallel region */
    #pragma acc loop vector
    for(i = 0; i < N; i++){
      a[i]++;
    }

  is translated into the following GPU kernel code:

    /* inside a GPU kernel */
    int i, _ACC_idx;                       /* virtual index */
    int _ACC_init, _ACC_cond, _ACC_step;   /* virtual index range */
    /* calculate the range of the virtual index for this thread */
    _ACC_gpu_init_thread_x_iter(&_ACC_init, &_ACC_cond, &_ACC_step, 0, N, 1);
    for(_ACC_idx = _ACC_init; _ACC_idx < _ACC_cond; _ACC_idx += _ACC_step){
      _ACC_gpu_calc_idx(_ACC_idx, &i, 0, N, 1);  /* calculate i from the virtual index */
      a[i]++;                                    /* loop body */
    }
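The virtual-index scheme can be emulated on the host. The sketch below assumes a cyclic distribution (each thread starts at its own virtual index and strides by the thread count); the helper names are mine and the actual runtime functions may distribute differently.

```c
/* Hypothetical emulation of the runtime's virtual-index setup:
   cyclic distribution of iterations lower..upper (stride `stride`)
   among `num_threads` threads. */
static void init_iter(int tid, int num_threads,
                      int lower, int upper, int stride,
                      int *init, int *cond, int *step) {
    int n = (upper - lower + stride - 1) / stride; /* virtual index length */
    *init = tid;          /* each thread starts at its own virtual index */
    *cond = n;            /* virtual indices run over [0, n) */
    *step = num_threads;  /* threads advance cyclically */
}

/* Map a virtual index back to the loop variable. */
static int calc_idx(int vidx, int lower, int stride) {
    return lower + vidx * stride;
}
```

With 4 simulated threads over a 10-iteration loop, every iteration is executed exactly once, matching the slide's claim that the virtual index covers the iteration space without overlap.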

  16. Translation of loop construct (3/3)

    #pragma acc loop gang vector
    for(i = 0; i < N; i++)
      #pragma acc loop gang vector
      for(j = 0; j < N; j++)
        /* ... */

  - Our compiler supports 2D blocking for nested loops
  - Nested loops are distributed among the 2D blocks of a 2D grid in CUDA (default block size: 16x16)
  - However, this is not allowed in OpenACC 2.0, where the tile clause is provided instead

  17. Compiler Implementation
  - Our compiler translates C with OpenACC directives into C with CUDA API calls
  - It reads C code with directives and outputs translated code using the Omni compiler infrastructure
  - Omni compiler infrastructure: a set of programs for building source-to-source compilers with code analysis and transformation; supports C and Fortran 95

  18. Flow of Compilation
  1. sample.c (C with OpenACC directives) is parsed by the Omni Frontend into sample.xml (XcodeML)
  2. The OpenACC translator converts sample.xml into sample_tmp.c (C with ACC API calls) and sample.cu (CUDA)
  3. sample_tmp.c is compiled by a C compiler into sample_tmp.o; sample.cu is compiled by nvcc into sample.gpu.o
  4. The objects are linked with the acc runtime into a.out
  (The Omni Frontend and the OpenACC translator are built on the Omni compiler infrastructure.)

  19. Performance Evaluation
  - Benchmarks: matrix multiplication, N-body problem, NAS Parallel Benchmarks CG
  - Evaluation environment: 1 node of Cray XK6m-200
    - CPU: AMD Opteron Processor 6272 (2.1 GHz)
    - GPU: NVIDIA X2090 (MatMul, N-body); NVIDIA K20 (NPB CG)

  20. Performance Comparison
  - Cray compiler
  - Our compiler
  - Hand-written CUDA: the code is written in CUDA and compiled by NVCC; it does not use the GPU's shared memory
  - Our compiler (2D blocking): the code uses 2D blocking and is compiled by our compiler; applied only to matrix multiplication

  21. Matrix multiplication
  [Chart: relative performance against CPU for matrix sizes 1K, 2K, 4K, and 8K; series: Cray compiler, hand-written CUDA, our compiler, our compiler with 2D blocking; speedups range from about 1.4x-1.5x to 4.6x-5.5x]
  - The performance of our compiler with 2D blocking and of hand-written CUDA is slightly lower

  22. Matrix multiplication
  - Our compiler achieves better performance than the Cray compiler
    - The PTX code generated directly by the Cray compiler has more operations in the innermost loop
    - Our compiler outputs CUDA code, and NVCC generates more optimized PTX code
  - 2D blocking gives lower performance
    - The default 2D block size (16x16) is not adequate for this program; the best block size was 512x2
    - The hand-written CUDA code also uses 16x16 blocks

  23. N-body
  [Chart: relative performance against CPU for 1K to 32K particles; series: Cray compiler, hand-written CUDA, our compiler; speedups range from 0.95x at small sizes up to 31x]
  - At small problem sizes, the performance of our compiler is lower than that of the Cray compiler

  24. N-body
  - At small problem sizes, performance became worse
    - Decline in the utilization of Streaming Multiprocessors (SMs)
    - A kernel is executed on the SMs one thread block per SM at a time
    - If the number of blocks is smaller than the number of SMs, kernel performance becomes low
  - Default block size: Cray compiler, 128 threads/block; our compiler, 256 threads/block
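The utilization argument is simple arithmetic: a 1-D kernel over n elements with block_size threads per block launches ceil(n / block_size) blocks, and when that count falls below the number of SMs, some SMs sit idle. A minimal sketch (the helper name is mine):

```c
/* Blocks launched for a 1-D kernel: ceil(n / block_size). */
static int num_blocks(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}
```

For 1K particles, our compiler's 256-thread default yields only 4 blocks while the Cray compiler's 128-thread default yields 8, which is consistent with the slide's observation that our compiler is slower at small sizes.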

  25. NPB CG
  [Chart: relative performance against CPU by class (matrix size); series: Cray compiler, our compiler; results range from 0.66x and 0.74x at the smallest class up to 2.1x and 9.7x]
  - At the smallest class, the performance is lower than that of the CPU and the Cray compiler

  26. NPB CG
  - At class S, GPU performance is lower than CPU performance
    - Overheads are large compared with kernel execution time: launching kernel functions, synchronization with the device, and data allocation/release/transfer
  - Our overhead is larger than that of the Cray compiler
    - Large overhead of reduction
  - The performance of the GPU kernels themselves is better than that of the Cray compiler

  27. Conclusion
  - We implemented a source-to-source OpenACC compiler for CUDA
    - Translates C with OpenACC directives into C with CUDA API calls, using the Omni compiler infrastructure
  - In most cases, the GPU code produced by our compiler outperforms a single CPU core
    - Speedup of up to 31x on N-body
  - Our compiler successfully exploits the CUDA backend through the source-to-source approach
    - Its performance is often better than that of the Cray compiler
  - There is room for performance improvement
    - Using suitable grid and block sizes
    - Reducing the overhead of synchronization and reduction

  28. Future Work
  - Optimization: tuning block size at compile time; reducing the overhead of synchronization and reduction
  - Supporting the full set of directives to conform to the OpenACC specification
  - We will release our compiler at the next SC
