Harnessing OpenMP Offloading for Accelerated Computing

Discover the power of OpenMP offloading in Charm++, exploring how to efficiently utilize accelerators in heterogeneous architectures. Learn about ZAXPY implementation on CPU and GPU, compiler support, performance results on K40 and V100, and more to enhance computational efficiency.

  • OpenMP
  • Accelerated Computing
  • Heterogeneous Architectures
  • Compiler Support
  • Performance Results


Presentation Transcript


  1. Using OpenMP offloading in Charm++ Matthias Diener Charm++ Workshop 2018

  2. OpenMP on accelerators
     • Heterogeneous architectures (CPU + accelerator) are becoming common
     • Main question: how do we use accelerators?
     • Traditionally: CUDA, OpenCL
     • OpenMP is an interesting option:
       • Supports offloading to accelerators since version 4.0
       • No code duplication
       • Uses standard languages
       • Targets different types of accelerators

  3. General overview: ZAXPY in OpenMP (CPU)

     double x[N], y[N], z[N], a;
     // calculate z[i] = a*x[i] + y[i]
     #pragma omp parallel for
     for (int i = 0; i < N; i++)
         z[i] = a*x[i] + y[i];

  4. General overview: ZAXPY in OpenMP (GPU)
     Compiler: generate code for the GPU

     double x[N], y[N], z[N], a;
     // calculate z = a*x + y
     #pragma omp target
     {
         #pragma omp parallel for
         for (int i = 0; i < N; i++)
             z[i] = a*x[i] + y[i];
     }

     Runtime: run the code on the device if possible, copying data from/to the GPU
     • Code is unmodified except for the pragmas
     • Data is implicitly copied
     • All calculation is done on the device

  5. Compiler support

     Compiler   Offload version   Device types
     GCC        4.5               Nvidia GPU, Xeon Phi
     Clang      4.5               Nvidia GPU, AMD GPU
     Flang      n/a               n/a
     icc        4.5               Xeon Phi
     Cray cc    4.0               Nvidia GPU
     IBM xl     4.5               Nvidia GPU
     PGI        n/a               n/a

     Limitations:
     • Static linking only
     • Requires a recent linker
     • No C++ exceptions
     • Not all operations are offloadable (e.g., I/O, network, ...)

  6. Performance results: K40
     [Chart: ZAXPY on Intel Core Q6600 + Nvidia Kepler K40, gcc 7.3. Execution time (microseconds, log scale) of OMP CPU, OMP GPU, and CUDA over vector sizes 2^10 to 2^28.]

  7. Performance results: V100
     [Chart: ZAXPY on IBM Power 9 + Nvidia Volta V100, xl 13.1.7 beta2. Execution time (microseconds, log scale) of OMP CPU, OMP GPU, and CUDA over vector sizes 2^10 to 2^29.]

  8. Using OpenMP offloading in Charm++/AMPI

  9. Using OpenMP offloading in Charm++
     • Current Charm++ includes LLVM-based OpenMP, but currently without offloading
     Build:
     • Build Charm++ as usual, with an offloading-enabled compiler
     • Do not specify the omp option
     • No need to add -fopenmp (or similar) options
     Application:
     • Can use OpenMP pragmas directly
     • Needs to take care of data consistency for migration
     • Compile with charmc/ampicc, passing the compiler's OpenMP/offloading option:
         charmc -fopenmp file.cpp
         charmc -qsmp -qoffload file.cpp

  10. Example: Jacobi3D
      • Modified the Jacobi3D application to use OpenMP
      • Run on the Ray machine (Power8 + P100), XL 13.1.7 beta2
      • Two input sets: small (100*100*100), large (1000*100*100)
      [Chart: iteration time (s) of the CPU and offloaded versions on the small and large inputs.]

  11. Nvidia Visual Profiler

  12. Conclusions and next steps
      • OpenMP provides a simple way to use accelerators
      • Reasonable performance on GPUs compared to CUDA
      • Main challenge: comprehensive compiler support
      • Can be used easily in Charm++/AMPI
      Next steps:
      • Extend the integrated LLVM OpenMP to support offloading
      • Interface with GPU Manager

  13. Questions?
