Fast Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
This work targets efficient noncontiguous data movement between GPUs in hybrid MPI+GPU environments. It uses MPI derived datatypes to express noncontiguous messages and improve communication performance on GPU-accelerated systems, building on accelerator-aware data movement within MPI that supports both OpenCL and CUDA devices. Its contributions are a fine-grained parallel, GPU-optimized algorithm for processing MPI datatypes directly on data resident in GPU memory, generalized to any datatype, and an investigation of resource contention for both DMA-based and kernel-based packing on the GPU.
Presentation Transcript
Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
John Jenkins, Nagiza Samatova (North Carolina State University); James Dinan, Pavan Balaji, Rajeev Thakur (Argonne National Laboratory)
IEEE Cluster 2012. Contact: Pavan Balaji (balaji@mcs.anl.gov)
Motivating Example: Noncontiguous Message Passing Enabled through MPI Derived Datatypes
Example: halo exchange between GPU-resident matrices. GPU and CPU memories are distinct, so each exchange involves PCIe communication between GPU and CPU memory and network communication between nodes.
[Figure: GPU matrices on neighboring nodes exchange halo regions; data crosses PCIe to the host CPU, then the network to the remote CPU and GPU.]
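The halo exchange in the figure can be expressed with a derived datatype. The sketch below is a minimal illustration, assuming a row-major N x N matrix and a GPU-aware (MPI-ACC-style) MPI that accepts device pointers directly; with a stock MPI, the same datatype would be applied to a host staging buffer. The function and variable names (exchange_halo_columns, d_matrix, left, right) are ours, not from the paper.

```cuda
#include <mpi.h>

// Exchange one halo column of a row-major N x N matrix of doubles.
// Only one direction is shown: send the last interior column to the
// right neighbor, receive the left neighbor's column into column 0.
void exchange_halo_columns(double *d_matrix, int N, int left, int right,
                           MPI_Comm comm)
{
    MPI_Datatype column;                            // one noncontiguous column
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);  // N blocks of 1 double, stride N
    MPI_Type_commit(&column);

    MPI_Sendrecv(&d_matrix[N - 2], 1, column, right, 0,   // column N-2 out
                 &d_matrix[0],     1, column, left,  0,   // column 0 in
                 comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
}
```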
MPI-ACC: Accelerator-Aware Data Movement Within MPI
Supports multiple accelerator models (OpenCL and CUDA) and portably leverages system- and vendor-specific optimizations: attribute-driven handling, NUMA-affinity awareness, PCIe-affinity awareness, OpenCL handle caching, I/O-hub awareness, IPC handle reuse, and low overhead. It covers contiguous and noncontiguous [4] data movement, both inter-node [1] and intra-node [2,3].
1. Aji et al. MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems. HPCC '12.
2. Ji et al. DMA-Assisted, Intranode Communication in GPU Accelerated Systems. HPCC '12.
3. Ji et al. Efficient Intranode Communication in GPU-Accelerated Systems. ASHES '12 Workshop.
4. Jenkins et al. Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments. Cluster '12.
Contributions
- An MPI datatype processing algorithm for data in GPU memory: generalized to any datatype, fine-grained parallel, and GPU-optimized.
- An investigation of resource contention scenarios for packing on the GPU, for both DMA-based and kernel-based packing.
GPU Memory Spaces, Interconnect
[Figure: streaming multiprocessors (SMs), each with shared memory, sit above a shared L2 cache and global memory (GDDR); the GPU connects through a PCIe controller to the NIC, host RAM, and CPU.]
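For orientation, the memory spaces in the figure can be inspected at runtime with the CUDA device-properties API. This short sketch is ours, not part of the paper; it prints the SM count, per-block shared memory, L2 size, global memory, and whether mapped (zero-copy) host memory is available.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Print the memory hierarchy of one device, mirroring the figure above.
void print_gpu_memory_spaces(int dev)
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, dev);
    printf("SMs: %d, shared mem/block: %zu B, L2: %d B, global: %zu B, "
           "can map host memory: %d\n",
           p.multiProcessorCount, p.sharedMemPerBlock, p.l2CacheSize,
           p.totalGlobalMem, p.canMapHostMemory);
}
```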
Datatypes Example: Vector
Derived datatypes enable communication and I/O on noncontiguous data while making good use of PCIe, network, and disk bandwidth.
MPI_Type_vector(count, blocklength, stride, oldtype, newtype)
- count: number of blocks
- blocklength: number of contiguous oldtype elements per block
- stride: distance between the starts of consecutive blocks (in units of oldtype; the hvector variant uses bytes)
Example: MPI_Type_vector(3, 1, 2, DOUBLE, v0) builds a strided vector of doubles, and MPI_Type_vector(4, 1, 3, v0, v1) nests it into a larger type. Packing gathers the selected elements into a contiguous buffer; note the difference between a type's size (bytes actually transferred) and its extent (bytes spanned in memory).
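As a concrete host-side illustration of the size/extent distinction, here is a minimal sketch built around the slide's MPI_Type_vector(3, 1, 2, DOUBLE, v0) example. The function name is ours; under a GPU-aware MPI the same datatype could describe a device buffer.

```cuda
#include <mpi.h>
#include <stdio.h>

// Build the slide's vector type and report its size vs. extent.
void vector_size_vs_extent(void)
{
    MPI_Datatype v0;
    MPI_Type_vector(3, 1, 2, MPI_DOUBLE, &v0);   // 3 blocks of 1 double, stride 2
    MPI_Type_commit(&v0);

    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(v0, &size);                    // 3 * 8 = 24 bytes actually sent
    MPI_Type_get_extent(v0, &lb, &extent);       // 5 * 8 = 40 bytes spanned in memory
    printf("size = %d bytes, extent = %ld bytes\n", size, (long)extent);

    MPI_Type_free(&v0);
}
```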
GPU Datatype Processing Goals (vs. CPU processing)
- Fine-grained parallel (CPU processing is inherently serial)
- No inter-thread dependencies (CPU packing keeps stack-based packing state)
- Computation/transfer overlap (CPU processing does the same, but at the network level)
- Compact datatype representation (the CPU representation is tree-based)
Datatype Processing Algorithm
Key insight: the location of each primitive element (double, int, etc.) can be determined using only the datatype encoding plus the number of primitive elements per datatype.
Input: element (thread) ID. Output: read and write offsets.
read_offset = 0; write_offset = 0
repeat:
  read_offset += f(ID, type)
  write_offset += g(ID, type)
  load the child type
until the type is a leaf
How to compute f and g? See the vector example on the next slide.
Example Offset Computation: Vector
[Figure: a strided input buffer of blocks mapping to a contiguous output buffer via offset_in and offset_out.]
The same computation applies regardless of datatype composition, and a child type need not know anything about its parent:
child_offset = ID / type.child_elements
block_offset = child_offset / type.blocklength
offset_in += block_offset * type.stride + (child_offset % type.blocklength) * type.extent
offset_out += child_offset * type.size
ID = ID % type.child_elements
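Putting the traversal loop and the vector formulas together, the following is a minimal CUDA sketch of a packing kernel for a single-level vector of doubles. It is our illustration, not the paper's implementation: the VectorType fields mirror the slide's notation, and stride, extent, and size are assumed to be byte quantities in the flattened encoding.

```cuda
// Sketch: pack a single-level vector of doubles using the per-thread
// offset formulas above (one level of the type tree, leaf child).
struct VectorType {
    int  count;           // number of blocks
    int  blocklength;     // doubles per block
    long stride;          // bytes between block starts
    long extent;          // bytes per child element (sizeof(double) here)
    long size;            // packed bytes per child element
    int  child_elements;  // primitives per child type (1 for a leaf double)
};

__global__ void pack_vector(const char *src, char *dst, VectorType t,
                            int total_elements)   // count * blocklength
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= total_elements) return;

    int  child_offset = id / t.child_elements;
    int  block_offset = child_offset / t.blocklength;
    long offset_in    = block_offset * t.stride
                      + (child_offset % t.blocklength) * t.extent;
    long offset_out   = (long)child_offset * t.size;

    *(double *)(dst + offset_out) = *(const double *)(src + offset_in);
}
```

For the slide's MPI_Type_vector(3, 1, 2, DOUBLE) example, thread 0 reads byte 0, thread 1 reads byte 16, thread 2 reads byte 32, and they write bytes 0, 8, and 16 of the packed buffer, respectively.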
GPU Datatype Representation
Each type stores common fixed fields: count, size, extent, and number of primitives. Type-specific fields:
- Contig: no additional fields
- Vector: stride, blocklength (fixed)
- Indexed: lookaside offset (fixed); blocklengths, displacements (variable)
- Struct: lookaside offset (fixed); blocklengths, displacements, types as child type IDs (variable)
- Subarray: dimensions, lookaside offset (fixed); start offsets, sizes, subsizes (variable)
The in-order buffer of fixed fields is cached in shared memory; the lookaside buffer of variable-length fields is cached if there is enough space.
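A possible flattened record matching the table might look like the sketch below. This is our guess at a layout, not the paper's exact encoding: fixed-width fields live in the in-order buffer, while variable-length arrays live in a separate lookaside buffer referenced by offset.

```cuda
// Sketch of a flattened GPU datatype record (illustrative layout only).
enum TypeKind { CONTIG, VECTOR, INDEXED, STRUCT, SUBARRAY };

struct GpuTypeRecord {
    // Common fixed fields (in-order buffer, cached in shared memory)
    TypeKind kind;
    int  count;
    long size;            // packed bytes per instance
    long extent;          // bytes spanned in memory per instance
    long num_primitives;  // primitive elements per instance
    // Type-specific fixed fields; variable-length data sits in the
    // lookaside buffer at the recorded offset.
    union {
        struct { long stride; int blocklength; } vector;
        struct { long lookaside_offset; } indexed;   // -> blocklengths, displacements
        struct { long lookaside_offset; } strct;     // -> blocklengths, displacements, child type IDs
        struct { int dims; long lookaside_offset; } subarray; // -> starts, sizes, subsizes
    } u;
};
```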
Datatype Processing Summary by Type
- Contiguous (chunk of datatypes): O(1)
- Vector (strided array of datatypes): O(1)
- Subarray (n-dimensional matrix): O(n), to compute the flat array index
- Indexed, struct (b pairs of blocklengths and displacements): O(log b), via binary search
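The O(log b) entry for indexed and struct types corresponds to a binary search over per-block element counts. A minimal device-side sketch, assuming a host-precomputed prefix-sum array of block lengths (our construction, not necessarily the paper's):

```cuda
// Find the block containing primitive element 'elem'.
// blocklen_prefix[i] = total primitives in blocks 0 .. i-1 (prefix[0] = 0),
// so block i holds elements in [prefix[i], prefix[i+1]).
__device__ int find_block(const long *blocklen_prefix, int num_blocks, long elem)
{
    int lo = 0, hi = num_blocks - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (blocklen_prefix[mid] <= elem) lo = mid;   // element is at or past block mid
        else                              hi = mid - 1;
    }
    return lo;
}
```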
Benchmarking: Datatypes and CUDA Baselines
- 2D-vector*: cudaMemcpy2D (common case)
- 4D-subarray: iterative cudaMemcpy3D (cannot be represented by a vector type)
- Indexed*: one cudaMemcpy per block (irregular; does not map well to CUDA)
- C-style struct: one cudaMemcpy per extent (thread branch divergence on read/write)
* aligned on CUDA-optimized byte boundaries
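For reference, the cudaMemcpy2D baseline for the 2D-vector case amounts to a single strided DMA copy. The sketch below is illustrative, with parameter names of our choosing rather than the paper's benchmark code.

```cuda
#include <cuda_runtime.h>

// Pack 'count' blocks of 'blocklength' doubles, separated by 'stride'
// doubles in the source, into a contiguous device buffer via the DMA engine.
cudaError_t pack_vector_dma(double *d_dst, const double *d_src,
                            int count, int blocklength, int stride)
{
    return cudaMemcpy2D(d_dst, blocklength * sizeof(double),   // dst, dst pitch
                        d_src, stride * sizeof(double),        // src, src pitch
                        blocklength * sizeof(double), count,   // width (bytes), rows
                        cudaMemcpyDeviceToDevice);
}
```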
Comparison with CUDA DMA
- Vector: result depends on parameterization
- Subarray: latency aggregates over the iterated CUDA calls
- Irregular types: huge speedup (no reasonable CUDA equivalent)
- Overhead: a few microseconds
Comparison: Vector Parameterizations
- CUDA performs best on multiples of 64 bytes and poorly otherwise; the same goes for the vector stride (not shown).
Comparison: Vector Communication (Ping-Pong)
- CUDA DMA is best for small buffers with large blocks (note: data laid out to be CUDA-optimal).
- The packing kernel is best for larger buffers with small blocks.
- MVAPICH serializes packing and the PCIe transfer; ours is fully pipelined through zero-copy.
- This performance does not carry over to other datatypes, since it is tied to CUDA DMA.
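The zero-copy pipelining mentioned above relies on the packing kernel writing directly into pinned, device-mapped host memory, so packing and the PCIe transfer overlap. A minimal setup sketch (names are ours; error handling omitted):

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Allocate a pinned host buffer that the GPU can address directly.
// Returns the host pointer; *d_alias is the device-visible alias to pass
// to the packing kernel as its output buffer.
char *alloc_zero_copy_target(size_t bytes, char **d_alias)
{
    char *h_buf = NULL;
    cudaSetDeviceFlags(cudaDeviceMapHost);                  // must precede context creation
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)d_alias, h_buf, 0);
    return h_buf;
}
```

Each packed element then crosses PCIe as it is produced, instead of waiting for the whole pack to finish before a separate transfer begins.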
Resource Contention
Potential contention points when packing on the GPU: the PCIe link (full duplex), the SMs, and possibly the CPU.
[Figure: CPU and GPU connected by PCIe, with contention points marked at PCIe and the SMs, and a questioned contention point at the CPU.]
Resource Contention Results (does packing contend with concurrent work?)
- Packing kernel: SMs: yes; CPU -> GPU transfers: somewhat*; GPU -> CPU transfers: yes**
- CUDA DMA: SMs: no; CPU -> GPU transfers: somewhat*; GPU -> CPU transfers: yes
* shouldn't happen (scheduler artifact?)
** PCIe transactions driven by the SMs (zero-copy) are treated more favorably
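To make these scenarios concrete, a probe along the following lines issues kernel-based packing and an independent DMA transfer on separate CUDA streams, so each can be timed alone and together. This is our illustration, reusing the pack_vector kernel and VectorType struct from the vector-offset sketch; it is not the paper's benchmark.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Run the packing kernel (uses SMs) concurrently with an unrelated
// device-to-host copy (uses the PCIe DMA engine). All names other than
// the CUDA API are placeholders.
void run_contention_probe(const char *d_src, char *d_dst, VectorType t, int n,
                          void *h_buf, const void *d_other, size_t bytes)
{
    cudaStream_t s_pack, s_copy;
    cudaStreamCreate(&s_pack);
    cudaStreamCreate(&s_copy);

    pack_vector<<<(n + 255) / 256, 256, 0, s_pack>>>(d_src, d_dst, t, n);
    cudaMemcpyAsync(h_buf, d_other, bytes, cudaMemcpyDeviceToHost, s_copy);

    cudaStreamSynchronize(s_pack);
    cudaStreamSynchronize(s_copy);
    cudaStreamDestroy(s_pack);
    cudaStreamDestroy(s_copy);
}
```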
Results Summary
- Use kernel-based packing for: irregular types (order-of-magnitude speedup), large sparse transfers (order-of-magnitude speedup), and types that do not adhere to CUDA-optimized memory layouts.
- Use CUDA DMA for: small transfers, and 2D/3D arrays with large contiguous chunks.
- Use hand-coded packing kernels for: small, simple types.
Datatype implementations can control for these cases; packing is complementary rather than competing.