Designing In-network Computing Aware Reduction Collectives in MPI


In this presentation at SC'23, discover how in-network computing optimizes MPI reduction collectives for HPC/DL applications. Explore the SHARP protocol for hierarchical aggregation and reduction, shared-memory collectives, and the benefits of offloading operations to network devices. Learn about modern HPC cluster architectures leveraging high-performance interconnects, multi-/many-core processors, accelerators, and RDMA-enabled networking for enhanced efficiency and scalability.

  • In-Network Computing
  • MPI Collectives
  • HPC Architecture
  • SHARP Protocol
  • Offloading Operations




Presentation Transcript


  1. Designing In-network Computing Aware Reduction Collectives in MPI
     Presentation at the OSU booth at SC '23
     Bharath Ramesh, The Ohio State University, ramesh.113@osu.edu
     Follow us on https://twitter.com/mvapich

  2. Introduction: Drivers of Modern HPC Cluster Architectures
     • High-performance interconnects: InfiniBand (<1 usec latency, 200-400 Gbps bandwidth)
     • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, RoCE, Slingshot)
     • Multi-/many-core processors: high compute density, high performance/watt (>9.7 TFlop DP on a chip)
     • Accelerators (NVIDIA GPGPUs)
     • Storage: Solid State Drives (SSDs), NVMe-SSDs, Non-Volatile Random-Access Memory (NVRAM)
     • Example systems: Summit, Frontier, Lumi, Fugaku

  3. MPI Reduction Collectives and In-Network Computing
     • Reduction collectives (such as MPI_Allreduce) are important for HPC/DL workloads and involve both computation and communication
     • Performing everything on the CPU leads to sub-optimal scale-up and scale-out efficiency, motivating the offload of common operations away from the CPU so it can perform other useful work
     • In-network computing allows operations to be offloaded to network devices
     • Switches are good candidates: high bandwidth and the ability to reduce data on the fly, eliminating redundancy
     • High scale-up efficiency and network-topology awareness
     • Frees up CPU cycles for other operations
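For reference, a minimal MPI_Allreduce call of the kind discussed on this slide looks like the following; this is plain MPI and does not depend on any particular SHARP-enabled library:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes one value; every rank receives the global sum.
     * With in-network computing, the summation can be offloaded to switches. */
    double local = (double)rank;
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```

In libraries such as MVAPICH2-X, the same call can typically be routed to the SHARP-based designs at run time without application-level changes.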

  4. What is SHARP?
     • Scalable Hierarchical Aggregation and Reduction Protocol
     • Advantages: progress and offload of computation and communication
     • Focuses on low latency for small messages and maximal bandwidth utilization for large messages
     • Hierarchical aggregation in a logical tree (LLT) provides low latency for small messages
     • Streaming aggregation with pipelined ring-based algorithms for large messages
     • In-network computing: focus on creating groups that can execute operations simultaneously

  5. Shared Memory Collectives
     • Two-copy mechanism over a cache-aligned shared region
     • Created using the mmap system call
     • Needs synchronization in multi-process scenarios
     • Effective for relatively small message sizes
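The following is a minimal sketch of one way such a cache-aligned shared region might be set up with shm_open() and mmap(); the segment name, layout, and 64-byte cache-line size are illustrative assumptions, not the actual MVAPICH implementation:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_LINE 64                      /* assumed cache-line size */
#define SHM_NAME   "/coll_shm_demo"        /* illustrative segment name */

typedef struct {
    volatile int flag;                     /* per-process synchronization flag */
    char pad[CACHE_LINE - sizeof(int)];    /* pad to a cache line (avoid false sharing) */
} shm_flag_t;

/* Map a region holding one padded flag plus one data slot per local process. */
void *create_shared_region(int local_rank, int ppn, size_t data_bytes)
{
    size_t size = (size_t)ppn * (sizeof(shm_flag_t) + data_bytes);

    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    /* Only the node leader sizes the segment; a real implementation would
     * synchronize so other ranks map only after ftruncate has completed. */
    if (local_rank == 0 && ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (base == MAP_FAILED) ? NULL : base;
}
```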

  6. Design for Small-Message Reductions
     • Initialize the SHARP reduce spec and groups for the node-level leaders
     • First, perform a shared-memory reduce to a root process on each node
     • For the inter-node step, call allreduce over the leader communicator; the root process receives into recvbuf
     • For MPI_Reduce, stop here and discard the results on non-root processes
     • For MPI_Allreduce, perform an intra-node broadcast
     • Uses the existing SHARP allreduce primitive, since reduce hasn't been implemented in the SHARP APIs
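A rough sketch of this leader-based scheme, expressed with plain MPI calls, is shown below. MPI_Allreduce on the leader communicator stands in for the SHARP allreduce primitive, and node_comm/leader_comm are assumed to be pre-built (e.g., via MPI_Comm_split_type and MPI_Comm_split):

```c
#include <mpi.h>

int small_msg_reduction(const void *sendbuf, void *recvbuf, int count,
                        MPI_Datatype dtype, MPI_Op op,
                        MPI_Comm node_comm, MPI_Comm leader_comm,
                        int is_allreduce)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 1: shared-memory (intra-node) reduce to the node leader (rank 0). */
    MPI_Reduce(sendbuf, recvbuf, count, dtype, op, 0, node_comm);

    /* Step 2: inter-node step over the node leaders only. In the design above
     * this is the SHARP allreduce; MPI_Allreduce is a placeholder here. */
    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, dtype, op, leader_comm);

    /* Step 3: for MPI_Allreduce semantics, broadcast the result in the node;
     * for MPI_Reduce, stop here and non-root results are simply unused. */
    if (is_allreduce)
        MPI_Bcast(recvbuf, count, dtype, 0, node_comm);

    return MPI_SUCCESS;
}
```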

  7. Design for MPI_Barrier
     • Similar to the implementation for MPI_Allreduce
     • First, each socket-level leader uses flags to sense the arrival of the other processes in its socket
     • Second, the node-level leader senses the arrival of the socket-level leaders
     • Third, a full pairwise barrier exchange or SHARP barrier is executed for the inter-node step
     • Fourth, the node-level leader uses flags to notify the socket-level leaders
     • Finally, each socket-level leader uses flags to notify the other processes in its socket
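A simplified sketch of a flag-based hierarchical barrier in this spirit is given below. For brevity it collapses the socket level into a single node-level leader, uses MPI_Barrier as a placeholder for the inter-node pairwise exchange or SHARP barrier, and relies on volatile spinning (a real implementation would add proper memory barriers or atomics):

```c
#include <mpi.h>

typedef struct {
    volatile int arrive;
    volatile int release;
    char pad[56];                          /* pad to one 64-byte cache line */
} barrier_flag_t;

void hier_barrier(barrier_flag_t *flags,   /* one slot per local rank, in shared memory */
                  int local_rank, int ppn,
                  MPI_Comm leader_comm)    /* communicator of node leaders */
{
    static int epoch = 0;                  /* per-process round counter */
    epoch++;

    if (local_rank == 0) {
        /* Leader senses arrival of every other local process. */
        for (int p = 1; p < ppn; p++)
            while (flags[p].arrive != epoch) { /* spin */ }

        /* Inter-node step (placeholder for pairwise/SHARP barrier). */
        MPI_Barrier(leader_comm);

        /* Leader notifies the local processes. */
        for (int p = 1; p < ppn; p++)
            flags[p].release = epoch;
    } else {
        flags[local_rank].arrive = epoch;
        while (flags[local_rank].release != epoch) { /* spin */ }
    }
}
```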

  8. Results for MPI_Reduce: Varying Message Sizes
     • Configurations: 1 ppn on 7,861 nodes and 16 ppn on 1,024 nodes
     • Scaling with message size, max latency
     • Close to a flat curve across message sizes up to 2K

  9. Results for MPI_Reduce: Varying Node Counts
     • Configurations: 1 ppn and 16 ppn
     • Scaling with increasing node counts, 16-byte messages, max latency
     • Up to 5.4X improvement for 1 ppn at 7,861 nodes
     • Up to 2.6X improvement for 16 ppn at 1,024 nodes

  10. Results for MPI_Allreduce: Varying Message Sizes
     • Configurations: 1 ppn on 7,861 nodes and 16 ppn on 1,024 nodes
     • Scaling with message size, average latency
     • Close to a flat curve across message sizes up to 2K

  11. Results for MPI_Allreduce: Varying Node Counts
     • Configurations: 1 ppn on 7,861 nodes and 16 ppn on 1,024 nodes
     • Scaling with increasing node counts, 16-byte messages, average latency
     • Same trends as MPI_Reduce (the implementations are nearly identical except for the intra-node broadcast phase)

  12. SHARP Reduction Trees and Streaming Aggregation (SAT)
     • Aggregation tree with switch-level reduction (radix 16)
     • Images taken from: Graham, Richard et al., "Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation." DOI: 10.1007/978-3-030-50743-5_3

  13. Large-Message Reduction Collectives
     • Prior work on reduction collectives with SHARP used leader-based schemes: a local reduction, followed by a SHARP operation, and finally a broadcast
     • Not suitable for message sizes >= 8K
     • Single-copy schemes are very efficient for large-message data movement
     • XPMEM allows a remote process to have load/store access through address-space mapping
     • Motivates large-message reduction designs that combine the advantages of SHARP and single-copy schemes such as XPMEM
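For context, a minimal XPMEM usage sketch is shown below. It assumes the xpmem kernel module and user-space headers are available, keeps error handling minimal, and the segid would be exchanged between local processes out of band (e.g., over shared memory); many real implementations expose the whole address space once rather than one buffer at a time:

```c
#include <xpmem.h>
#include <stddef.h>

/* Owner: expose a buffer so peers on the same node can load/store it. */
xpmem_segid_t expose_buffer(void *buf, size_t size)
{
    return xpmem_make(buf, size, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Peer: map the owner's exposed buffer into the local address space and
 * return a pointer usable for direct load/store access. */
void *attach_buffer(xpmem_segid_t segid, size_t size)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
    if (apid == -1)
        return NULL;

    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *ptr = xpmem_attach(addr, size, NULL);
    return (ptr == (void *)-1) ? NULL : ptr;
}
```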

  14. Proposed Design Overview
     • Use a registration cache to amortize registration costs in the SHARP runtime
     • Designate a leader process on each node to interact with SHARP
     • Chunk the buffer into PPN (processes per node) chunks and reduce into a single buffer belonging to the leader process, using XPMEM for load/store access
     • All processes perform local reductions, but only the leader process calls the SHARP runtime
     • Once the local reductions are complete, the leader calls a non-blocking MPI_Allreduce
     • Perfect overlap of the intra-node and inter-node steps; local reduction happens in batches for good network utilization
     • The final result is broadcast within the node
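The sketch below illustrates the chunked, leader-based flow described above. Assumptions: recvbuf already holds each rank's contribution (in-place), peer_bufs[r] is an XPMEM mapping of local rank r's recvbuf (peer_bufs[0] being the leader's), and MPI_Iallreduce over leader_comm stands in for the SHARP call; the batching/pipelining of chunks that gives intra-/inter-node overlap is omitted for brevity:

```c
#include <mpi.h>

void large_msg_allreduce(double *recvbuf, double **peer_bufs,
                         int count, int ppn, int local_rank,
                         MPI_Comm node_comm, MPI_Comm leader_comm)
{
    double *leader_buf = peer_bufs[0];            /* leader's buffer, load/store access */
    int chunk = (count + ppn - 1) / ppn;          /* one chunk per local rank */
    int off   = local_rank * chunk;
    int len   = (off >= count) ? 0 : ((off + chunk <= count) ? chunk : count - off);

    /* Each local rank reduces its assigned chunk of every peer's buffer into
     * the leader's buffer (single copy, via XPMEM load/store access). */
    for (int p = 1; p < ppn && len > 0; p++)
        for (int i = 0; i < len; i++)
            leader_buf[off + i] += peer_bufs[p][off + i];
    MPI_Barrier(node_comm);                       /* stand-in for flag-based sync */

    /* Only the leader interacts with the network runtime. */
    if (local_rank == 0) {
        MPI_Request req;
        MPI_Iallreduce(MPI_IN_PLACE, leader_buf, count, MPI_DOUBLE, MPI_SUM,
                       leader_comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);        /* other work could overlap here */
    }

    /* Final result is broadcast within the node into each rank's recvbuf. */
    MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, node_comm);
}
```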

  15. Registration Cache Design
     • Use an AVL tree (or similar structure) keyed by buffer address: O(log n) insertion/query time
     • If an entry for the buffer already exists, the registration information is fetched directly from the cache
     • Up to 5.6X reduction in latency
     [Figure: Impact of registration cache designs, latency (us) vs. message size (16K to 8M bytes), comparing SAT-with-registration-cache and SAT-without-registration-cache]
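One possible shape of such a registration cache is sketched below, using the POSIX tsearch()/tfind() tree routines as the "AVL tree or similar" ordered structure (glibc backs these with a balanced tree, giving O(log n) lookups and insertions). sharp_register() and reg_handle_t are hypothetical stand-ins for the real SHARP registration interface:

```c
#include <search.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { void *handle; } reg_handle_t;        /* illustrative type */

typedef struct {
    void        *addr;                                /* registered buffer start */
    size_t       len;
    reg_handle_t handle;                              /* runtime registration */
} reg_entry_t;

static void *cache_root = NULL;                       /* tree root */

static int cmp_entry(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)((const reg_entry_t *)a)->addr;
    uintptr_t y = (uintptr_t)((const reg_entry_t *)b)->addr;
    return (x > y) - (x < y);
}

reg_handle_t sharp_register(void *addr, size_t len);  /* hypothetical, expensive */

/* Return a cached registration for 'addr' if present; otherwise register the
 * buffer once and cache the handle so later collectives skip the cost. */
reg_handle_t get_registration(void *addr, size_t len)
{
    reg_entry_t key = { .addr = addr, .len = len };

    void *node = tfind(&key, &cache_root, cmp_entry);            /* O(log n) */
    if (node)
        return (*(reg_entry_t **)node)->handle;

    reg_entry_t *e = malloc(sizeof(*e));
    e->addr   = addr;
    e->len    = len;
    e->handle = sharp_register(addr, len);
    tsearch(e, &cache_root, cmp_entry);                          /* O(log n) */
    return e->handle;
}
```

Keying by start address keeps lookups cheap; a full cache would also track buffer lengths and overlaps, and invalidation when buffers are freed is typically handled with memory-allocation hooks.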

  16. Experimental Setup
                             MRI                                       HPCAC
     Processor model         AMD EPYC 7713                             Intel(R) Xeon(R) Gold 6138
     Max clock speed         3.72 GHz                                  2 GHz
     Number of sockets       2                                         2
     Cores per socket        64                                        20
     RAM                     256 GB                                    196 GB
     Interconnect            NVIDIA HDR-200 with Quantum 2 switches    NVIDIA HDR-200 with Quantum 2 switches
     MPI libraries           MVAPICH2-X, HPC-X                         MVAPICH2-X, HPC-X

  17. MPI_Allreduce on MRI (8 nodes, 32 PPN and 64 PPN)
     • Increased parallelism by using multiple processes together with SHARP for the reduction
     • Up to 78.36% improvement over the state of the art for 64 PPN and 79.44% for 32 PPN

  18. MPI_Allreduce on HPCAC (8 nodes, 32 PPN and 64 PPN)
     • Increased parallelism by using multiple processes together with SHARP for the reduction
     • Up to 52.13% improvement over the state of the art for 64 PPN and 58.08% for 32 PPN

  19. Conclusion and Future Work
     • The SHARP runtime enables in-network offload with excellent bandwidth utilization
     • The small-message designs can be used in MVAPICH 3.0/MVAPICH 2.3.7
     • The large-message designs will be available in a future release of MVAPICH-plus

  20. THANK YOU!
     Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
     The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
     The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
     The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
