Enhancing HPC Performance with Broadcom RoCE MPI Library


This presentation covers the optimization of MPI communication operations for high-performance computing applications using Broadcom RoCE technology. It discusses the benefits of RoCE for HPC, the goal of a highly optimized MPI library for Broadcom RoCEv2, and gives an overview of the MVAPICH2 project, a high-performance open-source MPI library that supports a wide range of interconnects and platforms.



Presentation Transcript


  1. High Performance & Scalable MPI Library over Broadcom RoCE
     Mustafa Abduljabbar (abduljabbar.1@osu.edu), 11/12/2023
     Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

  2. Introduction
     Reference: https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/RDMA-over-Converged-Ethernet.html

  3. Why RoCE for HPC?
     - Enhanced performance: RoCE delivers significantly lower latency and higher throughput than traditional Ethernet, empowering HPC applications to achieve peak performance.
     - Optimized efficiency: RoCE offloads RDMA operations to specialized hardware, reducing CPU overhead and freeing up valuable processing resources for critical computational tasks.
     - Cost-effective solution: RoCE leverages existing Ethernet infrastructure, minimizing upfront investment and simplifying network management.
     - Scalable and flexible: RoCE supports a range of Ethernet speeds and Layer 3 routing, enabling scalability and flexibility to meet evolving HPC demands.
     - Emerging HPC interconnect standard: RoCE is gaining widespread adoption in the HPC community, recognized for its ability to meet the stringent performance requirements of next-generation HPC systems.

  4. Goal: Highly Optimized MPI for Broadcom RoCEv2
     1. MVAPICH-CPU release: optimizing MPI communication operations on new-generation Broadcom adapters
        - We provide support for newer-generation Broadcom network adapters (Thor, 200 Gbps) in MVAPICH2 and optimize the communication protocols (RC, UD, Hybrid).
        - The focus is on point-to-point (two-sided) operations and frequently used collective operations such as Allreduce and Alltoall (both shown in the MPI sketch after this slide).
        - The benefits of these designs will be studied at the application level.
        - These design changes will be incorporated into the future MVAPICH release.
     2. MVAPICH-GPU release: exploring the GPUDirect capabilities of new Broadcom adapters for high-performance data transfers to/from GPU device memory
        - Broadcom has introduced support for GPUDirect RDMA to enable high-performance communication operations from device memory.
        - We study and evaluate the performance of Broadcom's GPUDirect technology with Thor adapters.
        - We explore designs in MVAPICH2-GDR for accelerating relevant portions of device-based communication operations using GPUDirect technology with Thor adapters.
        - The focus is on intra-node and inter-node point-to-point operations and commonly used collectives (Allreduce and Alltoall).
        - The designs will be incorporated into the future MVAPICH2-GDR release.
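For reference, the two collectives singled out above are standard MPI calls; a minimal, self-contained sketch (the buffer sizes and the SUM operation are illustrative choices, not taken from the slides):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allreduce: every rank contributes a vector; all ranks receive the sum. */
    double local[1024], global[1024];
    for (int i = 0; i < 1024; i++) local[i] = (double)rank;
    MPI_Allreduce(local, global, 1024, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Alltoall: every rank exchanges a distinct block with every other rank. */
    int *sendbuf = malloc(size * sizeof *sendbuf);
    int *recvbuf = malloc(size * sizeof *recvbuf);
    for (int i = 0; i < size; i++) sendbuf[i] = rank * size + i;
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```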

  5. Overview of the MVAPICH2 Project
     - High-performance open-source MPI library
     - Support for multiple interconnects: InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, OPX, Broadcom RoCE, Intel Ethernet, Rockport Networks, Slingshot 10/11
     - Support for multiple platforms: x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD)
     - Started in 2001; first open-source version demonstrated at SC '02
     - Supports the latest MPI-3.1 standard directly: http://mvapich.cse.ohio-state.edu
     - Used by more than 3,325 organizations in 90 countries; more than 1.73 million downloads from the OSU site
     - Empowering many TOP500 clusters (Nov '23 ranking):
       - 11th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
       - 29th: 448,448-core Frontera at TACC
       - 46th: 288,288-core Lassen at LLNL
       - 61st: 570,020-core Nurion in South Korea, and many others
     - Additional optimized versions for different systems/environments:
       - MVAPICH2-X (Advanced MPI + PGAS), since 2011
       - MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
       - MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
       - MVAPICH2-Virt with virtualization support, since 2015
       - MVAPICH2-EA with support for energy awareness, since 2015
       - MVAPICH2-Azure for Azure HPC IB instances, since 2019
       - MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
     - Tools: OSU MPI Micro-Benchmarks (OMB), since 2003; OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
     - Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
     - Partner in the 21st-ranked TACC Frontera system
     - Empowering Top500 systems for more than 16 years

  6. Overview
     - Introduction
     - Performance Characterization: MPI performance overheads vs. IB level
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  7. Cluster Setup
     [Cluster configuration table, courtesy of Dell Technologies.]

  8. RDMA Protocol Performance Characterization on 100 GbE
     [Charts: point-to-point latency (microseconds) vs. message size (bytes) for the RC and UD transports, comparing the IB-verbs level against MVAPICH2; annotated points at 3.82 us, 3.93 us, and 5.34 us.]
     - MVAPICH2 is on par with IB-verbs.
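These latency curves come from ping-pong microbenchmarks in the osu_latency style; a minimal sketch of such a measurement loop, assuming two ranks, with the message size and iteration counts chosen arbitrarily:

```c
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000
#define SKIP  100     /* warm-up iterations excluded from timing */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[4096] = {0};      /* one message size; osu_latency sweeps many */
    double t0 = 0.0;

    for (int i = 0; i < ITERS + SKIP; i++) {
        if (i == SKIP) t0 = MPI_Wtime();   /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* One-way latency = half the round trip, averaged over iterations. */
        double usec = (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS);
        printf("latency: %.2f us\n", usec);
    }
    MPI_Finalize();
    return 0;
}
```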

  9. MPI-level Overhead: Point-to-Point Latency
     [Charts: MVAPICH2 (MV2) overhead over IB-verbs (nanoseconds) vs. message size (bytes) for the RC and UD transports; annotated overheads of 140 ns, 1520 ns, and 2420 ns for small-to-medium messages and 16.86 us at large message sizes.]

  10. Overview
     - Introduction
     - Performance Characterization: MPI performance overheads vs. IB level
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  11. Performance Optimization
     High-level description of optimization efforts:
     - Added point-to-point and collective tuning tables for up to 64 nodes x 128 PPN, based on the Dell Bluebonnet (CPU) and Rattler2 (GPU) systems
     - Enhanced UD+RC hybrid transport mode tuned for the Broadcom adapter
     - Optimized default CPU mapping policy
     - Support for asynchronous threading progress
     - Startup optimization
     - Point-to-point message coalescing
     - SGL packetized eager communication

  12. UD + RC Transport Protocol Analysis
     - RC has better performance than UD in most cases.
     - UD becomes the exclusive choice at large scale (e.g., alltoall with >= 16 nodes).
     - Tuned hybrid transport mode: use RC for small scales and message sizes; use UD for the other cases (a conceptual sketch of such a rule follows this slide).
     [Charts: osu_latency for MVAPICH2 RC vs. MVAPICH2 UD, plus Allreduce and osu_alltoall latency at 16 nodes, 128 PPN.]
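To make the hybrid policy concrete, here is a hypothetical selection rule of the kind the slide describes; the function name and thresholds are invented for illustration and are not MVAPICH2 internals:

```c
/* Hypothetical illustration of a hybrid UD+RC selection rule: RC (connected,
 * reliable) at small job scales and message sizes, UD (connectionless,
 * constant per-process memory footprint) once the node count or message
 * volume grows. The thresholds below are made up for illustration. */
#include <stddef.h>

typedef enum { TRANSPORT_RC, TRANSPORT_UD } transport_t;

static transport_t choose_transport(int num_nodes, size_t msg_size)
{
    const int    NODE_THRESHOLD = 16;    /* e.g., alltoall at >= 16 nodes */
    const size_t SIZE_THRESHOLD = 8192;

    if (num_nodes < NODE_THRESHOLD && msg_size < SIZE_THRESHOLD)
        return TRANSPORT_RC;   /* RC wins at small scale and small sizes */
    return TRANSPORT_UD;       /* UD avoids per-peer QP state at scale  */
}
```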

  13. Performance Optimization
     High-level description of optimization efforts:
     - Added point-to-point and collective tuning tables
     - Enhanced UD+RC hybrid transport mode tuned for the Broadcom adapter
     - Optimized default CPU mapping policy: made the hybrid spread CPU mapping policy the default
     - Support for asynchronous threading progress
     - UD startup optimization
     - Point-to-point message coalescing
     - SGL packetized eager communication

  14. UD Startup Optimization
     [Charts: UD startup time (ms) vs. PPN (1-128) on 4 nodes and 64 nodes, comparing ud-head against ud-fix.]
     - Up to 2.3x faster UD startup at the small 4-node scale
     - Up to 2.1x faster UD startup at the large 64-node scale

  15. Pt-to-Pt Coalescing Performance
     [Charts: osu_bw and osu_bibw bandwidth (MB/s) vs. message size (1 B - 64 KB), coalescing enabled vs. disabled.]
     - Enabling/disabling coalescing has a clear impact on bandwidth (BW) and bidirectional bandwidth (BiBW).
     - Coalescing is effective up to 1 KB message size.
     - Up to 1.6x higher bandwidth and 2.7x higher bidirectional bandwidth with medium-sized messages (a conceptual sketch of coalescing follows this slide).
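Conceptually, point-to-point coalescing packs several small eager messages destined for the same peer into one buffer and posts them as a single network transfer, amortizing per-packet overheads; a simplified, hypothetical sketch (not MVAPICH2's actual data structures):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical coalescing buffer: small eager messages to the same peer are
 * packed back-to-back and flushed as one transfer when the buffer fills or
 * the sender runs out of outstanding work. Sizes are illustrative. */
#define COALESCE_BUF_SIZE 16384

struct coalesce_buf {
    char   data[COALESCE_BUF_SIZE];
    size_t used;
};

/* Returns 1 if the message was packed, 0 if the buffer must be flushed
 * (sent on the wire and reset) before this message can be retried. */
static int coalesce_pack(struct coalesce_buf *cb, const void *msg, size_t len)
{
    if (cb->used + len > COALESCE_BUF_SIZE)
        return 0;                      /* full: caller flushes, then retries */
    memcpy(cb->data + cb->used, msg, len);
    cb->used += len;
    return 1;
}
```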

  16. Single-Pair Message Rate - Coalescing Performance (osu_mbw_mr)
     Test name: single-pair bandwidth and message rate test.
     Evaluation focus: aggregate unidirectional bandwidth and message rate.
     Participants: 1 process per node.
     Sending process behavior: sends a fixed number of messages (the window size) back-to-back to the paired receiving process, then waits for a reply from the receiver; repeated for several iterations (see the sender-side sketch after this slide).
     [Chart: point-to-point message rate (x1000 messages/s) vs. message size at 1 PPN, coalescing enabled vs. disabled.]
     - Enabling/disabling coalescing has an impact on bandwidth and message rate; coalescing is effective up to 1 KB message size, yielding up to 1.5x higher bandwidth within that range.
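The windowed send pattern described on this slide maps naturally onto nonblocking MPI calls; a minimal sketch of the sender side, with a window size chosen arbitrarily:

```c
#include <mpi.h>

#define WINDOW 64   /* messages in flight per iteration (illustrative) */

/* Sender side of an osu_mbw_mr-style iteration: post a window of
 * back-to-back nonblocking sends, complete them, then wait for a short
 * reply from the paired receiver before starting the next window. */
static void send_one_window(char *buf, int len, int peer)
{
    MPI_Request reqs[WINDOW];
    char ack;

    for (int i = 0; i < WINDOW; i++)
        MPI_Isend(buf, len, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[i]);
    MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);

    /* The receiver replies once it has drained the whole window. */
    MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```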

  17. SGL Packetized Eager Communication (100 GbE)
     [Charts: osu_alltoall latency (us) vs. message size (1 B - 1 KB) at 2 x 128 processes, SGL enabled vs. disabled, on linear and log scales.]
     - Reduces alltoall latency by up to 16% for 4-byte messages.
     - Enabled by adding the MV2_USE_EAGER_SGL=1 runtime parameter; enabled by default for 1 B - 1 KB message sizes.
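A scatter-gather list (SGL) lets the NIC assemble one packet from several noncontiguous buffers (e.g., a protocol header plus the user payload) in a single work request, avoiding an intermediate copy on the eager path; a minimal ibverbs sketch, assuming the queue pair and registered memory regions are created elsewhere:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one send whose payload the NIC gathers from two separate buffers.
 * The queue pair (qp) and the lkeys of the registered memory regions are
 * assumed to be set up elsewhere; only the SGL mechanics are shown. */
static int post_sgl_send(struct ibv_qp *qp,
                         void *hdr, uint32_t hdr_len, uint32_t hdr_lkey,
                         void *payload, uint32_t pay_len, uint32_t pay_lkey)
{
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)hdr,     .length = hdr_len, .lkey = hdr_lkey },
        { .addr = (uintptr_t)payload, .length = pay_len, .lkey = pay_lkey },
    };
    struct ibv_send_wr wr = {
        .sg_list    = sge,
        .num_sge    = 2,                 /* gather both pieces in one packet */
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```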

  18. Overview
     - Introduction
     - Performance Characterization
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  19. Performance Evaluation - Micro-benchmarks
     Experiment results from Dell Bluebonnet, comparing MVAPICH2 2.3.x-broadcom against MVAPICH2 2.3.7 and OpenMPI:
     - Up to 20% reduction in small-message point-to-point latency
     - From a 0.1x up to a 2x increase in bandwidth
     - Up to 12.4x lower MPI_Allreduce latency
     - Up to 5x lower MPI_Scatter latency
     [Charts: osu_latency (small messages), osu_bw (large messages), and Allreduce, Scatter, and Alltoall latency at 64 nodes, 128 PPN.]

  20. Performance Evaluation - Micro-benchmarks
     Experiment results from Rattler2, comparing MV2GDR-Opt against MV2GDR and OpenMPI:
     - Up to 53% reduction in medium-message point-to-point latency
     - Up to 2.6x increase in bandwidth
     - Up to 35% reduction in alltoall latency
     [Charts: osu_latency (small messages), osu_bw (large messages), and Alltoall, Allreduce, and Bcast latency at 2 nodes, 4 PPN.]

  21. Performance Evaluation - Applications
     [Charts: execution time (seconds) vs. node count at 128 PPN for OpenFOAM Motorbike (90x36x36, 15.5M cells; 1-16 nodes) and GROMACS benchPEP (1-64 nodes), comparing 2.3.x-broadcom, 2.3.7, and OpenMPI.]
     - Reduces execution time of OpenFOAM Motorbike by up to 45% at the 16-node, 128-PPN scale.
     - Reduces execution time of GROMACS benchPEP by up to 51% at the 64-node, 128-PPN scale.

  22. Performance Evaluation - Applications
     [Charts: execution time (seconds) vs. nodes x PPN at 128 PPN for WRF CONUS 12KM (1x128 to 8x128), CP2K H2O-dft-ls NREP4 (2x128 to 32x128), and WRF CONUS 3KM (8x128 to 64x128), comparing 2.3.x-broadcom / MVAPICH2 against 2.3.7 and OpenMPI.]
     - Reduces execution time of CP2K H2O-dft-ls (NREP4) by up to 45%.
     - Reduces execution time of WRF CONUS 3KM by up to 7%.

  23. Overview
     - Introduction
     - Performance Characterization
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  24. MVAPICH-3.0 Pt-to-Pt Latency (RC) on FW 227 (RHEL 8.8)
     [Charts: small-message (1 B - 16 KB) and large-message (8 KB - 4 MB) latency (us) vs. message size, comparing MV3.0, MV2.3.X, and OMPI.]
     - MVAPICH 3.0 provides competitive point-to-point performance.
     - 9% lower latency at 16 KB message size.

  25. MVAPICH-3.0 Pt-to-Pt Latency (UD) on FW 227 (RHEL 8.8)
     [Charts: small-message (1 B - 8 KB) and large-message (8 KB - 4 MB) latency (us) vs. message size, comparing MV3.0 and MV2.3.X.]
     - MVAPICH 3.0 provides competitive point-to-point UD performance.
     - 28% lower latency at 16 KB message size.

  26. Conclusion & Future Work
     Conclusion:
     - Analyzed MPI overheads vs. IB-level performance on the Broadcom adapter.
     - Demonstrated significant microbenchmark- and application-level gains.
     Future work:
     - Optimize additional applications.
     - Integrate the existing optimizations with MVAPICH-3.0 on Broadcom systems.
     - In progress: MVAPICH-2.3.8 (with enhanced RoCEv2 support).

  27. THANK YOU!
     Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
     The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
     The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
     The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
