Enhancing HPC Performance with Broadcom RoCE MPI Library


This presentation covers the optimization of MPI communication operations for high-performance computing applications using Broadcom RoCE technology. It discusses the benefits of RoCE for HPC, the goal of a highly optimized MPI library for Broadcom RoCEv2, and gives an overview of the MVAPICH2 project, a high-performance open-source MPI library that supports a wide range of interconnects and platforms.



Presentation Transcript


  1. High Performance & Scalable MPI Library over Broadcom RoCE
     Mustafa Abduljabbar (abduljabbar.1@osu.edu), 11/12/2023
     Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

  2. Introduction
     Reference: https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/RDMA-over-Converged-Ethernet.html

  3. Why RoCE for HPC?
     - Enhanced performance: RoCE delivers significantly lower latency and higher throughput than traditional Ethernet, empowering HPC applications to achieve peak performance.
     - Optimized efficiency: RoCE offloads RDMA operations to specialized hardware, reducing CPU overhead and freeing up valuable processing resources for critical computational tasks.
     - Cost-effective solution: RoCE leverages existing Ethernet infrastructure, minimizing upfront investment and simplifying network management.
     - Scalable and flexible: RoCE supports a range of Ethernet speeds and Layer 3 routing, enabling scalability and flexibility to meet evolving HPC demands.
     - Emerging HPC interconnect standard: RoCE is gaining widespread adoption in the HPC community, recognized for its ability to meet the stringent performance requirements of next-generation HPC systems.

  4. Goal: Highly Optimized MPI for Broadcom RoCEv2
     1. MVAPICH-CPU release: optimizing MPI communication operations on new-generation Broadcom adapters
        - We provide support for newer-generation Broadcom network adapters (Thor, 200 Gbps) in MVAPICH2 and optimize the communication protocols (RC, UD, Hybrid).
        - The focus is on point-to-point (two-sided) operations and frequently used collective operations such as Allreduce and Alltoall (both shown in the MPI sketch after this slide).
        - The benefits of these designs will be studied at the application level.
        - These design changes will be incorporated into the future MVAPICH release.
     2. MVAPICH-GPU release: exploring the GPUDirect capabilities of new Broadcom adapters for high-performance data transfers to/from GPU device memory
        - Broadcom has introduced support for GPUDirect RDMA to enable high-performance communication operations from device memory.
        - We study and evaluate the performance of Broadcom's GPUDirect technology with Thor adapters.
        - We explore designs in MVAPICH2-GDR for accelerating relevant portions of device-based communication operations using GPUDirect technology with Thor adapters.
        - The focus is on intra-node and inter-node point-to-point operations and commonly used collectives (Allreduce and Alltoall).
        - The designs will be incorporated into the future MVAPICH2-GDR release.
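For reference, the two collectives singled out above are standard MPI calls; a minimal, self-contained sketch (the buffer sizes and the SUM operation are illustrative choices, not taken from the slides):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allreduce: every rank contributes a vector; all ranks receive the sum. */
    double local[1024], global[1024];
    for (int i = 0; i < 1024; i++) local[i] = (double)rank;
    MPI_Allreduce(local, global, 1024, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Alltoall: every rank exchanges a distinct block with every other rank. */
    int *sendbuf = malloc(size * sizeof *sendbuf);
    int *recvbuf = malloc(size * sizeof *recvbuf);
    for (int i = 0; i < size; i++) sendbuf[i] = rank * size + i;
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```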

  5. Overview of the MVAPICH2 Project
     - High-performance open-source MPI library
     - Support for multiple interconnects: InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, OPX, Broadcom RoCE, Intel Ethernet, Rockport Networks, Slingshot 10/11
     - Support for multiple platforms: x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD)
     - Started in 2001; first open-source version demonstrated at SC '02
     - Supports the latest MPI-3.1 standard directly: http://mvapich.cse.ohio-state.edu
     - Used by more than 3,325 organizations in 90 countries; more than 1.73 million downloads from the OSU site
     - Empowering many TOP500 clusters (Nov '23 ranking):
       - 11th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
       - 29th: 448,448-core Frontera at TACC
       - 46th: 288,288-core Lassen at LLNL
       - 61st: 570,020-core Nurion in South Korea, and many others
     - Additional optimized versions for different systems/environments:
       - MVAPICH2-X (Advanced MPI + PGAS), since 2011
       - MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
       - MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
       - MVAPICH2-Virt with virtualization support, since 2015
       - MVAPICH2-EA with support for energy awareness, since 2015
       - MVAPICH2-Azure for Azure HPC IB instances, since 2019
       - MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
     - Tools: OSU MPI Micro-Benchmarks (OMB), since 2003; OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
     - Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
     - Partner in the 21st-ranked TACC Frontera system
     - Empowering Top500 systems for more than 16 years

  6. Overview
     - Introduction
     - Performance Characterization: MPI performance overheads vs. IB level
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  7. Cluster Setup
     [Cluster configuration table, courtesy of Dell Technologies.]

  8. RDMA Protocol Performance Characterization on 100 GbE
     [Charts: point-to-point latency (microseconds) vs. message size (bytes) for the RC and UD transports, comparing the IB-verbs level against MVAPICH2; annotated points at 3.82 us, 3.93 us, and 5.34 us.]
     - MVAPICH2 is on par with IB-verbs.
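These latency curves come from ping-pong microbenchmarks in the osu_latency style; a minimal sketch of such a measurement loop, assuming two ranks, with the message size and iteration counts chosen arbitrarily:

```c
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000
#define SKIP  100     /* warm-up iterations excluded from timing */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[4096] = {0};      /* one message size; osu_latency sweeps many */
    double t0 = 0.0;

    for (int i = 0; i < ITERS + SKIP; i++) {
        if (i == SKIP) t0 = MPI_Wtime();   /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* One-way latency = half the round trip, averaged over iterations. */
        double usec = (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS);
        printf("latency: %.2f us\n", usec);
    }
    MPI_Finalize();
    return 0;
}
```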

  9. MPI-level Overhead: Point-to-Point Latency
     [Charts: MVAPICH2 (MV2) overhead over IB-verbs (nanoseconds) vs. message size (bytes) for the RC and UD transports; annotated overheads of 140 ns, 1520 ns, and 2420 ns for small-to-medium messages and 16.86 us at large message sizes.]

  10. Overview
     - Introduction
     - Performance Characterization: MPI performance overheads vs. IB level
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  11. Performance Optimization
     High-level description of optimization efforts:
     - Added point-to-point and collective tuning tables for up to 64 nodes x 128 PPN, based on the Dell Bluebonnet (CPU) and Rattler2 (GPU) systems
     - Enhanced UD+RC hybrid transport mode tuned for the Broadcom adapter
     - Optimized default CPU mapping policy
     - Support for asynchronous threading progress
     - Startup optimization
     - Point-to-point message coalescing
     - SGL packetized eager communication

  12. UD + RC Transport Protocol Analysis
     - RC has better performance than UD in most cases.
     - UD becomes the exclusive choice at large scale (e.g., alltoall with >= 16 nodes).
     - Tuned hybrid transport mode: use RC for small scales and message sizes; use UD for the other cases (a conceptual sketch of such a rule follows this slide).
     [Charts: osu_latency for MVAPICH2 RC vs. MVAPICH2 UD, plus Allreduce and osu_alltoall latency at 16 nodes, 128 PPN.]
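To make the hybrid policy concrete, here is a hypothetical selection rule of the kind the slide describes; the function name and thresholds are invented for illustration and are not MVAPICH2 internals:

```c
/* Hypothetical illustration of a hybrid UD+RC selection rule: RC (connected,
 * reliable) at small job scales and message sizes, UD (connectionless,
 * constant per-process memory footprint) once the node count or message
 * volume grows. The thresholds below are made up for illustration. */
#include <stddef.h>

typedef enum { TRANSPORT_RC, TRANSPORT_UD } transport_t;

static transport_t choose_transport(int num_nodes, size_t msg_size)
{
    const int    NODE_THRESHOLD = 16;    /* e.g., alltoall at >= 16 nodes */
    const size_t SIZE_THRESHOLD = 8192;

    if (num_nodes < NODE_THRESHOLD && msg_size < SIZE_THRESHOLD)
        return TRANSPORT_RC;   /* RC wins at small scale and small sizes */
    return TRANSPORT_UD;       /* UD avoids per-peer QP state at scale  */
}
```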

  13. Performance Optimization
     High-level description of optimization efforts:
     - Added point-to-point and collective tuning tables
     - Enhanced UD+RC hybrid transport mode tuned for the Broadcom adapter
     - Optimized default CPU mapping policy: made the hybrid spread CPU mapping policy the default
     - Support for asynchronous threading progress
     - UD startup optimization
     - Point-to-point message coalescing
     - SGL packetized eager communication

  14. UD Startup Optimization
     [Charts: UD startup time (ms) vs. PPN (1-128) on 4 nodes and 64 nodes, comparing ud-head against ud-fix.]
     - Up to 2.3x faster UD startup at the small 4-node scale
     - Up to 2.1x faster UD startup at the large 64-node scale

  15. Pt-to-Pt Coalescing Performance
     [Charts: osu_bw and osu_bibw bandwidth (MB/s) vs. message size (1 B - 64 KB), coalescing enabled vs. disabled.]
     - Enabling/disabling coalescing has a clear impact on bandwidth (BW) and bidirectional bandwidth (BiBW).
     - Coalescing is effective up to 1 KB message size.
     - Up to 1.6x higher bandwidth and 2.7x higher bidirectional bandwidth with medium-sized messages (a conceptual sketch of coalescing follows this slide).
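Conceptually, point-to-point coalescing packs several small eager messages destined for the same peer into one buffer and posts them as a single network transfer, amortizing per-packet overheads; a simplified, hypothetical sketch (not MVAPICH2's actual data structures):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical coalescing buffer: small eager messages to the same peer are
 * packed back-to-back and flushed as one transfer when the buffer fills or
 * the sender runs out of outstanding work. Sizes are illustrative. */
#define COALESCE_BUF_SIZE 16384

struct coalesce_buf {
    char   data[COALESCE_BUF_SIZE];
    size_t used;
};

/* Returns 1 if the message was packed, 0 if the buffer must be flushed
 * (sent on the wire and reset) before this message can be retried. */
static int coalesce_pack(struct coalesce_buf *cb, const void *msg, size_t len)
{
    if (cb->used + len > COALESCE_BUF_SIZE)
        return 0;                      /* full: caller flushes, then retries */
    memcpy(cb->data + cb->used, msg, len);
    cb->used += len;
    return 1;
}
```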

  16. Single-Pair Message Rate - Coalescing Performance (osu_mbw_mr)
     Test name: single-pair bandwidth and message rate test.
     Evaluation focus: aggregate unidirectional bandwidth and message rate.
     Participants: 1 process per node.
     Sending process behavior: sends a fixed number of messages (the window size) back-to-back to the paired receiving process, then waits for a reply from the receiver; repeated for several iterations (see the sender-side sketch after this slide).
     [Chart: point-to-point message rate (x1000 messages/s) vs. message size at 1 PPN, coalescing enabled vs. disabled.]
     - Enabling/disabling coalescing has an impact on bandwidth and message rate; coalescing is effective up to 1 KB message size, yielding up to 1.5x higher bandwidth within that range.
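The windowed send pattern described on this slide maps naturally onto nonblocking MPI calls; a minimal sketch of the sender side, with a window size chosen arbitrarily:

```c
#include <mpi.h>

#define WINDOW 64   /* messages in flight per iteration (illustrative) */

/* Sender side of an osu_mbw_mr-style iteration: post a window of
 * back-to-back nonblocking sends, complete them, then wait for a short
 * reply from the paired receiver before starting the next window. */
static void send_one_window(char *buf, int len, int peer)
{
    MPI_Request reqs[WINDOW];
    char ack;

    for (int i = 0; i < WINDOW; i++)
        MPI_Isend(buf, len, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[i]);
    MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);

    /* The receiver replies once it has drained the whole window. */
    MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```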

  17. SGL Packetized Eager Communication (100 GbE)
     [Charts: osu_alltoall latency (us) vs. message size (1 B - 1 KB) at 2 x 128 processes, SGL enabled vs. disabled, on linear and log scales.]
     - Reduces alltoall latency by up to 16% for 4-byte messages.
     - Enabled by adding the MV2_USE_EAGER_SGL=1 runtime parameter; enabled by default for 1 B - 1 KB message sizes.
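A scatter-gather list (SGL) lets the NIC assemble one packet from several noncontiguous buffers (e.g., a protocol header plus the user payload) in a single work request, avoiding an intermediate copy on the eager path; a minimal ibverbs sketch, assuming the queue pair and registered memory regions are created elsewhere:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one send whose payload the NIC gathers from two separate buffers.
 * The queue pair (qp) and the lkeys of the registered memory regions are
 * assumed to be set up elsewhere; only the SGL mechanics are shown. */
static int post_sgl_send(struct ibv_qp *qp,
                         void *hdr, uint32_t hdr_len, uint32_t hdr_lkey,
                         void *payload, uint32_t pay_len, uint32_t pay_lkey)
{
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)hdr,     .length = hdr_len, .lkey = hdr_lkey },
        { .addr = (uintptr_t)payload, .length = pay_len, .lkey = pay_lkey },
    };
    struct ibv_send_wr wr = {
        .sg_list    = sge,
        .num_sge    = 2,                 /* gather both pieces in one packet */
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```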

  18. Overview
     - Introduction
     - Performance Characterization
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  19. Performance Evaluation - Micro-benchmarks
     Experiment results from Dell Bluebonnet, comparing MVAPICH2 2.3.x-broadcom against MVAPICH2 2.3.7 and OpenMPI:
     - Up to 20% reduction in small-message point-to-point latency
     - From a 0.1x up to a 2x increase in bandwidth
     - Up to 12.4x lower MPI_Allreduce latency
     - Up to 5x lower MPI_Scatter latency
     [Charts: osu_latency (small messages), osu_bw (large messages), and Allreduce, Scatter, and Alltoall latency at 64 nodes, 128 PPN.]

  20. Performance Evaluation - Micro-benchmarks
     Experiment results from Rattler2, comparing MV2GDR-Opt against MV2GDR and OpenMPI:
     - Up to 53% reduction in medium-message point-to-point latency
     - Up to 2.6x increase in bandwidth
     - Up to 35% reduction in alltoall latency
     [Charts: osu_latency (small messages), osu_bw (large messages), and Alltoall, Allreduce, and Bcast latency at 2 nodes, 4 PPN.]

  21. Performance Evaluation - Applications
     [Charts: execution time (seconds) vs. node count at 128 PPN for OpenFOAM Motorbike (90x36x36, 15.5M cells; 1-16 nodes) and GROMACS benchPEP (1-64 nodes), comparing 2.3.x-broadcom, 2.3.7, and OpenMPI.]
     - Reduces execution time of OpenFOAM Motorbike by up to 45% at the 16-node, 128-PPN scale.
     - Reduces execution time of GROMACS benchPEP by up to 51% at the 64-node, 128-PPN scale.

  22. Performance Evaluation - Applications
     [Charts: execution time (seconds) vs. nodes x PPN at 128 PPN for WRF CONUS 12KM (1x128 to 8x128), CP2K H2O-dft-ls NREP4 (2x128 to 32x128), and WRF CONUS 3KM (8x128 to 64x128), comparing 2.3.x-broadcom / MVAPICH2 against 2.3.7 and OpenMPI.]
     - Reduces execution time of CP2K H2O-dft-ls (NREP4) by up to 45%.
     - Reduces execution time of WRF CONUS 3KM by up to 7%.

  23. Overview
     - Introduction
     - Performance Characterization
     - Performance Optimization
     - Performance Evaluation: micro-benchmark level, application level
     - MVAPICH 3.0 RC Performance Evaluation

  24. MVAPICH-3.0 Pt-to-Pt Latency (RC) on FW 227 (RHEL 8.8)
     [Charts: small-message (1 B - 16 KB) and large-message (8 KB - 4 MB) latency (us) vs. message size, comparing MV3.0, MV2.3.X, and OMPI.]
     - MVAPICH 3.0 provides competitive point-to-point performance.
     - 9% lower latency at 16 KB message size.

  25. MVAPICH-3.0 Pt-to-Pt Latency (UD) on FW 227 (RHEL 8.8)
     [Charts: small-message (1 B - 8 KB) and large-message (8 KB - 4 MB) latency (us) vs. message size, comparing MV3.0 and MV2.3.X.]
     - MVAPICH 3.0 provides competitive point-to-point UD performance.
     - 28% lower latency at 16 KB message size.

  26. Conclusion & Future Work
     Conclusion:
     - Analyzed MPI overheads vs. IB-level performance on the Broadcom adapter.
     - Demonstrated significant microbenchmark- and application-level gains.
     Future work:
     - Optimize additional applications.
     - Integrate the existing optimizations with MVAPICH-3.0 on Broadcom systems.
     - In progress: MVAPICH-2.3.8 (with enhanced RoCEv2 support).

  27. THANK YOU!
     Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
     The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
     The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
     The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
