Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters

This presentation discusses the use of Remote Direct Memory Access (RDMA) to optimize collective communication in parallel applications. It examines the communication characteristics of parallel applications, RDMA communication models, the benefits of RDMA, and design issues around buffer registration, data validity at the receiver, and buffer reuse, and it highlights the potential of RDMA in modern interconnect standards such as the Virtual Interface Architecture (VIA) and the InfiniBand Architecture (IBA).



Presentation Transcript


  1. Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters
     Rinku Gupta, Dell Computers (Rinku_Gupta@Dell.Com)
     Pavan Balaji, The Ohio State University (balaji@cis.ohio-state.edu)
     Jarek Nieplocha, Pacific Northwest National Lab (jarek.nieplocha@pnl.com)
     Dhabaleswar Panda, The Ohio State University (panda@cis.ohio-state.edu)

  2. Contents
     - Motivation
     - Design Issues
     - RDMA-based Broadcast
     - RDMA-based All Reduce
     - Conclusions and Future Work

  3. Motivation
     - Communication Characteristics of Parallel Applications
       o Point-to-Point Communication: Send and Receive primitives
       o Collective Communication: Barrier, Broadcast, Reduce, All Reduce; built over Send-Receive communication primitives
     - Communication Methods for Modern Protocols
       o Send and Receive Model
       o Remote Direct Memory Access (RDMA) Model

  4. Remote Direct Memory Access
     - Remote Direct Memory Access (RDMA) Model
       o RDMA Write
       o RDMA Read (optional)
     - Widely supported by modern protocols and architectures
       o Virtual Interface Architecture (VIA)
       o InfiniBand Architecture (IBA)
     - Open Questions
       o Can RDMA be used to optimize collective communication? [rin02]
       o Do we need to rethink algorithms optimized for Send-Receive?
     [rin02]: "Efficient Barrier using Remote Memory Operations on VIA-Based Clusters", R. Gupta, V. Tipparaju, J. Nieplocha, D. K. Panda. Cluster 2002, Chicago, USA.
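
To make the RDMA Write model concrete, here is a minimal C sketch of posting a one-sided write against a VIPL-style interface. The descriptor-field and constant names approximate the VIA Provider Library and may differ from a particular vendor's header; treat this as an illustrative sketch, not the paper's implementation.

    #include <vipl.h>

    /* Sketch: post an RDMA write on an established VI connection.  Field
     * names approximate the VIPL VIP_DESCRIPTOR layout (control segment CS,
     * address/data segments DS) and may vary between providers. */
    void rdma_write(VIP_VI_HANDLE vi,
                    void *local_buf,  VIP_MEM_HANDLE local_mh,
                    void *remote_buf, VIP_MEM_HANDLE remote_mh,
                    unsigned len)
    {
        VIP_DESCRIPTOR desc;

        desc.CS.Control  = VIP_CONTROL_OP_RDMAWRITE;  /* one-sided write opcode */
        desc.CS.SegCount = 1;                         /* one local gather entry */
        desc.CS.Length   = len;

        desc.DS[0].Remote.Data.Address = remote_buf;  /* destination at receiver */
        desc.DS[0].Remote.Handle       = remote_mh;   /* receiver's registration */

        desc.DS[1].Local.Data.Address  = local_buf;   /* source at sender */
        desc.DS[1].Local.Handle        = local_mh;
        desc.DS[1].Local.Length        = len;

        /* Completes without consuming any receive descriptor at the target. */
        VipPostSend(vi, &desc, local_mh);
    }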

  5. Send-Receive and RDMA Communication Models
     [Figure: the two models side by side. In Send/Recv, the sender and receiver each post a registered descriptor on their NICs and data moves between registered user buffers; in RDMA Write, the sender's descriptor carries the remote address and data is written directly into the receiver's registered user buffer.]

  6. Benefits of RDMA
     - RDMA gives a shared-memory illusion
     - Receive operations are typically expensive
     - RDMA is receiver-transparent
     - Supported by the VIA and InfiniBand architectures
     - A novel, as yet unexplored method for collectives

  7. Contents
     - Motivation
     - Design Issues
       o Buffer Registration
       o Data Validity at Receiver End
       o Buffer Reuse
     - RDMA-based Broadcast
     - RDMA-based All Reduce
     - Conclusions and Future Work

  8. Buffer Registration
     - Static buffer registration (sketched below)
       o A contiguous registered region in memory for every communicator
       o Address exchange is done at initialization time
     - Dynamic buffer registration (rendezvous)
       o User buffers are registered during the operation, when needed
       o Address exchange is done during the operation
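
A minimal sketch of the static scheme (the helper name nic and the constant MAX_PROCS are assumptions for illustration): each process registers one persistent region up front and all-gathers its (address, memory handle) pair, so later collectives can RDMA-write without any per-operation rendezvous.

    /* Sketch of static registration at communicator-initialization time.
     * VipRegisterMem pins the region once, so its memory handle stays
     * valid for the life of the application. */
    typedef struct {
        void           *addr;    /* base of the peer's registered region */
        VIP_MEM_HANDLE  handle;  /* memory handle needed for RDMA writes */
    } rdma_region_t;

    static rdma_region_t peers[MAX_PROCS];

    void static_buffer_init(MPI_Comm comm, size_t bytes)
    {
        VIP_MEM_ATTRIBUTES attr = { 0 };
        rdma_region_t mine;

        mine.addr = malloc(bytes);
        VipRegisterMem(nic, mine.addr, bytes, &attr, &mine.handle);

        /* One-time address exchange replaces a rendezvous on every operation. */
        MPI_Allgather(&mine, sizeof mine, MPI_BYTE,
                      peers,  sizeof mine, MPI_BYTE, comm);
    }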

  9. Data Validity at Receiver End
     - Interrupts: too expensive, and might not be supported
     - Immediate field of the VIA descriptor: consumes a receive descriptor
     - RDMA-write a special byte to a pre-defined location (sketched below)
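
A sketch of the special-byte scheme on the receiver side (names are illustrative): the sender RDMA-writes the payload first and the flag byte last, so the receiver only has to poll one pre-defined location and consumes no receive descriptor.

    /* Receiver side of the flag-byte scheme.  The flag location starts at
     * INVALID; once the sender's ordered RDMA write flips it, the payload
     * in the associated block is known to be complete. */
    #define INVALID (-1)

    void wait_for_data(volatile signed char *flag)
    {
        while (*flag == INVALID)
            ;                    /* spin on the RDMA-written flag byte */
        /* payload in the associated block is now valid */
    }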

  10. Buffer Reuse
     - Static buffer registration: buffers need to be reused, so an explicit notification has to be sent to the sender (sketched below)
     - Dynamic buffer registration: no buffer reuse
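
A sketch of the reuse handshake under static registration (all helper names are hypothetical): after consuming a block, the receiver re-arms its flag and RDMA-writes a one-byte acknowledgement into the sender's notify buffer; the sender must not overwrite the block until the acknowledgements for it have arrived.

    /* Receiver: consume a block, re-arm its flag, and ack the root. */
    void release_block(int block_idx, int my_rank)
    {
        consume(block_data(block_idx));            /* hand data to the application */
        *block_flag(block_idx) = INVALID;          /* re-arm for the next write    */
        rdma_write_byte(root_vi,                   /* 1-byte RDMA write into the   */
                        notify_offset(my_rank),    /* root's notify buffer         */
                        BLOCK_FREE);
    }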

  11. Contents
     - Motivation
     - Design Issues
     - RDMA-based Broadcast
       o Design Issues
       o Experimental Results
       o Analytical Models
     - RDMA-based All Reduce
     - Conclusions and Future Work

  12. Buffer Registration and Initialization
     - Static registration scheme (for size <= 5K bytes)
       [Figure: processes P0-P3, each with constant-size broadcast blocks and a notify buffer; all notification locations are initialized to -1.]
     - Dynamic registration scheme (for size > 5K bytes): the rendezvous scheme

  13. Data Validity at Receiver End
     Broadcast counter = 1 (first broadcast with root P0)
     [Figure: the broadcast blocks of P0-P3, showing the constant block size, the RDMA-written data size and broadcast counter (1), and notification locations still holding -1.]
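
A sketch of the receiver's wait (names illustrative): instead of a single valid/invalid flag, each broadcast carries a monotonically increasing counter, so a value left over from broadcast i can never be mistaken for broadcast i+1.

    /* Receiver polls for the counter of the broadcast it expects next. */
    static int expected = 1;     /* per-communicator broadcast counter */

    void wait_for_bcast(volatile int *bcast_counter)
    {
        while (*bcast_counter != expected)
            ;                    /* spin on the RDMA-written counter */
        expected++;              /* arm for the following broadcast  */
    }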

  14. Buffer Reuse
     [Figure: the broadcast buffers of P0-P3 across two successive broadcasts (counter values 1, then 2), and the notify buffer collecting the receivers' acknowledgements (1) that a block may be reused.]

  15. Performance Test Bed
     - 16 1-GHz Pentium III nodes, 33-MHz PCI bus, 512 MB RAM
     - Machines connected by a GigaNet cLAN 5300 switch
     - MVICH version: mvich-1.0
     - Integration with MVICH-1.0: MPI_Send modified to support RDMA Write
     - Timings were taken for varying block sizes: a trade-off between the number of blocks and the size of each block

  16. RDMA vs. Send-Receive Broadcast (16 nodes)
     [Chart: broadcast latency (us) vs. message size (4 to 4608 bytes) for Send-Receive and for RDMA with block sizes of 1K, 2K, 3K, and 4K bytes.]
     - Improvement ranging from 19.7% (small messages) to 14.4% (large messages)
     - A block size of 3K bytes performs best

  17. Analytical and Experimental Comparison (16 nodes): Broadcast
     [Chart: broadcast latency (us) vs. message size (4 to 4608 bytes), analytical model vs. experimental measurement.]
     - The analytical model is within 7% of the experimental results

  18. RDMA vs. Send-Receive for Large Clusters (Analytical Model Estimates: Broadcast)
     [Charts: estimated broadcast latency (us) vs. message size (4 to 4096 bytes) on 512 and 1024 nodes, Send-Receive vs. RDMA.]
     - Estimated improvement ranging from 16% (small messages) to 21% (large messages) for large clusters of 512 and 1024 nodes

  19. Contents
     - Motivation
     - Design Issues
     - RDMA-based Broadcast
     - RDMA-based All Reduce
       o Degree-K tree
       o Experimental Results (Binomial & Degree-K)
       o Analytical Models (Binomial & Degree-K)
     - Conclusions and Future Work

  20. Degree-K Tree-based Reduce
     [Figure: reduce trees on eight processes P0-P7 for K = 1, K = 3, and K = 7; the bracketed labels mark the communication step in which each message is sent (3 steps for K = 1, 2 for K = 3, 1 for K = 7).]
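
The degree-K idea can be sketched as follows (send_partial_result and recv_and_accumulate are hypothetical helpers): in every round, each group of K+1 surviving ranks combines onto its lowest member, so a reduce over N ranks takes ceil(log_{K+1} N) rounds; K = 1 is the usual binomial tree.

    /* Sketch of a degree-K tree reduce over nprocs ranks. */
    void degree_k_reduce(int rank, int nprocs, int K)
    {
        for (int span = 1; span < nprocs; span *= (K + 1)) {
            if ((rank / span) % (K + 1) != 0) {
                /* I am a child this round: send my partial result up. */
                int parent = rank - ((rank / span) % (K + 1)) * span;
                send_partial_result(parent);
                return;                        /* this rank is done */
            }
            /* I am a parent: combine up to K children's contributions. */
            for (int j = 1; j <= K; j++) {
                int child = rank + j * span;
                if (child < nprocs)
                    recv_and_accumulate(child);
            }
        }
    }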

  21. Experimental Evaluation
     - Integrated into MVICH-1.0
     - Reduction operation = MPI_SUM; data type = INT (4 bytes); count = 1 (4 bytes) to 1024 (4096 bytes)
     - Finding the optimal degree K
     - Experimental vs. analytical (best case & worst case)
     - Experimental and analytical comparison of Send-Receive with RDMA

  22. Choosing the Optimal Degree K for All Reduce
     [Chart and table: All Reduce latency (us) vs. message size (4 bytes to 4 KB) on 16 nodes for degrees 1, 3, 7, and 15, with the optimal degree tabulated for 4, 8, and 16 nodes over the message-size ranges 4-256 bytes, 256 bytes-1 KB, and beyond 1 KB; Degree-1, Degree-3, and Degree-7 each appear as optima.]
     - For lower message sizes, higher degrees perform better than degree-1 (binomial)

  23. Degree-K RDMA-based All Reduce: Analytical Model
     [Chart and table: experimental latency vs. best-case and worst-case analytical estimates (degree 3, 16 nodes), with the optimal degree tabulated for 4 to 1024 nodes over the ranges 4-256 bytes, 256 bytes-1 KB, and beyond 1 KB.]
     - Experimental timings fall between the best-case and the worst-case analytical estimates
     - For lower message sizes, higher degrees perform better than degree-1 (binomial)

  24. Binomial Send-Receive vs. Optimal & Binomial Degree-K RDMA (16 nodes): All Reduce
     [Chart: All Reduce latency (us) vs. message size (4 bytes to 4 KB) for binomial Send-Receive, binomial RDMA, and optimal degree-K RDMA.]
     - Improvement ranging from 38.13% (small messages) to 9% (large messages) for the optimal degree-K RDMA-based All Reduce compared to binomial Send-Receive

  25. Binomial Send-Receive vs. Binomial & Optimal Degree-K All Reduce for Large Clusters
     [Charts: estimated All Reduce latency (us) vs. message size (4 bytes to 4 KB) on 512 and 1024 nodes for binomial Send-Receive, binomial RDMA, and optimal degree-K RDMA (best and worst case).]
     - Improvement ranging from 35-40% (small messages) to 14% (large messages) for the optimal degree-K RDMA-based All Reduce compared to binomial Send-Receive

  26. Contents
     - Motivation
     - Design Issues
     - RDMA-based Broadcast
     - RDMA-based All Reduce
     - Conclusions and Future Work

  27. Conclusions
     - A novel method to implement the collective communication library
     - A degree-K algorithm to exploit the benefits of RDMA
     - Implemented RDMA-based Broadcast and All Reduce
       o Broadcast: 19.7% improvement for small and 14.4% for large messages (16 nodes)
       o All Reduce: 38.13% for small messages, 9.32% for large messages (16 nodes)
     - Analytical models for Broadcast and All Reduce estimate the performance benefits on large clusters
       o Broadcast: 16-21% for 512- and 1024-node clusters
       o All Reduce: 14-40% for 512- and 1024-node clusters

  28. Future Work
     - Exploit the RDMA Read feature where available
       o Round-trip cost design issues
     - Extend to MPI-2.0 one-sided communication
     - Extend the framework to the emerging InfiniBand architecture

  29. Thank You!
     For more information, please visit the NBC home page: http://nowlab.cis.ohio-state.edu
     Network-Based Computing Group, The Ohio State University

  30. Backup Slides

  31. Receiver Side, Best Case for Large Messages (Analytical Model)
     [Timeline: each sender P1-P3 contributes stages Tt, Tn, Ts, To at the receiver.]
     Latency = (Tt * k) + Tn + Ts + To + Tc, where k is the number of sending nodes

  32. Receiver Side, Worst Case for Large Messages (Analytical Model)
     [Timeline: each sender P1-P3 contributes stages Tt, Tn, Ts, To at the receiver; here the receive overhead To is paid once per sender.]
     Latency = (Tt * k) + Tn + Ts + (To * k) + Tc, where k is the number of sending nodes
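
As a sketch, both bounds can be evaluated in a few lines of C; the parameter values would be whatever one measures on a given testbed, and none of the constants below come from the slides.

    /* Receiver-side model: k senders each contribute transmission time Tt;
     * the receive-side overhead To is paid once in the best case and once
     * per sender in the worst case. */
    double model_latency(int k, double Tt, double Tn, double Ts,
                         double To, double Tc, int worst_case)
    {
        double recv_overhead = worst_case ? To * k : To;
        return Tt * k + Tn + Ts + recv_overhead + Tc;
    }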

  33. Buffer Registration and Initialization
     - Static registration scheme (for size <= 5K bytes)
     - Each block is of size 5K+1 bytes; every process has N blocks, where N is the number of processes in the communicator
     [Figure: a process's registered region divided into constant-size (5K+1) blocks, one per peer among P0-P3.]
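
A sketch of the addressing this layout buys (names hypothetical): because every process reserves one fixed 5K+1-byte block per peer, a root with rank r always writes at the same offset in every receiver's region, so no per-message address exchange is needed.

    #define BLOCK_SIZE (5 * 1024 + 1)   /* 5K payload + 1 notification byte */

    /* Block that the root with rank 'root_rank' broadcasts into, within a
     * receiver's registered region. */
    static inline void *bcast_block(void *region_base, int root_rank)
    {
        return (char *)region_base + (size_t)root_rank * BLOCK_SIZE;
    }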

  34. Data Validity at Receiver End
     [Figure: a worked All Reduce (MPI_SUM) example on P0-P3, tracing each process's contributed values (2, 3, 4, 5) and the partial and final computed sums (e.g., 9, 14) written into the data and computed-data buffers.]
