Optimization Strategies for MPI-Interoperable Active Messages


University of Illinois at Urbana-Champaign

Argonne National Laboratory

ScalCom’13

December 21, 2013

Data-Intensive Applications

• Examples: graph algorithm (BFS), sequence assembly
• Common characteristics
  • Organized around sparse structures
  • Communication-to-computation ratio is high
  • Irregular communication pattern
  • Remote complex computation

Message Passing Models

• MPI: industry standard communication runtime for high performance computing
  • two-sided communication (explicit sends and receives)
  • one-sided (RMA) communication (explicit sends, implicit receives, simple remote operations)
• Active Messages
  • Sender explicitly sends message
  • Upon message's arrival, message handler is triggered; receiver is not explicitly involved
  • User-defined operations on remote process

Past Work: MPI-Interoperable Generalized AM

• MPI runtime must ensure consistency of window
• Achieve pipeline effect and reduce buffer requirement

[ICPADS 2013] X. Zhao, P. Balaji, W. Gropp, R. Thakur, "MPI-Interoperable and Generalized Active Messages", in proceedings of ICPADS'13

How MPI-Interoperable AMs work

• Leveraging MPI RMA interface
  • passive target mode (EXCLUSIVE or SHARED lock)
  • active target mode (Fence or Post-Start-Complete-Wait)

[Figure: the passive target panel shows origin and target with PUT(X), AM(X), GET(Y), and ACC(Z) inside a lock/unlock epoch; the active target panel shows processes 0, 1, and 2 with GET(Y), PUT(X), AM(U), GET(W), AM(M), and PUT(N) in epochs separated by fences.]

Performance Shortcomings with MPI-AM

• Synchronization stalls in data buffering
• Inefficiency in data transmission
• Effective strategies are needed to improve performance
  • System level optimization
  • User level optimization

Opt #1: Auto-Detected Exclusive User Buffer

• MPI internally detects EXCLUSIVE passive mode
• Optimization is transparent to user
• EXCLUSIVE lock: only one hand-shake is required for entire epoch
• SHARED lock: hand-shake is required whenever AM uses user buffer

[Figure: origin/target timelines of a lock/unlock epoch containing AM 1 and AM 2, with one hand-shake under the EXCLUSIVE lock and two hand-shakes under the SHARED lock.]

Opt #2: User-Defined Exclusive User Buffer

• User can define amount of user buffer guaranteed available for certain processes
• Beneficial for SHARED passive mode and active mode
• No hand-shake operation is required!

Opt #3: Improving Data Transmission

• Two different models
  • Contiguous output data layout
    • Really straightforward and convenient?
    • Out-of-order AMs require buffering or reordering
  • Non-contiguous output data layout
    • Requires new API (not a big deal)
    • Must transfer back count array
    • Packing and unpacking
    • No buffering or reordering is needed
• Data packing vs. data transmission
  • Controlled by system-specific threshold

Experimental Settings

• BLUES cluster at ANL: 310 nodes with 16 cores each, connected with QLogic QDR InfiniBand
• Based on MPICH-3.1b1
• Micro-benchmarks: two common operations
  • Remote search of string sequences (20 characters per sequence)
  • Remote summation of absolute values in two arrays (100 integers per array)
  • Result data is returned

Effect of Exclusive User Buffer

• scalability: 25% improvement

Comparison between MPIX_AM and MPIX_AMV

• system-specific threshold helps eliminate packing and unpacking overhead
• MPIX_AMV(0.8) transmits more data than MPIX_AM due to additional counts array

Conclusion

• Data-intensive applications are increasingly important, but MPI is not a well-suited model for them
• We proposed the MPI-AM framework in our previous work to make data-intensive applications more efficient and require less programming effort
• There are performance shortcomings in the current MPI-AM framework
• Our optimization strategies, including auto-detected and user-defined methods, can effectively reduce synchronization overhead and improve efficiency of data transmission

Thanks for your attention!

BACKUP

Data-Intensive Applications

• "Traditional" applications
  • Organized around dense vectors or matrices
  • Regular communication, use MPI SEND/RECV or collectives
  • Communication-to-computation ratio is low
  • Example: stencil computation, matrix multiplication, FFT
• Data-intensive applications
  • Organized around graphs, sparse vectors
  • Communication pattern is irregular and data-dependent
  • Communication-to-computation ratio is high
  • Example: bioinformatics, social network analysis

Vector Version of AM API

Effect of Exclusive User Buffer

• scalability: 25% improvement
• providing more exclusive buffer greatly reduces contention

Auto-detected Exclusive User Buffer

• Handle detaching user buffers
  • One more hand-shake is required after MPI_WIN_FLUSH, because user buffer may be detached on target
  • User can pass a hint to tell MPI that there will be no buffer detachment; in that case, MPI will eliminate the hand-shake after MPI_WIN_FLUSH

[Figure: origin/target timelines of an EXCLUSIVE lock/unlock epoch with AM 1 through AM 5, a flush, a "detach user buffer" step on the target, and two hand-shakes.]

Background: MPI One-sided Synchronization Modes

• Two synchronization modes
  • passive target mode (SHARED or EXCLUSIVE lock/lock_all)
  • active target mode (post-start-complete-wait/fence)

[Figure labels: exposure epoch]

Slide Note

15 minutes, 12 minutes + 3 minutes Q&A


The study delves into optimization strategies for MPI-interoperable active messages, focusing on data-intensive applications like graph algorithms and sequence assembly. It explores message passing models in MPI, past work on MPI-interoperable and generalized active messages, and how MPI-interoperable active messages work by leveraging the MPI RMA interface.

Uploaded on Sep 21, 2024


Presentation Transcript

OPTIMIZATION STRATEGIES FOR MPI-INTEROPERABLE ACTIVE MESSAGES
Xin Zhao, Pavan Balaji, William Gropp, Rajeev Thakur
University of Illinois at Urbana-Champaign
Argonne National Laboratory
ScalCom'13, December 21, 2013

Data-Intensive Applications

• Examples: graph algorithm (BFS), sequence assembly
• Common characteristics
  • Organized around sparse structures
  • Communication-to-computation ratio is high
  • Irregular communication pattern
  • Remote complex computation

[Figure: a local node performs a remote search on a remote node; the reads ACGCGATTCAG and GCGATTCAGTA are shown with the DNA consensus sequence ACGCGATTCAGTA.]

Message Passing Models

• MPI: industry standard communication runtime for high performance computing
  • two-sided communication (explicit sends and receives)
  • one-sided (RMA) communication (explicit sends, implicit receives, simple remote operations)
• Active Messages
  • Sender explicitly sends message
  • Upon message's arrival, message handler is triggered; receiver is not explicitly involved
  • User-defined operations on remote process

[Figure: Process 0 and Process 1 exchange Send (data) / Receive (data) in the two-sided panel; Process 0 issues Put (data), Get (data), and Acc (data) (+=) to Process 1 in the one-sided panel; in the active message panel, the origin's message triggers the messages handler on the target, and a reply handler runs on the origin.]
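The active message idea on this slide can be illustrated with a toy, single-process model. This is only a sketch, not the MPI-AM API: the `Process` class, its `register` and `deliver` methods, and the `count_greater` handler are hypothetical names invented for illustration. The point it shows is that the target never posts an explicit receive; a handler registered in advance runs when the message arrives, and its return value plays the role of the reply.

```python
from typing import Callable, Dict, List


class Process:
    """Toy stand-in for a remote process that owns some data (hypothetical)."""

    def __init__(self, data: List[int]) -> None:
        self.data = data
        self.handlers: Dict[str, Callable[[List[int], int], int]] = {}

    def register(self, name: str, handler: Callable[[List[int], int], int]) -> None:
        # The target registers user-defined operations ahead of time.
        self.handlers[name] = handler

    def deliver(self, name: str, payload: int) -> int:
        # Stands in for the runtime: on message arrival the handler is
        # triggered; application code on the target never calls "receive".
        return self.handlers[name](self.data, payload)


def count_greater(data: List[int], threshold: int) -> int:
    """User-defined remote computation: count elements above a threshold."""
    return sum(1 for x in data if x > threshold)


target = Process([3, 8, 1, 9, 4])
target.register("count_greater", count_greater)

# The origin "sends" an active message; the return value models the reply.
reply = target.deliver("count_greater", 3)
print(reply)
```

In a real runtime the delivery would cross the network and the handler would run inside the communication library; here everything is a local function call.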

Past Work: MPI-Interoperable Generalized AM

• MPI-AM: an MPI-interoperable framework that can dynamically manage data movement and user-defined remote computation
• Streaming AMs
  • define "segment": minimum number of elements for AM execution
  • achieve pipeline effect and reduce buffer requirement
• Explicit and implicit buffer management
  • user buffers: rendezvous protocol, guarantee correct execution
  • system buffers: eager protocol, not always enough
• Correctness semantics
  • Memory consistency: MPI runtime must ensure consistency of window
  • Three different types of ordering
  • Concurrency: by default, MPI runtime behaves as if AMs are executed in sequential order. User can release concurrency by setting MPI assert.

[Figure: MPI-AM workflow, with labels origin input buffer, origin output buffer, AM input data, AM output data, target input buffer, target output buffer, target persistent buffer, and AM handler.]

[Figure: UNIFIED window model (memory barriers around the AM handler) and SEPARATE window model (memory barriers around the AM handler, plus flush cache line back).]

[ICPADS 2013] X. Zhao, P. Balaji, W. Gropp, R. Thakur, "MPI-Interoperable and Generalized Active Messages", in proceedings of ICPADS'13
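The streaming idea (run the handler as soon as one segment's worth of elements is available, rather than buffering the whole input) can be sketched as follows. This is a toy model under stated assumptions: `stream_am` is a hypothetical name, the handler is an ordinary Python function, and running the handler on a trailing partial segment is a choice made here for illustration, not something the slides specify.

```python
from typing import Callable, Iterable, List, Tuple


def stream_am(
    elements: Iterable[int],
    segment_size: int,
    handler: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Apply `handler` per segment as elements arrive.

    Returns the per-segment outputs and the peak number of elements that
    had to be buffered at any one time.
    """
    outputs: List[int] = []
    buffered: List[int] = []
    peak = 0
    for e in elements:
        buffered.append(e)
        peak = max(peak, len(buffered))
        if len(buffered) == segment_size:
            # A full segment is the minimum unit the handler can run on.
            outputs.append(handler(buffered))
            buffered = []
    if buffered:
        # Assumption: a trailing partial segment is still processed.
        outputs.append(handler(buffered))
    return outputs, peak


outputs, peak = stream_am(range(1, 9), segment_size=4, handler=sum)
print(outputs, peak)
```

With eight input elements and a segment size of four, the buffer never holds more than four elements, which is the "reduce buffer requirement" effect in miniature.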

How MPI-Interoperable AMs work

• Leveraging MPI RMA interface
  • passive target mode (EXCLUSIVE or SHARED lock)
  • active target mode (Fence or Post-Start-Complete-Wait)

[Figure: passive target mode shows origin and target with a lock/unlock epoch; active target mode shows processes 0, 1, and 2 with epochs separated by fences.]

Performance Shortcomings with MPI-AM

• Synchronization stalls in data buffering
• Inefficiency in data transmission
• Effective strategies are needed to improve performance
  • System level optimization
  • User level optimization

[Figure: ORIGIN/TARGET timeline. Segment 1 is received in the system buffer; the origin stalls while a user buffer is reserved; once the AM on the system buffer is finished, segment 2 is received in the user buffer and segment 3 in the system buffer.]

[Figure: a local node sends query DNA sequences (ACGCGATTCAG, GCGATTCAGTA) for remote search on remote node 1 and remote node 2, which hold consensus DNA sequences (ACGCGATTCAGTA).]

Opt #1: Auto-Detected Exclusive User Buffer

• MPI internally detects EXCLUSIVE passive mode
• Optimization is transparent to user
• EXCLUSIVE lock: only one hand-shake is required for entire epoch
• SHARED lock: hand-shake is required whenever AM uses user buffer

[Figure: origin/target timelines of a lock/unlock epoch under an EXCLUSIVE lock and under a SHARED lock.]
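The saving from Opt #1 can be made concrete by counting hand-shakes per epoch. The function below is a toy cost model, not code from MPI-AM: `handshakes_per_epoch` is a hypothetical name, and returning zero when no AM touches the user buffer is an assumption made here.

```python
def handshakes_per_epoch(lock_type: str, ams_using_user_buffer: int) -> int:
    """Count hand-shakes in one lock/unlock epoch (toy model).

    EXCLUSIVE: one hand-shake covers the entire epoch.
    SHARED:    one hand-shake for every AM that uses the user buffer.
    """
    if ams_using_user_buffer < 0:
        raise ValueError("number of AMs must be non-negative")
    if lock_type == "EXCLUSIVE":
        # Assumption: no hand-shake at all if no AM needs the user buffer.
        return min(1, ams_using_user_buffer)
    if lock_type == "SHARED":
        return ams_using_user_buffer
    raise ValueError(f"unknown lock type: {lock_type}")


for lock in ("EXCLUSIVE", "SHARED"):
    print(lock, handshakes_per_epoch(lock, 2))
```

For the two-AM epoch drawn on the slide, the model gives one hand-shake under EXCLUSIVE and two under SHARED; the gap grows linearly with the number of AMs in the epoch.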

Opt #2: User-Defined Exclusive User Buffer

• User can define amount of user buffer guaranteed available for certain processes
• Beneficial for SHARED passive mode and active mode
• No hand-shake operation is required!

[Figure: processes 0, 1, and 2 each call win_create, followed by fence-delimited epochs; a rank/size table lists rank 0: 20 MB and rank 2: 30 MB.]
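One way to picture the user-defined reservation is a per-origin table consulted on the target side. The sketch below is an assumption-heavy toy: the `TargetWindow` class and `needs_handshake` method are hypothetical, the 20 MB and 30 MB figures are taken from the slide's rank/size table, and falling back to a hand-shake when a request exceeds the reservation is a guess at reasonable behavior rather than documented MPI-AM semantics.

```python
from typing import Dict

MB = 1024 * 1024


class TargetWindow:
    """Toy model of a window with per-origin reserved user buffer."""

    def __init__(self, reserved_bytes: Dict[int, int]) -> None:
        # Maps origin rank -> bytes of user buffer guaranteed to that rank.
        self.reserved_bytes = dict(reserved_bytes)

    def needs_handshake(self, origin_rank: int, nbytes: int) -> bool:
        # Within the reservation, buffer availability is already guaranteed,
        # so the origin can send without asking first.
        # Assumption: anything beyond the reservation falls back to a hand-shake.
        return nbytes > self.reserved_bytes.get(origin_rank, 0)


win = TargetWindow({0: 20 * MB, 2: 30 * MB})
print(win.needs_handshake(0, 10 * MB))  # rank 0 stays within its 20 MB
print(win.needs_handshake(1, 1))        # rank 1 has no reservation
```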

Opt #3: Improving Data Transmission

• Two different models
  • Contiguous output data layout
    • Really straightforward and convenient?
    • Out-of-order AMs require buffering or reordering
  • Non-contiguous output data layout
    • Requires new API (not a big deal)
    • Must transfer back count array
    • Packing and unpacking
    • No buffering or reordering is needed
• Data packing vs. data transmission
  • Controlled by system-specific threshold

[Figure: two panels, each showing an AM split into pipeline unit #1 and pipeline unit #2, with segments #1 through #4 moving between TARGET and ORIGIN; the non-contiguous panel adds pack steps on the target and unpack steps on the origin.]
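The trade-off between packing and transmitting unused space can be sketched with a small decision function. This is a toy model, not the MPICH implementation: `plan_output_transfer` is a hypothetical name, the 4-byte element and count sizes are assumptions, and the interpretation of the threshold (skip packing when the fraction of useful data reaches it) is a guess that happens to be consistent with the slide's remark that MPIX_AMV(0.8) can transmit more data than MPIX_AM.

```python
from typing import List, Tuple


def plan_output_transfer(
    counts: List[int],
    segment_capacity: int,
    threshold: float,
    elem_size: int = 4,
    count_size: int = 4,
) -> Tuple[str, int]:
    """Decide whether to pack output segments before sending them back.

    counts[i] is the number of useful elements in output segment i, each
    segment having room for `segment_capacity` elements. Returns the chosen
    action and the number of bytes transferred, including the count array.
    """
    useful = sum(counts)
    total = segment_capacity * len(counts)
    counts_bytes = len(counts) * count_size  # the count array always travels
    if useful / total >= threshold:
        # Mostly useful data: sending the holes is cheaper than pack/unpack.
        return "send_unpacked", total * elem_size + counts_bytes
    # Sparse output: pack the useful elements and send only those.
    return "pack", useful * elem_size + counts_bytes


counts = [9, 9, 8, 10]  # 36 useful elements out of 40
print(plan_output_transfer(counts, segment_capacity=10, threshold=0.8))
print(plan_output_transfer(counts, segment_capacity=10, threshold=1.0))
```

With 90% useful data, a threshold of 0.8 skips packing and sends the full 40 elements plus four counts, while a threshold of 1.0 packs and sends only the 36 useful elements plus the counts.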

Experimental Settings

• BLUES cluster at ANL: 310 nodes with 16 cores each, connected with QLogic QDR InfiniBand
• Based on MPICH-3.1b1
• Micro-benchmarks: two common operations
  • Remote search of string sequences (20 characters per sequence)
  • Remote summation of absolute values in two arrays (100 integers per array)
  • Result data is returned
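For orientation, the two micro-benchmark operations could look roughly like the handlers below. These are illustrative guesses, not the benchmark code: the function names are hypothetical, the summation is assumed to be element-wise (|a[i]| + |b[i]|), and the search is assumed to return the indices of exact matches in the target's sequence store.

```python
from typing import List


def abs_sum_handler(a: List[int], b: List[int]) -> List[int]:
    """Element-wise sum of absolute values of two equally sized arrays."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [abs(x) + abs(y) for x, y in zip(a, b)]


def search_handler(stored: List[str], query: str) -> List[int]:
    """Return positions of stored sequences that exactly match the query."""
    return [i for i, seq in enumerate(stored) if seq == query]


# 100 integers per array, as in the experimental settings.
a = [(-1) ** i * i for i in range(100)]
b = [-i for i in range(100)]
result = abs_sum_handler(a, b)

# 20 characters per sequence, as in the experimental settings.
stored = ["ACGCGATTCAGACGCGATTC", "GCGATTCAGTAGCGATTCAG"]
hits = search_handler(stored, "GCGATTCAGTAGCGATTCAG")
print(result[:3], hits)
```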

Effect of Exclusive User Buffer

[Figure: two plots of latency (us) vs. number of segments in AM operation (100 to 1600), comparing base-impl, excl-lock-opt-impl, and win-opt-impl; one panel is labeled "synchronization".]

[Figure: execution time (ms) vs. number of processes (2 to 2048), comparing basic-impl and excl-lock-opt-impl. Annotation: scalability: 25% improvement.]

Comparison between MPIX_AM and MPIX_AMV

[Figure: operation throughput (#ops/s) vs. percentage of useful data per output segment (10% to 90%), comparing MPIX_AM, MPIX_AMV (1.0), and MPIX_AMV (0.8). Annotation: system-specific threshold helps eliminate packing and unpacking overhead.]

[Figure: transferred data size per AM (bytes) vs. percentage of useful data per output segment (10% to 90%), comparing MPIX_AM, MPIX_AMV (1.0), and MPIX_AMV (0.8). Annotation: MPIX_AMV(0.8) transmits more data than MPIX_AM due to additional counts array.]

Conclusion

• Data-intensive applications are increasingly important, but MPI is not a well-suited model for them
• We proposed the MPI-AM framework in our previous work to make data-intensive applications more efficient and require less programming effort
• There are performance shortcomings in the current MPI-AM framework
• Our optimization strategies, including auto-detected and user-defined methods, can effectively reduce synchronization overhead and improve efficiency of data transmission

Our previous work on MPI-AM:

• Towards Asynchronous and MPI-Interoperable Active Messages. Xin Zhao, Darius Buntinas, Judicael Zounmevo, James Dinan, David Goodell, Pavan Balaji, Rajeev Thakur, Ahmad Afsahi, William Gropp. In proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013).
• MPI-Interoperable Generalized Active Messages. Xin Zhao, Pavan Balaji, William Gropp, Rajeev Thakur. In proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013).

They can be found at http://web.engr.illinois.edu/~xinzhao3/

More about the MPI-3 RMA interface can be found in the MPI-3 standard (http://www.mpi-forum.org/docs/)

Thanks for your attention!

BACKUP

Data-Intensive Applications

• "Traditional" applications
  • Organized around dense vectors or matrices
  • Regular communication, use MPI SEND/RECV or collectives
  • Communication-to-computation ratio is low
  • Example: stencil computation, matrix multiplication, FFT
• Data-intensive applications
  • Organized around graphs, sparse vectors
  • Communication pattern is irregular and data-dependent
  • Communication-to-computation ratio is high
  • Example: bioinformatics, social network analysis

Vector Version of AM API

MPIX_AMV
  IN     origin_input_addr
  IN     origin_input_segment_count
  IN     origin_input_datatype
  OUT    origin_output_addr
  IN     origin_output_segment_count
  IN     origin_output_datatype
  IN     num_segments
  IN     target_rank
  IN     target_input_datatype
  IN     target_persistent_disp
  IN     target_persistent_count
  IN     target_persistent_datatype
  IN     target_output_datatype
  IN     am_op
  IN     win
  OUT    output_segment_counts

MPIX_AMV_USER_FUNCTION
  IN     input_addr
  IN     input_segment_count
  IN     input_datatype
  INOUT  persistent_addr
  INOUT  persistent_count
  INOUT  persistent_datatype
  OUT    output_addr
  OUT    output_segment_count
  OUT    output_datatype
  IN     num_segments
  IN     segment_offset
  OUT    output_segment_counts

Effect of Exclusive User Buffer

[Figure: execution time (ms) vs. number of processes (2 to 2048), comparing basic-impl and excl-lock-opt-impl. Annotation: scalability: 25% improvement.]

[Figure: execution time (ms) vs. number of processes (2 to 2048), comparing basic-impl, win-opt-impl (20MB), and win-opt-impl (40MB). Annotation: providing more exclusive buffer greatly reduces contention.]

Auto-detected Exclusive User Buffer

• Handle detaching user buffers
  • One more hand-shake is required after MPI_WIN_FLUSH, because user buffer may be detached on target
  • User can pass a hint to tell MPI that there will be no buffer detachment; in that case, MPI will eliminate the hand-shake after MPI_WIN_FLUSH

[Figure: origin/target timelines of an EXCLUSIVE lock/unlock epoch with a flush and a "detach user buffer" step on the target.]
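The effect of the no-detach hint can be expressed as a small extension of the hand-shake count for an EXCLUSIVE epoch. This is a toy model only: `exclusive_epoch_handshakes` and its `no_detach_hint` parameter are hypothetical names, and the actual hint mechanism is not described on the slide.

```python
def exclusive_epoch_handshakes(num_flushes: int, no_detach_hint: bool = False) -> int:
    """Hand-shakes in one EXCLUSIVE epoch that contains `num_flushes` flushes.

    Without the hint, each MPI_WIN_FLUSH forces one more hand-shake because
    the target may have detached the user buffer in the meantime. With the
    hint, the single hand-shake at the start of the epoch is enough.
    """
    if num_flushes < 0:
        raise ValueError("num_flushes must be non-negative")
    if no_detach_hint:
        return 1
    return 1 + num_flushes


print(exclusive_epoch_handshakes(1))                       # one flush, no hint
print(exclusive_epoch_handshakes(1, no_detach_hint=True))  # one flush, with hint
```

The slide's figure corresponds to the first call: one hand-shake at the start of the epoch and a second one after the flush.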

Background: MPI One-sided Synchronization Modes

• Two synchronization modes
  • active target mode (post-start-complete-wait/fence)
  • passive target mode (SHARED or EXCLUSIVE lock/lock_all)

[Figure: in active target mode, the origin's access epoch (start to complete) is paired with the target's exposure epoch (post to wait); in passive target mode, the origin's epoch runs from lock to unlock.]
