Leveraging MPI's One-Sided Communication Interface for Shared Memory Programming
This presentation discusses using MPI's one-sided communication interface for shared memory programming: the benefits of multi- and manycore systems, the challenges of programming shared memory efficiently, the differences between MPI and OS-level tools, the MPI-3.0 one-sided memory models, and how to create a shared memory window with MPI. It highlights the advantages and complexities of shared memory programming in parallel computing environments.
Presentation Transcript
LEVERAGING MPI'S ONE-SIDED COMMUNICATION INTERFACE FOR SHARED-MEMORY PROGRAMMING
Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur
THE SHARED MEMORY REALITY
Multi- and manycore systems are ubiquitous. They offer shared memory that allows:
1. Sharing of data structures
   + Reduces copies and effective memory consumption
   - But NUMA accesses complicate matters
2. Fast in-memory communication
   + May be faster than MPI
   - But the performance model is very complex
STATE-OF-THE-ART PROGRAMMING
MPI offers shared memory optimizations, but real zero copy is impossible.
MPI+X to utilize shared memory, with X = {OpenMP, pthreads, UPC, ...}:
- Complex interactions between models
- Deadlocks possible
- Race conditions made easy
- Slowdown due to the higher MPI thread level
Requirements are often simple: is switching programming models really necessary?
WHY NOT JUST USE OS TOOLS?
One could use POSIX shm calls to create shared memory segments, but there are several issues:
1. Allocation is not collective, and users would have to deal with NUMA intricacies.
2. Cleanup of shm regions is problematic in the presence of abnormal termination.
3. MPI's interface allows easy support for debuggers and performance tools.
MPI-3.0 ONE-SIDED MEMORY MODELS
MPI offers two memory models:
- Unified: the public and private window copies are identical
- Separate: the public and private window copies are separate
The model is attached to the window as the attribute MPI_WIN_MODEL, with values MPI_WIN_UNIFIED or MPI_WIN_SEPARATE.
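A minimal sketch of querying the model attribute, assuming a window win has already been created (e.g., with MPI_Win_allocate_shared):

```c
/* Sketch: query the memory model of an existing window "win". */
int *model, flag;
MPI_Win_get_attr(win, MPI_WIN_MODEL, &model, &flag);
if (flag && *model == MPI_WIN_UNIFIED) {
    /* public and private copies are kept identical */
} else {
    /* MPI_WIN_SEPARATE: explicit synchronization of the copies is needed */
}
```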
CREATING A SHARED MEMORY WINDOW
MPI_WIN_ALLOCATE_SHARED(
  size      - in bytes (of the calling process)
  disp_unit - addressing offset in bytes
  info      - specify optimization hints
  comm      - input communicator
  baseptr   - returned pointer
  win       - returned window
)
The creation call is collective.
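A minimal sketch of the call, assuming MPI is initialized and shmcomm is a communicator whose processes all share memory (e.g., obtained from MPI_Comm_split_type):

```c
/* Sketch: collectively allocate 1024 doubles per process in a
 * shared-memory window on the communicator "shmcomm". */
double *baseptr;
MPI_Win  win;
MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                        MPI_INFO_NULL, shmcomm, &baseptr, &win);
/* ... work on baseptr (this rank's own segment) ... */
MPI_Win_free(&win);
```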
HOW DO I USE IT?
- All processes in comm must be in shared memory, fulfilling the unified window requirements.
- Each process gets a pointer to its own segment but does not know the other processes' pointers.
- Query function: MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr) queries rank's size, disp_unit, and baseptr.
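A minimal sketch of the query, reusing win from the allocation sketch above and assuming a process rank myrank > 0:

```c
/* Sketch: obtain a direct pointer to the left neighbor's segment. */
MPI_Aint  nbr_size;
int       nbr_disp_unit;
double   *nbr_ptr;
MPI_Win_shared_query(win, myrank - 1, &nbr_size, &nbr_disp_unit,
                     &nbr_ptr);
/* nbr_ptr can now be loaded/stored directly, subject to proper
 * synchronization (e.g., MPI_Win_sync plus a barrier). */
```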
CREATING SHARED MEMORY COMMUNICATORS
Or: how do I know which processes share memory?
MPI_COMM_SPLIT_TYPE(comm, split_type, key, info, newcomm)
With split_type = MPI_COMM_TYPE_SHARED, it splits the communicator into maximal shared memory islands, portably.
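A minimal sketch of creating such a shared-memory communicator from MPI_COMM_WORLD:

```c
/* Sketch: split MPI_COMM_WORLD into shared-memory "islands". */
MPI_Comm shmcomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                    0 /* key: keep existing rank order */,
                    MPI_INFO_NULL, &shmcomm);
/* shmcomm can now be passed to MPI_Win_allocate_shared. */
```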
SCHEMATIC OVERVIEW
MEMORY LAYOUT
Principle of least surprise (default):
- Memory is consecutive across ranks, which allows inter-rank address calculations; i.e., rank i's first byte starts right after rank i-1's last byte.
Optimizations allowed:
- Specify the info key alloc_shared_noncontig; the implementation may then create non-contiguous regions, and win_shared_query must be used.
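A minimal sketch of passing the alloc_shared_noncontig hint; the variables baseptr, win, and shmcomm reuse the earlier sketches and are assumptions:

```c
/* Sketch: allow non-contiguous placement of the per-rank segments,
 * trading inter-rank address arithmetic for better NUMA locality. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "alloc_shared_noncontig", "true");
MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                        info, shmcomm, &baseptr, &win);
MPI_Info_free(&info);
/* Other ranks' addresses must now come from MPI_Win_shared_query. */
```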
IMPLEMENTATION OPTIONS I
Contiguous (default):
- Reduce the total size to rank 0
- Rank 0 creates the shared memory segment
- Broadcast the address and key
- Exscan to get the local offset
O(log P) time and O(P) total storage.
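A rough sketch, inside the implementation rather than user code, of how the per-rank offsets into the single segment can be computed; the names local_bytes, myrank, and shmcomm are illustrative assumptions:

```c
/* Sketch: total size at rank 0 via a reduction, per-rank offset via
 * an exclusive prefix sum over the local sizes. */
MPI_Aint mysize = local_bytes;           /* this rank's contribution */
MPI_Aint total = 0, myoffset = 0;
MPI_Reduce(&mysize, &total, 1, MPI_AINT, MPI_SUM, 0, shmcomm);
MPI_Exscan(&mysize, &myoffset, 1, MPI_AINT, MPI_SUM, shmcomm);
if (myrank == 0) myoffset = 0;  /* Exscan leaves rank 0's result undefined */
/* Rank 0 then creates the segment (e.g., shm_open + ftruncate(total)),
 * broadcasts its name/key, and every rank attaches and adds myoffset. */
```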
IMPLEMENTATION OPTIONS II
Noncontiguous (specify alloc_shared_noncontig):
- Option 1: Each rank creates its own segment
- Option 2: Rank 0 creates one segment but pads each rank's region to page boundaries
USE CASES
1. Share data structures
   - Use hybrid programming where it is efficient, e.g., OpenMP at the loop level
   - Have MPI processes share common memory
   - Retain all MPI features, e.g., collectives
2. Improve communication performance
   - Enables direct access to remote data
   - No need for halo zones (but they often help!)
   - True zero copy in this sense
FAST SHARED MEMORY COMMUNICATION
Two fundamental benefits:
1. Avoids tag matching and the MPI stack
2. Avoids expensive fine-grained synchronization
The full interface is implemented in Open MPI and MPICH2, with similar implementation and performance. Evaluated on a 2.2 GHz six-core AMD Opteron.
FIVE-POINT STENCIL EXAMPLE
An NxN grid decomposed in 2D: Dims_create, Cart_create, Isend/Irecv, Waitall.
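A minimal sketch of the message-passing baseline this slide refers to; the grid data and halo buffers are omitted, only the process-grid set-up and neighbor ranks are shown:

```c
/* Sketch: 2-D process grid and the four neighbors used by the
 * five-point stencil halo exchange. */
int dims[2] = {0, 0}, periods[2] = {0, 0};
int nprocs, up, down, left, right;
MPI_Comm cart;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Dims_create(nprocs, 2, dims);
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
MPI_Cart_shift(cart, 0, 1, &up, &down);
MPI_Cart_shift(cart, 1, 1, &left, &right);
/* Halo exchange: post MPI_Irecv/MPI_Isend for each of the four
 * neighbors, then MPI_Waitall on the eight requests. */
```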
COMMUNICATION TIMES
30-60% lower communication overhead!
NUMA EFFECTS?
The whole array is allocated in shared memory; alloc_shared_noncontig has a significant impact.
SUMMARY
- MPI-3.0 offers support for shared memory: ratified last week, standard available online
- MPICH2 as well as Open MPI implement the complete interface; it should be in official releases soon
- We demonstrated two use cases and showed application speedup for a simple code; performance may vary (it depends on the architecture)
CONCLUSIONS & FUTURE WORK
MPI is (more) ready for multicore!
- Supports coherent shared memory
- Offers an easy-to-use and portable interface
- Mix & match with other MPI functions
We plan to evaluate different use cases and applications. The Forum continues the discussion: non-coherent shared memory?
ACKNOWLEDGMENTS
The MPI Forum, especially the RMA working group!