Optimization Strategies for MPI-Interoperable Active Messages

Xin Zhao, Pavan Balaji, William Gropp, Rajeev Thakur
University of Illinois at Urbana-Champaign
Argonne National Laboratory
ScalCom'13, December 21, 2013
Data-Intensive Applications
Examples: graph algorithms (BFS), sequence assembly
Common characteristics:
Organized around sparse structures
High communication-to-computation ratio
Irregular communication pattern
Complex remote computation
Message Passing Models
MPI: the industry-standard communication runtime for high-performance computing.
Two-sided communication: explicit sends and receives (Send(data), Receive(data)).
One-sided (RMA) communication: explicit sends, implicit receives, simple remote operations (Put(data), Get(data), Acc(data) +=).
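The contrast between the two models can be sketched in a few lines. This is a toy simulation, not MPI itself: the `Process`, `send`, `put`, and `accumulate` names here are illustrative stand-ins showing that a two-sided transfer completes only when the receiver participates, while a one-sided operation acts on the target's exposed memory alone.

```python
class Process:
    def __init__(self):
        self.inbox = []      # staging area for two-sided messages
        self.window = {}     # memory exposed for one-sided (RMA-style) access

# two-sided: explicit send AND explicit receive
def send(data, dest):
    dest.inbox.append(data)

def receive(proc):
    return proc.inbox.pop(0)          # target must actively call receive

# one-sided: origin acts alone; target is not explicitly involved
def put(data, target, key):
    target.window[key] = data

def get(target, key):
    return target.window[key]

def accumulate(data, target, key):    # simple remote operation, i.e. "+="
    target.window[key] = target.window.get(key, 0) + data

p0, p1 = Process(), Process()
send(42, p1)
assert receive(p1) == 42              # receiver had to participate

put(7, p1, "x")
accumulate(3, p1, "x")
assert get(p1, "x") == 10             # target never executed any call
```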
Active Messages
The sender explicitly sends a message; upon the message's arrival, a message handler is triggered, and the receiver is not explicitly involved. This enables user-defined operations on a remote process.
Correctness semantics:
Memory consistency: the MPI runtime must ensure consistency of the window.
Ordering: three different types of ordering.
Concurrency: by default, the MPI runtime behaves as if AMs are executed in sequential order; the user can relax this and allow concurrent execution by setting an MPI assert.
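The active-message dispatch described above can be modeled in a few lines. This is a conceptual sketch, not any MPI API: the `AMRuntime` class and handler names are invented for illustration, to show that arrival triggers a user-defined handler without the receiver posting anything, and that handlers run in sequential (arrival) order by default.

```python
class AMRuntime:
    """Toy target-side runtime: delivery triggers a registered handler."""
    def __init__(self):
        self.handlers = {}
        self.state = {}          # target-side data the handlers operate on

    def register(self, name, fn):
        self.handlers[name] = fn

    def deliver(self, name, payload):
        # message arrival triggers the handler; the receiver process is
        # not explicitly involved, and (by default) handlers behave as if
        # executed in sequential arrival order
        return self.handlers[name](self.state, payload)

target = AMRuntime()
target.register("append", lambda st, x: st.setdefault("log", []).append(x))
target.register("abs_sum", lambda st, xs: sum(abs(v) for v in xs))

target.deliver("append", "first")
target.deliver("append", "second")
assert target.state["log"] == ["first", "second"]   # sequential order preserved
assert target.deliver("abs_sum", [-1, 2, -3]) == 6  # user-defined remote computation
```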
Past Work: MPI-Interoperable Generalized AM
MPI-AM: an MPI-interoperable framework that can dynamically manage data movement and user-defined remote computation.
Streaming AMs: define a segment, the minimum number of elements for an AM execution, to achieve a pipeline effect and reduce the buffer requirement.
Explicit and implicit buffer management:
user buffers: rendezvous protocol, guarantees correct execution
system buffers: eager protocol, not always enough
MPI-AM workflow: memory barriers around the AM handler keep the window consistent under both the SEPARATE and UNIFIED window models.
[ICPADS 2013] X. Zhao, P. Balaji, W. Gropp, R. Thakur, "MPI-Interoperable and Generalized Active Messages", in proceedings of ICPADS'13.
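The streaming idea above can be made concrete with a small model. This is a pure-Python sketch, not the MPI-AM runtime: it shows that when the input is processed one segment at a time, the target's buffer requirement is bounded by the segment size rather than the total message size, which is what enables the pipeline effect.

```python
def stream_am(elements, segment_size, handler):
    """Execute an AM handler segment by segment, tracking peak buffering."""
    results, max_buffered = [], 0
    for i in range(0, len(elements), segment_size):
        segment = elements[i:i + segment_size]    # buffer only this segment
        max_buffered = max(max_buffered, len(segment))
        results.extend(handler(segment))          # AM executes on the segment
    return results, max_buffered

# handler stands in for some remote computation applied per segment
doubled, peak = stream_am(list(range(10)), 4, lambda seg: [2 * v for v in seg])
assert doubled == [2 * v for v in range(10)]   # same result as one big AM
assert peak == 4                               # buffering bounded by segment size
```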
How MPI-Interoperable AMs Work
Leveraging the MPI RMA interface, AMs fit into existing RMA epochs alongside PUT, GET, and ACC operations (diagram: lock/unlock and fence epochs across three processes, mixing PUT, GET, ACC, and AM operations).
passive target mode (EXCLUSIVE or SHARED lock)
active target mode (Fence or Post-Start-Complete-Wait)
Performance Shortcomings with MPI-AM
Synchronization stalls in data buffering
Inefficiency in data transmission
Effective strategies are needed to improve performance, at both the system level and the user level.
Opt #1: Auto-Detected Exclusive User Buffer
MPI internally detects the EXCLUSIVE passive mode; the optimization is transparent to the user.
With an EXCLUSIVE lock, only one hand-shake is required for the entire epoch; with a SHARED lock, a hand-shake is required whenever an AM uses the user buffer.
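The saving can be put in back-of-the-envelope terms. The counting model below is illustrative, not taken from the MPI-AM implementation: it just expresses the rule stated on the slide, one hand-shake per epoch under a detected EXCLUSIVE lock versus one per user-buffer AM under a SHARED lock.

```python
def handshakes(mode, ams_using_user_buffer):
    """Count hand-shakes in one epoch under the slide's rule."""
    if mode == "EXCLUSIVE":
        # auto-detected exclusive access: one hand-shake covers the epoch
        return 1 if ams_using_user_buffer > 0 else 0
    if mode == "SHARED":
        # each AM that falls back to the user buffer needs its own hand-shake
        return ams_using_user_buffer
    raise ValueError(f"unknown mode: {mode}")

assert handshakes("SHARED", 8) == 8
assert handshakes("EXCLUSIVE", 8) == 1   # 7 round trips eliminated
```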
Opt #2: User-Defined Exclusive User Buffer
The user can define the amount of user buffer guaranteed to be available for certain processes.
Beneficial for SHARED passive mode and active mode: no hand-shake operation is required!
Opt #3: Improving Data Transmission
Two different models:
Contiguous output data layout: seemingly straightforward and convenient, but out-of-order AMs require buffering or reordering.
Non-contiguous output data layout: requires a new API (not a big deal) and must transfer back a count array, with packing and unpacking, but no buffering or reordering is needed.
The choice between data packing and direct data transmission is controlled by a system-specific threshold.
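The threshold decision above can be sketched as a simple cost comparison. The cost constants and the 0.8 default below are made up for illustration; the real threshold is system-specific, as the slide says. The point is the shape of the trade-off: packing only the useful elements saves transfer bytes but pays a pack/unpack CPU cost, so it wins only when the useful-data fraction is low.

```python
def choose_transmission(useful, total, threshold=0.8,
                        byte_cost=1.0, pack_cost_per_byte=0.3):
    """Pick packed vs. contiguous transfer based on useful-data fraction."""
    frac = useful / total
    packed_cost = useful * (byte_cost + pack_cost_per_byte)  # fewer bytes, CPU cost
    contiguous_cost = total * byte_cost                      # all bytes, no packing
    if frac >= threshold:
        return "contiguous", contiguous_cost   # mostly useful data: skip packing
    return "packed", packed_cost

mode_sparse, _ = choose_transmission(useful=10, total=100)
mode_dense, _ = choose_transmission(useful=95, total=100)
assert mode_sparse == "packed"
assert mode_dense == "contiguous"
```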
Experimental Settings
BLUES cluster at ANL: 310 nodes, each with 16 cores, connected with QLogic QDR InfiniBand.
Based on MPICH-3.1b1.
Micro-benchmarks, two common operations:
Remote search of string sequences (20 characters per sequence)
Remote summation of absolute values in two arrays (100 integers per array)
Result data is returned.
Effect of Exclusive User Buffer
Scalability: 25% improvement.
Comparison between MPIX_AM and MPIX_AMV
The system-specific threshold helps eliminate packing and unpacking overhead.
MPIX_AMV(0.8) transmits more data than MPIX_AM due to the additional counts array.
Conclusion
Data-intensive applications are increasingly important, but MPI alone is not a well-suited model for them.
We proposed the MPI-AM framework in our previous work to make data-intensive applications more efficient while requiring less programming effort.
There are performance shortcomings in the current MPI-AM framework.
Our optimization strategies, including auto-detected and user-defined methods, can effectively reduce synchronization overhead and improve the efficiency of data transmission.
Our previous work on MPI-AM:
"Towards Asynchronous and MPI-Interoperable Active Messages". Xin Zhao, Darius Buntinas, Judicael Zounmevo, James Dinan, David Goodell, Pavan Balaji, Rajeev Thakur, Ahmad Afsahi, William Gropp. In proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013).
"MPI-Interoperable Generalized Active Messages". Xin Zhao, Pavan Balaji, William Gropp, Rajeev Thakur. In proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013).
They can be found at http://web.engr.illinois.edu/~xinzhao3/
More about the MPI-3 RMA interface can be found in the MPI-3 standard (http://www.mpi-forum.org/docs/)

Thanks for your attention!
BACKUP
Data-Intensive Applications
Traditional applications:
Organized around dense vectors or matrices
Regular communication, using MPI SEND/RECV or collectives
Low communication-to-computation ratio
Examples: stencil computation, matrix multiplication, FFT
Data-intensive applications:
Organized around graphs and sparse vectors
Irregular, data-dependent communication pattern
High communication-to-computation ratio
Examples: bioinformatics, social network analysis
Vector Version of AM API

MPIX_AMV
    IN    origin_input_addr
    IN    origin_input_segment_count
    IN    origin_input_datatype
    OUT   origin_output_addr
    IN    origin_output_segment_count
    IN    origin_output_datatype
    IN    num_segments
    IN    target_rank
    IN    target_input_datatype
    IN    target_persistent_disp
    IN    target_persistent_count
    IN    target_persistent_datatype
    IN    target_output_datatype
    IN    am_op
    IN    win
    OUT   output_segment_counts

MPIX_AMV_USER_FUNCTION
    IN    input_addr
    IN    input_segment_count
    IN    input_datatype
    INOUT persistent_addr
    INOUT persistent_count
    INOUT persistent_datatype
    OUT   output_addr
    OUT   output_segment_count
    OUT   output_datatype
    IN    num_segments
    IN    segment_offset
    OUT   output_segment_counts
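To see why the vector API returns per-segment output counts, consider a Python stand-in for an MPIX_AMV-style user function. None of these names are real MPI symbols; this hypothetical sketch only shows that each input segment may produce a different number of output elements, and the counts array is what lets the origin interpret the non-contiguous result.

```python
def amv_user_function(input_segments):
    """Toy AMV-style handler: each segment yields a variable-length output."""
    output_segments, output_segment_counts = [], []
    for seg in input_segments:
        out = [v for v in seg if v % 2 == 0]      # e.g. keep only matching elements
        output_segments.append(out)
        output_segment_counts.append(len(out))    # counts differ per segment
    return output_segments, output_segment_counts

segs, counts = amv_user_function([[1, 2, 3, 4], [5, 7], [6, 8]])
assert counts == [2, 0, 2]          # the counts array transferred back to origin
assert segs == [[2, 4], [], [6, 8]]
```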
Effect of Exclusive User Buffer
Scalability: 25% improvement.
Providing more exclusive buffer greatly reduces contention.
Auto-Detected Exclusive User Buffer: Handling Detached User Buffers
One more hand-shake is required after MPI_WIN_FLUSH, because the user buffer may be detached on the target.
The user can pass a hint telling MPI that there will be no buffer detachment; in that case, MPI eliminates the hand-shake after MPI_WIN_FLUSH.
(Diagram: an EXCLUSIVE lock epoch with AMs 1-5, in which the user buffer is detached on the target after a flush, forcing a second hand-shake.)
Background: MPI One-Sided Synchronization Modes
Two synchronization modes:
passive target mode (SHARED or EXCLUSIVE lock/lock_all)
active target mode (post-start-complete-wait/fence), where the target opens an exposure epoch matching the origin's access epoch
Slide Note: 15 minutes (12-minute talk + 3-minute Q&A)

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#