Challenges and Feedback from 2009 Sonoma MPI Community Sessions

 
(Old) MPI Community feedback
 
Jeff Squyres
Cisco Systems
 
Old Community Feedback
 
This presentation is a union of:
2009 Sonoma MPI community feedback
2010 Sonoma MPI panel: Cisco + Microsoft

In the process of gathering current MPI community feedback now
That will be a separate presentation
 
2009 Sonoma Workshop MPI Panel
 
Collected feedback from major commercial verbs-based MPI implementations at the time
Coming from the viewpoint of "how to make verbs not suck better"

Put this in the context of 2009
Some of the content discussed here is different now
Some is not
 
OpenFabrics Feedback
 
Summary of identified challenges from:
Open MPI (including Sun ClusterTools)
HP MPI
Intel MPI
Platform MPI
Items are listed in priority order
Some items are suggestions / requests
Some items are shared by at least two of these MPIs
 
OpenFabrics Feedback
 
1. Memory registration: painful, dangerous
2. fork() support inadequate
3. Connection setup scalability problematic
4. Relaxed ordering verbs API support
5. [Lack of] API portability
6. Need reliable connectionless (scalability)
7. Need better S/R registered memory utilization
8. CMs are waaaay too complex
 
1. Memory Registration
 
Must be creative to deal with MR slowness
Pipelining, caching, etc. (see the registration-cache sketch below)
Dangerous tricks to catch free, munmap, sbrk

Notify when virt. / phys. mapping changes
Pete Wyckoff proposed a way years ago
Other methods have also been discussed
Make MR / MD faster and better
Can MR / registration cache be hidden?
Can MR go away? (and still keep high perf.)
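A minimal illustration of the registration-cache pattern referenced above, in C. This is our own sketch, not Open MPI's actual code: the cache structure, function names, and the toy linked list are invented for illustration, error checks are omitted, and the free()/munmap()/sbrk() intercepts that real implementations need (the "dangerous tricks") are only described in a comment.

/* Toy registration cache: ibv_reg_mr() is expensive, so middleware caches
 * MRs keyed by address range and reuses them on later sends/receives. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct mr_cache_entry {
    void                  *addr;
    size_t                 len;
    struct ibv_mr         *mr;
    struct mr_cache_entry *next;
};

static struct mr_cache_entry *cache_head;   /* toy linked-list cache */

/* Return an MR covering [addr, addr+len), registering only on a miss. */
struct ibv_mr *get_mr(struct ibv_pd *pd, void *addr, size_t len)
{
    for (struct mr_cache_entry *e = cache_head; e; e = e->next) {
        if (addr >= e->addr &&
            (char *)addr + len <= (char *)e->addr + e->len)
            return e->mr;                    /* cache hit: no syscall */
    }

    /* Cache miss: the expensive path (pins pages, programs the HCA). */
    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return NULL;

    struct mr_cache_entry *e = malloc(sizeof(*e));
    e->addr = addr; e->len = len; e->mr = mr; e->next = cache_head;
    cache_head = e;
    return mr;
    /* Danger: if the app later free()s or munmap()s this buffer, the cached
     * MR silently points at stale pages.  Hence the intercepts of free,
     * munmap, and sbrk that the slide calls "dangerous tricks". */
}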
 
2. fork() Support
 
Still problematic
(Partial) pages registered in the parent are not registered in the child
Unfortunately, many user codes call fork()
Child dup's open device fd's upon fork()
Parent exits, connections should close
…but they don't, because the device is still open

Child memory after fork() should behave exactly as it does without registered memory (see the ibv_fork_init() sketch below)
Device fd's should be set to close-on-exec
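For reference, a minimal sketch of the partial mitigation libibverbs already offers: calling ibv_fork_init() before any registration (some releases also accept an equivalent environment variable) marks pinned pages so they are not shared with the child on fork(). This keeps the parent's RDMA traffic correct but still leaves the child unable to use those regions, which falls short of the "behaves exactly as without registered memory" request above.

/* Sketch only: enable fork()-safer registration before any ibv_reg_mr(). */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    /* Must run before the first memory registration in the process. */
    if (ibv_fork_init() != 0) {
        perror("ibv_fork_init");
        return 1;
    }
    /* ... open device, allocate PD, register memory; fork() is now
     *     survivable for the parent (the child must not touch registered
     *     buffers). */
    printf("fork()-safer registration enabled\n");
    return 0;
}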
 
3. Connection Setup Scalability
 
"All-to-all" RC connections at scale
Current in-band mechanisms do not scale
IB SM: cannot handle N² path record lookups
iWARP: preventing ARP floods requires extra setup
Requires out-of-band MPI info exchange (see the sketch below)

Need in-band, scalable CM
Should work out of the box
Should not require workarounds: dumping SM tables to files, preloading ARP caches, etc.
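An illustrative sketch of the out-of-band exchange that MPIs are forced into today: each rank gathers every peer's queue-pair endpoint info over MPI's own startup channel instead of querying the SM in-band. The struct layout and helper name are ours, not any particular MPI's, and error handling is omitted.

/* Out-of-band RC endpoint exchange (the workaround a scalable CM would remove). */
#include <mpi.h>
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

struct endpoint_info {
    uint16_t lid;   /* local IB LID from ibv_query_port() */
    uint32_t qpn;   /* RC QP number from ibv_create_qp()  */
    uint32_t psn;   /* starting packet sequence number    */
};

/* Gather every peer's endpoint info; the caller later moves each RC QP
 * through INIT -> RTR -> RTS using the peer's (lid, qpn, psn). */
struct endpoint_info *exchange_endpoints(const struct endpoint_info *mine,
                                         MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    struct endpoint_info *all = malloc(nprocs * sizeof(*all));
    /* O(N) memory per rank plus an all-to-all startup pattern: exactly the
     * scalability problem that an in-band, scalable CM should eliminate. */
    MPI_Allgather((void *)mine, sizeof(*mine), MPI_BYTE,
                  all, sizeof(*all), MPI_BYTE, comm);
    return all;
}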
 
4. Relaxed Ordering API Support
 
Platforms now use relaxed PCI ordering for memory bandwidth optimization
Precludes MPI's "eager RDMA" optimization
Can't poll memory to know when a transfer is complete
30%+ latency penalty for falling back to send/receive

Simple verbs API change: specify strict vs. relaxed ordering for individual memory registrations (see the sketch below)
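A sketch of what the requested per-registration knob might look like. The IBV_ACCESS_RELAXED_ORDERING flag used below did not exist in the 2009 verbs API; newer rdma-core releases later added a flag along these lines, so treat this as illustrative of the requested API change rather than something available at the time.

/* Per-registration ordering choice (illustrative; flag requires a newer
 * verbs stack than the 2009-era one discussed above). */
#include <infiniband/verbs.h>
#include <stddef.h>

/* Buffers that MPI polls for "eager RDMA" completion need strict ordering;
 * bulk-transfer buffers can trade ordering for PCI bandwidth. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, void *addr, size_t len,
                               int polled_for_completion)
{
    unsigned int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;

    if (!polled_for_completion)
        access |= IBV_ACCESS_RELAXED_ORDERING;   /* bandwidth-friendly path */

    return ibv_reg_mr(pd, addr, len, access);
}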
 
5. Stack / API Portability
 
Windows API is very different from the Linux API
Solaris API is close to the Linux API
OMPI uses side effects to determine portability
Standardize one API for all platforms / OSs
Standardize a way to add per-platform extensions
Keep ABI changes slow and well-announced
Innovation is good, but ISVs need time to react
MPI needs to be able to adapt at run time
 
6. Reliable Connectionless
 
Connection-oriented does not scale: N² use of resources
XRC helps, but is quite complex
Datagram support in general is lacking (see the UD sketch below)
Mellanox HCAs need 2 UD QPs to get full BW

Need RD (or something like it)
Must still maintain RC-like high performance
Mix of hardware + middleware might work (MX)
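To make the scaling argument concrete, a sketch of the connectionless (UD) resource model: one QP serves every peer, and per-peer state shrinks to a lightweight address handle instead of a full RC QP with its own buffers. Constants are arbitrary and error handling is omitted; what is missing, and what the slide asks for, is a reliable version of this with RC-like performance, since UD is unreliable and limited to MTU-sized messages.

/* One UD QP for all peers vs. one RC QP per peer. */
#include <infiniband/verbs.h>
#include <stdint.h>

struct ibv_qp *create_ud_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_UD,                 /* one QP, every peer */
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    return ibv_create_qp(pd, &attr);
}

/* Per-peer state is just an address handle, not a whole connection. */
struct ibv_ah *peer_handle(struct ibv_pd *pd, uint16_t peer_lid, uint8_t port)
{
    struct ibv_ah_attr ah = { .dlid = peer_lid, .port_num = port };
    return ibv_create_ah(pd, &ah);
}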
 
7. S/R Registered Memory Utilization
 
S/R receive buffer utilization can be poor
Hard to balance resource consumption (memory) between MPI and the application
Frequent app complaint: "MPI takes too much memory"
Fixed-size receive buffers ignore the actual incoming message size

Post a "slab" of memory for receives (see the SRQ sketch below)
Pack received messages more efficiently
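A sketch of the common "slab" workaround using a shared receive queue (SRQ): one registration and one pool of fixed-size buffers shared by all peers instead of per-connection buffers. Buffer sizes and counts are arbitrary and error checks are omitted; note that a small incoming message still consumes a whole fixed-size buffer, which is exactly the utilization complaint above.

/* Carve fixed-size receive buffers out of one registered slab and post them
 * to an SRQ shared across all peers. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

#define RECV_BUF_SIZE  (8 * 1024)   /* every posted buffer is this big */
#define NUM_RECV_BUFS  256

struct ibv_srq *post_receive_slab(struct ibv_pd *pd, struct ibv_mr **mr_out)
{
    /* One registration for the whole slab, not one per buffer. */
    void *slab = malloc((size_t)RECV_BUF_SIZE * NUM_RECV_BUFS);
    struct ibv_mr *mr = ibv_reg_mr(pd, slab,
                                   (size_t)RECV_BUF_SIZE * NUM_RECV_BUFS,
                                   IBV_ACCESS_LOCAL_WRITE);

    struct ibv_srq_init_attr attr = {
        .attr = { .max_wr = NUM_RECV_BUFS, .max_sge = 1 },
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &attr);

    for (int i = 0; i < NUM_RECV_BUFS; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)slab + (size_t)i * RECV_BUF_SIZE,
            .length = RECV_BUF_SIZE,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr  = { .wr_id = i, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;
        ibv_post_srq_recv(srq, &wr, &bad);
    }
    *mr_out = mr;
    return srq;
}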
 
8. CM’s (Far) Too Complex
 
Effectively requires a progress thread for
incoming connections
Or MPI implements all the timeout/retry code
Significant ULP complexity required
OMPI: 2,800 LOC just for RDMA CM
OMPI: 2,300 LOC for 
all of MX
 Want a
 higher-level API
Middleware, kernel – doesn
t matter
Simple non-blocking connect and accept
Handle all connection progression
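To show where the complexity comes from, a heavily condensed sketch of the client side of an RDMA CM connect. These are real librdmacm calls, but QP creation, error handling, and all timeout/retry logic have been stripped out; each step also blocks on the event channel here, which is exactly why MPIs end up with a progress thread or thousands of lines of state machine. A higher-level connect/accept API would hide all of this.

/* Minimal (blocking, unchecked) RDMA CM client connect sequence. */
#include <rdma/rdma_cma.h>
#include <string.h>

int cm_connect(struct sockaddr *dst, struct rdma_cm_id **out_id)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;

    rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);

    rdma_resolve_addr(id, NULL, dst, 2000 /* ms */);
    rdma_get_cm_event(ch, &ev);              /* expect ADDR_RESOLVED */
    rdma_ack_cm_event(ev);

    rdma_resolve_route(id, 2000 /* ms */);
    rdma_get_cm_event(ch, &ev);              /* expect ROUTE_RESOLVED */
    rdma_ack_cm_event(ev);

    /* ... create and configure the QP here before connecting ... */

    struct rdma_conn_param param;
    memset(&param, 0, sizeof(param));
    param.retry_count = 7;
    rdma_connect(id, &param);
    rdma_get_cm_event(ch, &ev);              /* expect ESTABLISHED */
    rdma_ack_cm_event(ev);

    *out_id = id;
    return 0;
}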
 
2010 OF Sonoma Workshop MPI Panel
Cisco + Microsoft presentation
 
Panel: how to take verbs to exascale
“Updates on MPI’s and Exascale”
 
Fab Tillier (Microsoft) and I were chatting about what we were going to say
It turned out we were going to say the same things
So we combined them into a single presentation
 
MPI and the Exascale
 
Jeffrey M. Squyres, Cisco Systems
Fabian Tillier, Microsoft
 
16 March 2010
 
Our scope: MPI
 
Leave hardware, power, runtime systems, filesystems, etc. to others
We're MPI + software wonks

Assume: we know little about what a lottaflops system will look like (yet)
So let's make further assumptions
If you're looking at these slides in 5 years, please try not to laugh if we ended up being dreadfully wrong
 
 
Assumptions
 
Assume: MPI will be used in some way
Otherwise this panel would be meaningless
Either:
Directly in applications, as today
Underlying transport for something else (PGAS, etc.)
MPI has spent 15+ years optimizing parallel communications; it would be silly to throw it away
Assume: the system will be a hierarchy of some kind
Memory, processors, network
MPI will need to understand the topology, particularly for collective operations (broadcast, reductions, etc.)
 
 
Assumptions
 
Assume: limited resources for each MPI process
Memory, network buffers, etc.
Cannot store O(N) information
Network may therefore need to be “smarter”
Assume: there will be failures
MPI needs to be at least as reliable as sockets
Meaning: the MPI implementation has to survive network failures
The MPI-3 standard effort is examining such issues
Including what this means for MPI applications
 
 
“Thin” MPI
 
Network must be reliable and connectionless
MPI should not handle tracking and retransmits
Runtime system must support MPI:
Locally tell MPI location/topology, peer, and network info
Route stdin/out/err, stage files, etc.
Some better systems today behave like this
Resources dedicated to MPI collective support
Maybe: network hardware, cores, memory, …?
Asynchronous progress seems critical
More than just multicast: think about MPI_Alltoall
 
 
Will it run OFED (verbs)?
 
We don’t know what the network hardware will be
If it runs verbs, there is much work to be done
Performant reliable connectionless will be necessary
Therefore: connection setup complexity disappears
Memory registration must disappear
Or get much better (no software reg cache, no dereg intercept)
Some form of MPI collective assist must be available
Learn from Intel, Cray, Quadrics, Voltaire, Mellanox, etc.
Export network / processor / memory / etc. topology info
Topology-aware algorithms will be critical
Separate header and data on completions
 



