Exploring Distributed Solvers for Scalable Computing in UG
This presentation discusses the use of distributed solvers in UG to enable multi-rank MPI-based solvers of varying sizes, addressing the need for scalable solver codes and dynamic resource allocation. It introduces the UG solver interface, revisits the Concorde solver for TSP problems, and explores running Concorde on HPC using Cray cluster-compatibility mode. The goal is to enhance performance and resource utilization for complex computational tasks.
Using distributed solvers in UG
Utz-Uwe Haus, Yuji Shinano
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 773897.
Vision
Enable UG to use multi-rank MPI-based solvers with varying size over the course of the run, possibly from a heterogeneous set of solvers.
Why?
- A single rank may simply not be enough.
- Scalable solver codes may realize that a subproblem needs more resources and want to be restarted at a different size.
- Not all subproblems readily decompose into new nodes for UG to collect.
- May help alleviate resource shortages (memory!) of a subsolver.
- Experiments with dynamic solver resource allocation.
- API design for the solver library used in
UG solver interface
- Traditionally, every MPI-parallelized solver subclasses class UG::ParaCommMpi.
- Alternative: class UGS::UgsParaCommMpi
  - Encapsulates the world, solver, and master+solver communicators.
  - Available as a feature in UG 0.8.6; set UGS=true in the Makefile.
  - Used in the MPMD multi-solver setup.
  - Used in the parallel heuristic code ugs_pacs_xpress.
  - Can even run UG solvers as sub-solvers.
- Other solvers currently do not make use of multiple ranks per solver; a sketch of the intended communicator split follows below.
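The communicator split described above can be illustrated with plain MPI. Below is a minimal sketch, assuming the UGS layer hands the solver a solver-internal communicator and a master+solver communicator; the function name and the message exchanged are hypothetical illustrations, not the actual UG API.

```cpp
#include <mpi.h>

// Hypothetical entry point: a UGS-style setup is assumed to provide two
// communicators -- one spanning the solver's own ranks and one spanning
// the master plus the solver's ranks.
void run_distributed_solver(MPI_Comm solverComm, MPI_Comm masterSolverComm)
{
    int rank, size;
    MPI_Comm_rank(solverComm, &rank);
    MPI_Comm_size(solverComm, &size);

    // ... solver-internal parallel work over solverComm ...

    if (rank == 0) {
        // Only the solver's root rank talks to the UG master, assumed to
        // be rank 0 of the master+solver communicator.
        double bestBound = 0.0;  // placeholder result
        MPI_Send(&bestBound, 1, MPI_DOUBLE, /*master*/ 0, /*tag*/ 1,
                 masterSolverComm);
    }
}
```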
Revisiting an old, new solver: Concorde
- Still one of the best TSP codes: D. Applegate, R. Bixby, V. Chvátal, and W. Cook, http://www.math.uwaterloo.ca/tsp/index.html
- Written in C; pretty clean code base.
- The LP solver can be CPLEX or QSopt.
- Needs good starting solutions for harder problems.
- Column generation relies on never generating more than 2^31 columns.
- Can be built in a TCP-socket-based multi-node version, but it is hard to use and not documented.
Concorde on HPC
Using Cray cluster-compatibility mode:
- A Cray module that permits an allocation of nodes to behave like a cluster.
- Each process ("boss", "grunt", ...) needs to be started individually.
- No way we could reasonably script this to become a UG solver class.
Directly after porting to MPI:
- Simple MPI-rank-based process partitioning (see the sketch below):
  - Dedicate rank 0 to the boss process.
  - Dedicate the remaining ranks to grunt (= worker) processes.
  - Could also dedicate cut-server or heuristic ranks.
- Replace TCP communication by MPI rank-to-rank Send/Recv.
- Low communication demands (size and frequency).
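A minimal sketch of this boss/grunt partitioning in plain MPI (not Concorde's actual code; the task and result messages are placeholders for the real protocol):

```cpp
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Boss: hand out work, collect results (placeholder protocol).
        for (int grunt = 1; grunt < size; ++grunt) {
            int task = grunt;  // hypothetical task id
            MPI_Send(&task, 1, MPI_INT, grunt, /*tag*/ 0, MPI_COMM_WORLD);
        }
        for (int grunt = 1; grunt < size; ++grunt) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, grunt, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    } else {
        // Grunt: receive a task, work on it, send a result back.
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = task;  // placeholder for the actual TSP work
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```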
UG[Concorde/MPI,MPI]
- The UG solver talks to rank 0 (the "boss") of Concorde/MPI.
- The random seed has a large influence on Concorde runs; multiple different seeds from UG are a good heuristic to get started.
- Concorde starts with column generation on rank 0 and only uses the additional ranks when doing B&B.
- We currently have no good way of guessing the right process size a priori:
  - Need to implement a feedback mechanism to UG to ask for "restart this subproblem with more ranks" for the B&B phase (sketched below).
  - This needs to be weighed against the option of simply passing all subproblems back to UG and running single-rank Concorde.
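A sketch of what such a feedback message could look like, assuming a simple point-to-point protocol between the Concorde boss rank and the UG master; the tag and the size-doubling heuristic are hypothetical, since the presentation only states that this mechanism still needs to be implemented.

```cpp
#include <mpi.h>

const int TAG_NEED_MORE_RANKS = 42;  // hypothetical message tag

// Hypothetical feedback call issued by the solver's boss rank: tell the UG
// master that this subproblem should be restarted with a larger rank count.
void request_restart(MPI_Comm masterComm, int currentSize)
{
    int requestedSize = 2 * currentSize;  // simple doubling heuristic (assumption)
    MPI_Send(&requestedSize, 1, MPI_INT, /*master rank*/ 0,
             TAG_NEED_MORE_RANKS, masterComm);
    // The master can then tear down this solver instance and re-launch the
    // subproblem on a larger rank set once one becomes free.
}
```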
State and Goal
Goal:
- UG orchestrating multiple Concorde/MPI instances, cut servers, and heuristic instances.
- Starting small runs, collecting nodes.
- Handling "could use more ranks" messages, dynamically re-running such nodes.
- Data exchanged using a universal data exchange library, not files or explicit MPI:
  - UDJ, a CERL project, developed within
Today:
- Can run single-rank Concorde inside UG.
  - Files with node and cut information are used to transfer subproblem information.
  - Concurrency issues, file naming issues, IO bottleneck.
- Can run Concorde/MPI with hundreds of ranks.
  - Concorde is slow to generate subproblems; TSPLIB problems are mostly solved in the root.
  - Integration ongoing.
The dynamic MPI worker resource allocation problem
[Diagram: before — master rank plus idle ranks; after — master rank, Solver 1, Solver 2, and the remaining idle ranks.]
The MPI worker resource allocation problem
[Diagram: master rank, Solver 1, Solver 2, and idle ranks; after Solver 2 terminates and Solver 3 is started, the free ranks are fragmented: master, Solver 3, idle ranks, Solver 1, idle ranks.]
Issues:
- Useful size estimation.
- Fragmentation of free ranks; luckily MPI does not care, but rank placement influences performance.
A bookkeeping sketch for the master's free-rank pool follows below.
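A minimal bookkeeping sketch for the master's pool of free ranks, assuming rank 0 is the master and contiguous rank blocks are preferred for placement; this is an illustration of the fragmentation issue, not UG code.

```cpp
#include <iterator>
#include <optional>
#include <set>
#include <vector>

class RankPool {
public:
    explicit RankPool(int worldSize) {
        for (int r = 1; r < worldSize; ++r)  // rank 0 is the master
            free_.insert(r);
    }

    // Try to hand out n ranks, preferring a contiguous run.
    std::optional<std::vector<int>> allocate(int n) {
        if ((int)free_.size() < n) return std::nullopt;
        std::vector<int> run;
        for (int r : free_) {                 // look for a contiguous run
            if (!run.empty() && r != run.back() + 1) run.clear();
            run.push_back(r);
            if ((int)run.size() == n) break;
        }
        if ((int)run.size() < n)              // fragmented: take any free ranks
            run.assign(free_.begin(), std::next(free_.begin(), n));
        for (int r : run) free_.erase(r);
        return run;
    }

    // Return the ranks of a finished solver to the pool.
    void release(const std::vector<int>& ranks) {
        free_.insert(ranks.begin(), ranks.end());
    }

private:
    std::set<int> free_;  // ordered, so contiguity is easy to check
};
```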
The gory parts: Dynamic MPI groups
Most people use MPI in one of the following modes:
- MPI_COMM_WORLD
  - MPI provides a way for all ranks to communicate.
  - Master (rank 0) / worker (rank != 0) paradigm, like in UG[*,MPI].
  - Distributed data, where the rank number carries index information.
- MPI_Comm_split()
  - The application splits MPI_COMM_WORLD at program start to create multiple (independent) multi-rank applications.
  - Each group is used like above.
  - Groups can talk to each other through the original MPI_COMM_WORLD or (less commonly) through MPI-DPM.
- MPI_Group_incl() / MPI_Comm_create()
  - The application creates groups of logically related ranks; one rank may be in multiple groups.
  - E.g., distributed tensorial data with groups for slices sharing one or more index dimensions.
A minimal MPI_Comm_split() example follows below.
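A minimal MPI_Comm_split() example (generic MPI, not UG code) that splits MPI_COMM_WORLD into a master color and a single solver color; in the UG[*,MPI] setting the coloring would of course be richer.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int worldRank;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    // Color 0: the master; color 1: all solver ranks (assumption: one solver).
    int color = (worldRank == 0) ? 0 : 1;
    MPI_Comm subComm;
    MPI_Comm_split(MPI_COMM_WORLD, color, /*key*/ worldRank, &subComm);

    int subRank, subSize;
    MPI_Comm_rank(subComm, &subRank);
    MPI_Comm_size(subComm, &subSize);
    std::printf("world rank %d -> color %d, sub rank %d of %d\n",
                worldRank, color, subRank, subSize);

    MPI_Comm_free(&subComm);
    MPI_Finalize();
    return 0;
}
```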
The gory parts: Dynamic MPI groups (2)
Collective vs. non-collective communicator creation:
- MPI_Comm_split() is collective over the parent communicator.
- MPI_Comm_create_group() is collective only over the group.
  - The group can be created using MPI_Group_incl(), which is a local operation on each rank.
So the protocol can be (see the sketch below):
1. Master rank 0 selects the worker rank set S.
2. Master sends the ranks in S a form_group(S) message.
3. Each rank in S does MPI_Group_incl(WORLD_GROUP, S, &G(S)).
4. Master and each rank in S do MPI_Group_incl(WORLD_GROUP, S ∪ {0}, &M(S)).
5. Each rank in S does MPI_Comm_create_group() to obtain the C(S) communicator.
6. Master and each rank in S do MPI_Comm_create_group() to obtain the C(S ∪ {0}) communicator.
- The C(S) communicator is used for the distributed solver.
- The C(S ∪ {0}) communicator is used for master-worker communication; possibly a dedicated rank or set of ranks in S does the communication with the master rank.
- When the distributed solver is done, C(S), C(S ∪ {0}), G(S), and M(S) are destroyed.
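The protocol above can be written down with standard MPI-3 calls. The sketch below is not the actual UG implementation: it assumes rank 0 is the master, that 0 is not in S, and that the form_group(S) message has already been delivered by some means.

```cpp
#include <mpi.h>
#include <vector>

// Called on the master (rank 0) and on every rank in S after the
// form_group(S) message has been received.
void form_solver_comms(const std::vector<int>& S, int worldRank,
                       MPI_Comm& solverComm,        // C(S), valid only on ranks in S
                       MPI_Comm& masterSolverComm)  // C(S ∪ {0})
{
    MPI_Group worldGroup;
    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);

    // G(S): the worker-only group (local operation on each rank).
    MPI_Group gS;
    MPI_Group_incl(worldGroup, (int)S.size(), S.data(), &gS);

    // M(S): the group S ∪ {0}, with the master prepended as rank 0.
    std::vector<int> sPlusMaster;
    sPlusMaster.push_back(0);
    sPlusMaster.insert(sPlusMaster.end(), S.begin(), S.end());
    MPI_Group mS;
    MPI_Group_incl(worldGroup, (int)sPlusMaster.size(), sPlusMaster.data(), &mS);

    solverComm = MPI_COMM_NULL;
    bool inS = false;
    for (int r : S) if (r == worldRank) inS = true;

    // MPI_Comm_create_group() is collective only over the respective group.
    if (inS)
        MPI_Comm_create_group(MPI_COMM_WORLD, gS, /*tag*/ 1, &solverComm);
    MPI_Comm_create_group(MPI_COMM_WORLD, mS, /*tag*/ 2, &masterSolverComm);

    MPI_Group_free(&gS);
    MPI_Group_free(&mS);
    MPI_Group_free(&worldGroup);
}
```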
Data transfer between subsolvers
Currently file-based.
Summary
- UG can already pass an MPI communicator to your solver (UGS superclass).
- We'll work on a simplistic resource (= rank) allocation toolset.
- There's a nice optimization problem hiding here:
  - K-machine job scheduling, plus communication-aware rank subset selection.
  - Depends on having a communication model for the solver.
  - Solver scaling choices: more smaller solvers or fewer bigger ones?
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 773897.
Utz-Uwe Haus, uhaus@cray.com