Optimal Round and Sample-Size Complexity for Partitioning in Parallel Sorting
This paper explores the optimal round and sample-size complexity of partitioning in parallel sorting, discussing parallel partitioning approaches such as sampling and histogramming. It presents a model in which processors communicate a bounded number of keys per round, highlighting the trade-off between the number of rounds and the communication volume per round. The paper gives an improved analysis of an existing algorithm, showing O(log* p) rounds with O(p/log* p) communication per round, improving on previous analyses. Additionally, it establishes a matching lower bound for algorithms in this setting.
- Parallel Sorting
- Optimal Complexity
- Partitioning Approaches
- Algorithm Analysis
- Communication Trade-off
Presentation Transcript
Optimal Round and Sample-Size Complexity for Partitioning in Parallel Sorting. Wentao Yang*, Vipul Harsh*, Edgar Solomonik. University of Illinois at Urbana-Champaign (UIUC). *: equal contribution
Parallel partitioning: subroutine in parallel sorting. Input: p processors, n/p keys per processor. Goal: find (p-1) partition keys (or splitters) s_1, ..., s_{p-1} such that at most n(1+ε)/p input keys fall in each interval [s_i, s_{i+1}). [Figure: keys distributed across processors P1, P2, P3; in the global order (all keys sorted), each splitter interval contains at most n(1+ε)/p keys]
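To make the goal concrete, here is a minimal sketch (in Python, with illustrative names not taken from the paper) that checks whether a candidate splitter set satisfies the n(1+ε)/p balance condition:

```python
from bisect import bisect_left

def is_valid_partition(keys, splitters, eps):
    """Check that each splitter interval holds at most n*(1+eps)/p keys.

    keys: all input keys (conceptually the union over all processors)
    splitters: the p-1 candidate splitter keys, in sorted order
    """
    n = len(keys)
    p = len(splitters) + 1
    limit = n * (1 + eps) / p
    keys = sorted(keys)
    # Boundaries of the p intervals induced by the splitters.
    bounds = [0] + [bisect_left(keys, s) for s in splitters] + [n]
    return all(hi - lo <= limit for lo, hi in zip(bounds, bounds[1:]))
```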
Parallel partitioning approaches: sampling. Each processor collects a local sample; the samples are combined and sorted, and the (p-1) equally spaced keys S_1, S_2, ..., S_{p-1} of the combined sorted sample are taken as splitters. Sample complexity: O(p log p / ε²). [1] A sampling approach to minimal storage tree sorting, JACM, 1970
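A hedged sketch of this splitter-selection step, simulated sequentially; the oversampling parameter and function names are assumptions for illustration:

```python
import random

def sample_splitters(local_inputs, p, oversample):
    """Pick p-1 splitters from a combined random sample.

    local_inputs: list of p key lists, one per (simulated) processor
    oversample: local sample size taken at each processor
    """
    combined = []
    for keys in local_inputs:
        combined.extend(random.sample(keys, min(oversample, len(keys))))
    combined.sort()
    # Equally spaced keys of the combined sorted sample become splitters.
    step = len(combined) // p
    return [combined[(i + 1) * step] for i in range(p - 1)]
```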
Parallel partitioning approaches: histogramming. Each processor computes the ranks of the S sampled keys in its local input; a reduction then yields the global ranks of all S keys (e.g., a combined sample attaining global ranks 3, 23, 37, 43, 77, 99, 123). Multiple rounds of sampling + histogramming are run if necessary. Used by Histogram sort [2], HSS [3], AMS-sort [4], HykSort [5]. [2] A comparison based parallel sorting algorithm. ICPP, 1993. [3] Histogram Sort with Sampling, SPAA, 2019. [4] Practical massively parallel sorting. SPAA, 2015. [5] HykSort: A new variant of hypercube quicksort on distributed memory architectures. ICS, 2013
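A minimal sketch of the histogramming step, again simulated sequentially: each processor ranks the sampled keys in its local input, and summing those local ranks plays the role of the global reduction (names are illustrative):

```python
from bisect import bisect_left

def local_ranks(local_keys, sample):
    """Rank of each sampled key within one processor's local input."""
    local_keys = sorted(local_keys)
    return [bisect_left(local_keys, s) for s in sample]

def global_ranks(all_local_inputs, sample):
    """Sum local ranks across processors -- the global reduction."""
    ranks = [0] * len(sample)
    for keys in all_local_inputs:
        for i, r in enumerate(local_ranks(keys, sample)):
            ranks[i] += r
    return ranks
```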
Parallel partitioning: model. In one round, processors communicate O(k) keys in total, after which all processors perform some local computation. Many algorithms belong to this category: sample sort, histogram sort, HSS, AMS-sort, HykSort. The questions: how many rounds, and how much communication per round (k)? [Figure: processors P1, P2, ..., Pp exchanging a total of O(k) keys per round]
This paper: parallel partitioning with optimal rounds/communication. Upper bound proof via a concrete algorithm (a minor variation of HSS) with an improved analysis: O(log* p) rounds, O(p/log* p) communication per round. Previous HSS analysis: O(log log p) rounds, O(p) communication per round. Lower bound proof: any algorithm with O(p) communication per round requires Ω(log* p) rounds.
#1. Upper bound: algorithm (HSS) key idea. 1. Sampling: sample a small number of keys (O(p/log* p)). 2. Histogramming: determine the rank of all sampled keys via a global reduction; with a pipelined reduction this has the same complexity as sampling, O(S log N), S = sample size. 3. If the sample contains satisfactory splitters, return; else, sample the next set of keys and loop to (2). Histogram Sort with Sampling (HSS), SPAA, 2019
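Below is a hedged, sequential sketch of this sampling-histogramming loop; the acceptance test (sampled rank within a tolerance of the ideal splitter rank) and all names are simplifications for illustration, not the paper's implementation:

```python
import random
from bisect import bisect_left

def hss_partition(keys, p, eps, sample_size):
    """Iterate sampling + histogramming until all splitters are found.

    keys: the global input, standing in for the distributed data
    Returns p-1 splitters; each interval then holds <= n*(1+eps)/p keys.
    """
    n = len(keys)
    keys_sorted = sorted(keys)                     # rank oracle for this sketch
    targets = [i * n // p for i in range(1, p)]    # ideal splitter ranks
    tol = eps * n / (2 * p)                        # acceptable rank deviation
    splitters = [None] * (p - 1)
    while any(s is None for s in splitters):
        sample = random.sample(keys, sample_size)
        # Histogramming: global rank of every sampled key.
        ranks = [bisect_left(keys_sorted, s) for s in sample]
        for s, r in zip(sample, ranks):
            for i, t in enumerate(targets):
                if splitters[i] is None and abs(r - t) <= tol:
                    splitters[i] = s
    return splitters
```

For example, `hss_partition(list(range(1000)), p=8, eps=1.0, sample_size=64)` returns 7 splitters; since each accepted rank is within eps*n/(2p) of its target, adjacent intervals span at most n(1+eps)/p keys.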
#1. Upper bound: HSS illustration. [Figure: ideal splitters at ranks n/p, 2n/p, 3n/p, 4n/p, 5n/p; after the first round of sampling/histogramming each splitter is confined to an interval, and after the second round the splitter intervals shrink further]
#1. Upper bound: improved analysis. Focus on ε = 1. Divide the sampling-histogramming rounds into 2 phases. Phase 1: while > p/log* p splitters remain to be determined, > p/(3 log* p) new splitters are determined in every round. Phase 2: once <= p/log* p splitters remain, the number of undetermined splitters decreases exponentially every round. Overall, both phases end in O(log* p) rounds with O(p/log* p) communication per round.
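The Phase 1 round count is just counting; as a sketch (only the arithmetic, not the probabilistic argument that each round indeed fixes that many splitters):

```latex
% Phase 1: each round determines more than p/(3 log* p) new
% splitters, and there are fewer than p splitters in total, so
\[
  r \cdot \frac{p}{3\log^* p} < p
  \quad\Longrightarrow\quad
  r < 3\log^* p = O(\log^* p)
\]
% rounds suffice for Phase 1.
```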
#2. Lower bound for any comparison-based algorithm. Model: in one round, processors communicate O(p) keys in total, after which all processors perform some local computation. Many algorithms belong to this category, e.g. sample sort, histogram sort, HSS, AMS-sort, HykSort. [Figure: processors P1, P2, ..., Pp exchanging a total of O(p) keys per round]
#2. Lower bound: proof outline. Hardness for any randomized algorithm on a worst-case input follows, via Yao's principle, from hardness for any deterministic algorithm on some randomized input distribution. We design an input distribution that we prove is hard for any deterministic algorithm.
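For reference, a standard minimax statement of Yao's principle (this common textbook form is an assumption about which variant the paper invokes): the expected cost of the best deterministic algorithm on a chosen input distribution lower-bounds the worst-case expected cost of every randomized algorithm.

```latex
% D: chosen input distribution, A: deterministic algorithm,
% R: randomized algorithm, x: input, c: cost measure.
\[
  \min_{A}\, \mathbb{E}_{x \sim D}\bigl[c(A, x)\bigr]
  \;\le\;
  \max_{x}\, \mathbb{E}\bigl[c(R, x)\bigr]
  \qquad \text{for every randomized algorithm } R.
\]
```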
#2. Lower bound: a hard input distribution for any deterministic algorithm. Global order: keys 1..N (in sequence), distributed among the p processors in some randomized way. The hardness result doesn't hold for a uniformly random distribution, so the keys must be distributed more carefully.
#2. Lower bound: a hard input distribution for any deterministic algorithm. Global order: keys 1..N (in sequence), divided into p parts, each part divided into p subintervals. The subintervals of each part are permuted randomly among the processors. If the algorithm samples r subintervals within a part, those samples are effectively random: a processor can't tell which subinterval of a part it received. [Figure: the subintervals of one part assigned to processors P1, ..., P8 under a random permutation]
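A hedged sketch of generating such an input; the parameterization is a simplified reading of the slide, not the paper's exact construction:

```python
import random

def hard_input(p, keys_per_subinterval):
    """Keys 1..N in global order, split into p parts of p subintervals;
    each part's subintervals are randomly permuted among the p processors."""
    m = keys_per_subinterval
    n_per_part = p * m
    processors = [[] for _ in range(p)]
    for part in range(p):
        base = part * n_per_part
        subintervals = [
            list(range(base + j * m + 1, base + (j + 1) * m + 1))
            for j in range(p)
        ]
        random.shuffle(subintervals)   # random assignment within the part
        for proc, sub in zip(processors, subintervals):
            proc.extend(sub)
    return processors
```

Each processor ends up with one subinterval per part (N/p keys in total), but cannot tell which subinterval of each part it holds.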
#2. Lower bound: keep track of unsampled subintervals. [Figure: a part divided into p subintervals, with the unsampled subintervals highlighted]
#2. Lower bound: keep track of unsampled subintervals. After a round of sampling + histogramming with p keys, mark each subinterval of a part as sampled or unsampled. A success run sequence is a sequence of unsampled subintervals in the same part that contains > 1 unsampled blocks. If round i leaves X_i such sequences, each of Y_i unsampled subintervals, then round (i+1) leaves at least X_{i+1} sequences, each of Y_{i+1} unsampled subintervals; expressions for X_{i+1}, Y_{i+1} are derived using the distribution theory of runs [4]. [Figure: 0/1 rows marking sampled subintervals per part across rounds] [4] Alexander M. Mood. The distribution theory of runs. The Annals of Mathematical Statistics, 1940
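As a small aid to the definition (not from the paper), a helper that extracts maximal runs of unsampled subintervals from one part's 0/1 sequence:

```python
def success_runs(bits, min_len=2):
    """Lengths of maximal runs of 0s (unsampled) of length >= min_len."""
    runs, current = [], 0
    for b in bits:
        if b == 0:
            current += 1
        else:
            if current >= min_len:
                runs.append(current)
            current = 0
    if current >= min_len:
        runs.append(current)
    return runs

# Example: 0 = unsampled, 1 = sampled, as in the slide's bit rows.
print(success_runs([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]))  # [7, 2]
```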
What the picture looks like: [Figure: trade-off between communication per round and number of rounds, with the boundary given by the optimal communication for 1 round and the optimal number of rounds for O(p) communication per round. No algorithm can lie strictly inside the shaded region or on its boundary]
Conclusion: optimal round and communication complexity for parallel partitioning. Upper bound: improved analysis of HSS with O(log* p) rounds and O(p/log* p) communication per round (previously: O(log log p) rounds, O(p) communication per round). Lower bound: any algorithm with O(p) communication per round requires Ω(log* p) rounds, and any 1-round algorithm must have Ω(p log p) communication, which proves that sample sort, AMS-sort, and 1-round HSS are all optimal 1-round algorithms. Thank you!
[Backup figure: the global distribution is divided into splitter intervals and subintervals; contiguous sequences of keys map to the local distributions at processors i and j]