Optimal Round and Sample-Size Complexity for Partitioning in Parallel Sorting

Wentao Yang*, Vipul Harsh*, Edgar Solomonik
University of Illinois at Urbana-Champaign (UIUC)
*: equal contribution
Parallel partitioning: subroutine in parallel sorting

Input:
- p processors
- n/p keys per processor

Goal: find (p-1) partition keys (or splitters) s_1, ..., s_{p-1} such that at most n(1+ε)/p input keys fall in each interval [s_i, s_{i+1}).

[Figure: the keys of processors P_1, P_2, P_3 shown in global sorted order; splitters s_1, s_2 cut the order into per-processor ranges of at most n(1+ε)/p keys each.]
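To make the goal concrete, here is a minimal single-process sketch (a hypothetical helper, not from the paper) that checks whether a given set of splitters satisfies the (1+ε)-balance condition:

```python
import bisect

def splitters_are_balanced(keys, splitters, eps):
    """Check that at most n*(1+eps)/p keys fall in each splitter interval.

    keys: all n input keys (any order); splitters: p-1 keys in increasing
    order. Buckets are (-inf, s_1), [s_1, s_2), ..., [s_{p-1}, +inf).
    """
    n, p = len(keys), len(splitters) + 1
    limit = n * (1 + eps) / p
    keys = sorted(keys)
    # Rank of each splitter = number of keys strictly below it.
    ranks = [bisect.bisect_left(keys, s) for s in splitters]
    bucket_sizes = [hi - lo for lo, hi in zip([0] + ranks, ranks + [n])]
    return all(size <= limit for size in bucket_sizes)
```

In a real parallel sorter this check is implicit: the splitters route keys to processors, so an imbalance shows up directly as an overloaded processor.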
Parallel partitioning approaches: sampling

- Each processor P_1, ..., P_p collects a sample of its local input.
- The local samples are combined into one sorted sample.
- The equally spaced (p-1) keys of the combined sorted sample become the splitters S_1, S_2, ..., S_{p-1}.
- Sample complexity: O(p log p / ε²) [1].

[1] A sampling approach to minimal storage tree sorting. JACM, 1970.
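A minimal sketch of the sampling approach, simulating the p processors with a list of local key lists (illustrative only; `sample_splitters` and its parameters are my naming, not the paper's):

```python
import random

def sample_splitters(local_inputs, total_sample_size):
    """Each 'processor' contributes a random local sample; the p-1
    equally spaced keys of the combined sorted sample are returned as
    splitters. A total sample of O(p log p / eps^2) keys suffices for
    (1+eps)-balanced partitions."""
    p = len(local_inputs)
    per_proc = max(1, total_sample_size // p)
    combined = []
    for keys in local_inputs:              # one iteration per processor
        combined.extend(random.sample(keys, min(per_proc, len(keys))))
    combined.sort()
    # Pick the p-1 equally spaced keys of the combined sorted sample.
    m = len(combined)
    return [combined[(i * m) // p] for i in range(1, p)]
```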
Parallel partitioning approaches: histogramming

- As before, processors P_1, ..., P_p collect local samples that are combined into one sorted sample of S keys.
- Histogramming: each processor computes the ranks of the S sample keys within its local input; a reduction then yields the global ranks of all S keys (e.g., a combined sample with ranks 3, 23, 37, 43, 77, 99, 123).
- Run multiple rounds of sampling + histogramming if necessary.
- Used by Histogram sort [2], HSS [3], AMS-sort [4], and HykSort [5].

[2] A comparison based parallel sorting algorithm. ICPP, 1993.
[3] Histogram Sort with Sampling. SPAA, 2019.
[4] Practical massively parallel sorting. SPAA, 2015.
[5] HykSort: a new variant of hypercube quicksort on distributed memory architectures. ICS, 2013.
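A single-process sketch of one histogramming step; the global reduction over processors is simulated here by summing local rank vectors (in a distributed setting this would be one reduction/allreduce over length-S vectors):

```python
import bisect

def global_ranks(local_inputs, sample_keys):
    """Each 'processor' computes local ranks of the S sorted sample keys
    by binary search over its sorted local input; summing the
    per-processor rank vectors (the reduction) gives the global ranks."""
    sample_keys = sorted(sample_keys)
    totals = [0] * len(sample_keys)
    for keys in local_inputs:              # one iteration per processor
        local_sorted = sorted(keys)
        for j, s in enumerate(sample_keys):
            totals[j] += bisect.bisect_left(local_sorted, s)
    return totals  # totals[j] = number of input keys smaller than sample_keys[j]
```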
Parallel partitioning: model

In one round:
- processors communicate O(k) keys in total
- all processors perform some local computation after this communication step

Many algorithms fit this model: Sample sort, Histogram sort, HSS, AMS-sort, HykSort. The two quantities of interest are the number of rounds and the communication per round (k).
This paper: parallel partitioning with optimal rounds/communication

Upper bound proof
- Concrete algorithm: a minor variation of HSS.
- Improved analysis: O(log* p) rounds, O(p / log* p) communication per round.
- Previous HSS analysis: O(log log p) rounds, O(p) communication per round.

Lower bound proof
- Any algorithm with O(p) communication per round requires Ω(log* p) rounds.
#1. Upper bound: algorithm (HSS) key idea

1. Sampling: sample a small number of keys (O(p / log* p)).
2. Histogramming: determine the rank of all sampled keys via a global reduction
   - same complexity as sampling, using a pipelined reduction: O(S log N), where S = sample size.
3. If the sample contains satisfactory splitters, return; else, sample the next set of keys and loop back to (2). A single-process sketch of this loop follows below.

Histogram Sort with Sampling (HSS). SPAA, 2019.
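A minimal single-process sketch of the loop, with a simplified stopping rule (accept a splitter once some sampled key lands within εn/p of its target rank); the resampling policy and data layout are stand-ins for illustration, not the paper's exact algorithm:

```python
import bisect, random

def hss_splitters(keys, p, eps, batch_size):
    """Iterate sampling + histogramming until each target rank i*n/p
    has a sampled key within eps*n/p of it, then return those keys."""
    keys = sorted(keys)        # stand-in for the distributed input
    n = len(keys)
    targets = [i * n // p for i in range(1, p)]
    sampled = {}               # candidate key -> global rank
    while True:
        # 1. Sampling: draw a fresh batch of candidate splitter keys.
        for s in random.sample(keys, batch_size):
            # 2. Histogramming: global rank (one reduction in parallel).
            sampled[s] = bisect.bisect_left(keys, s)
        # 3. For each target rank, find the closest sampled key so far.
        best = [min(sampled, key=lambda s: abs(sampled[s] - t)) for t in targets]
        if all(abs(sampled[s] - t) <= eps * n / p for s, t in zip(best, targets)):
            return best        # satisfactory splitters found
        # Otherwise loop: sample the next batch and histogram again.
```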
#1. Upper bound: HSS illustration

[Figure: ideal splitters sit at ranks n/p, 2n/p, 3n/p, 4n/p, 5n/p. The first round of sampling/histogramming narrows each splitter down to a splitter interval; the second round shrinks those intervals further.]
#1. Upper bound: improved analysis

Focus on ε = 1. Divide the sampling-histogramming rounds into two phases:

- Phase 1: while more than p / log* p splitters remain to be determined, more than p / (3 log* p) new splitters are determined in every round.
- Phase 2: once at most p / log* p splitters remain, the number of undetermined splitters decreases exponentially every round.

Overall, both phases end within O(log* p) rounds with O(p / log* p) communication per round.
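Phase 1's round count follows from the progress bound above with a one-line calculation (phase 2 needs the paper's full argument): there are at most p-1 splitters in total, and each phase-1 round determines more than p / (3 log* p) of them, so

```latex
\text{(rounds in phase 1)} \;\le\; \frac{p-1}{p / (3\log^* p)} \;<\; 3\log^* p \;=\; O(\log^* p).
```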
#2. Lower bound for any comparison-based algorithm

Same model with k = p: in one round, processors communicate O(p) keys in total, and all processors perform some local computation after this communication step. Many algorithms belong to this category, e.g. Sample sort, Histogram sort, HSS, AMS-sort, HykSort.
#2. Lower bound: proof outline

Prove hardness for any deterministic algorithm on some randomized input distribution; by Yao's principle, this yields hardness for any randomized algorithm on a worst-case input. We design an input distribution that we prove is hard for any deterministic algorithm.
#2. Lower bound: a hard input distribution for any deterministic algorithm

Global order: keys 1..N (in sequence), distributed among the p processors in some randomized way. Note that the hardness result does not hold for a uniformly random distribution, so the keys must be distributed more carefully.
#2. Lower bound: a hard input distribution for any deterministic algorithm

Global order: keys 1..N (in sequence), divided into p parts, where each part is divided into p subintervals. The subintervals of each part are permuted randomly among the processors P_1, ..., P_p (one subinterval per processor), so a processor cannot tell which subinterval of a part it received. Consequently, if the algorithm samples r subintervals within a part, those samples are effectively random.
 
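An illustrative generator for this input family, assuming N = p²·m keys so each of the p parts splits into p subintervals of m consecutive keys (a sketch of the construction as described above, not the paper's exact distribution):

```python
import random

def hard_input(p, m):
    """Keys 1..N with N = p*p*m, in global sorted order, cut into p
    parts; each part is cut into p subintervals of m consecutive keys,
    and the subintervals of each part are dealt to the p processors by
    an independent uniformly random permutation."""
    local_inputs = [[] for _ in range(p)]
    key = 1
    for _part in range(p):
        subintervals = [list(range(key + i * m, key + (i + 1) * m))
                        for i in range(p)]
        key += p * m
        random.shuffle(subintervals)  # processor can't tell which one it got
        for proc in range(p):
            local_inputs[proc].extend(subintervals[proc])
    return local_inputs
```

Each processor ends up with one subinterval from every part, i.e. n/p = p·m keys, but its local data does not reveal which subinterval of each part it holds.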
#2. Lower bound: keep track of unsampled subintervals

[Figure: a part divided into p subintervals; the proof tracks which subintervals remain unsampled.]
#2. Lower bound: keep track of unsampled subintervals

Each round samples and histograms p keys. Mark each subinterval of a part with 1 if it was sampled and 0 otherwise, so a part becomes a 0/1 string such as 1 0000000 1 00000000 1 000000 1 00000 1.

Success run sequence: an unsampled sequence of subintervals in the same part that contains more than 1 unsampled block.

- In round i: X_i success run sequences, each of Y_i unsampled subintervals.
- In round i+1: at least X_{i+1} sequences, each of Y_{i+1} unsampled subintervals.
- Derive expressions for X_{i+1}, Y_{i+1} using the distribution theory of runs [6]. A small helper that counts these runs is sketched below.

[6] Alexander M. Mood. The distribution theory of runs. The Annals of Mathematical Statistics, 1940.
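As referenced in the list above, a small helper (hypothetical, for illustration) that counts the success runs, i.e. maximal runs of unsampled subintervals of length greater than 1 in a part's 0/1 string:

```python
def count_success_runs(bits):
    """Count maximal runs of 0s (unsampled subintervals) longer than 1,
    where bits[j] == 1 means subinterval j of the part was sampled."""
    runs = run_len = 0
    for b in bits:
        if b == 0:
            run_len += 1
        else:
            if run_len > 1:
                runs += 1
            run_len = 0
    return runs + (1 if run_len > 1 else 0)

# The example string above has four runs of 0s (lengths 7, 8, 6, 5).
bits = [1] + [0]*7 + [1] + [0]*8 + [1] + [0]*6 + [1] + [0]*5 + [1]
assert count_success_runs(bits) == 4
```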
How the full picture looks

[Figure: the rounds vs. communication-per-round trade-off, with one endpoint at the optimal communication for a 1-round algorithm and the other at the optimal number of rounds for O(p) communication per round. No algorithm can lie strictly inside or on the boundary of the shaded region.]
Conclusion

Optimal rounds and communication complexity for parallel partitioning.

Upper bound: improved analysis of HSS
- O(log* p) rounds
- O(p / log* p) communication per round
- Previously: O(log log p) rounds, O(p) communication per round

Lower bound
- Any algorithm with O(p) communication per round requires Ω(log* p) rounds.
- Any 1-round algorithm must have Ω(p log p) communication, which proves that Sample-sort, AMS-sort, and 1-round HSS are all optimal 1-round algorithms.

Thank you!
#2. Lower bound: proof outline (detail)

P[ any randomized algorithm cannot find a balanced partition for a worst-case input ]
  ≥
P[ the best deterministic algorithm cannot find a balanced partition for the randomized input distribution ]

Prove hardness for any deterministic algorithm and use Yao's principle to obtain bounds for any randomized algorithm.
Backup slide: [Figure: the global distribution divided into splitter intervals (splitter interval 1, splitter interval 2, ...); each splitter interval is composed of contiguous sequences of subintervals (subinterval 1, subinterval 2, ...) drawn from the local distributions at processors i and j.]