Parallel Sorting Algorithms and Amdahl's Law

CSE 332: Parallel Sorting
Richard Anderson
Spring 2016

Announcements

Recap
Last lectures:
simple parallel programs
common patterns: map, reduce
analysis tools (work, span, parallelism)
Now:
Amdahl's Law
parallel quicksort, merge sort
useful building blocks: prefix, pack
Analyzing Parallel Programs
Let T_P be the running time on P processors.
Two key measures of run-time:
Work: how long it would take 1 processor = T_1
Span: how long it would take infinitely many processors = T_∞
The hypothetical ideal for parallelization
This is the longest "dependence chain" in the computation
Example: O(log n) for summing an array
Also called "critical path length" or "computational depth"
Divide and Conquer Algorithms
Our fork and join frequently look like this: divide, base cases, combine results.
In this context, the span (T_∞) is:
the longest dependence chain; the longest 'branch' in the parallel 'tree'
Example: O(log n) for summing an array; we halve the data down to our cut-off, then add back together; O(log n) steps, O(1) time for each
Also called "critical path length" or "computational depth"
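To make the pattern concrete, here is a minimal ForkJoin sketch of the array-sum example (my illustration, not from the slides; the class name, cutoff value, and usage line are assumptions):

import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    static final int CUTOFF = 1000;           // sequential cutoff (a tuning choice)
    final int[] arr;
    final int lo, hi;                         // half-open range [lo, hi)

    SumTask(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo < CUTOFF) {               // base case: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;         // divide
        SumTask left = new SumTask(arr, lo, mid);
        SumTask right = new SumTask(arr, mid, hi);
        left.fork();                          // left half runs asynchronously
        long rightSum = right.compute();      // right half runs in this thread
        return left.join() + rightSum;        // combine results
    }
}
// e.g., long total = java.util.concurrent.ForkJoinPool.commonPool()
//                        .invoke(new SumTask(a, 0, a.length));

The fork tree has O(log n) levels with O(1) combine work per task, which is where the O(log n) span comes from.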
Parallel Speed-up
Speed-up on P processors: T_1 / T_P
If speed-up is P, we call it perfect linear speed-up
e.g., doubling P halves running time
hard to achieve in practice
Parallelism is the maximum possible speed-up: T_1 / T_∞
if you had infinite processors
Estimating T_P
How to estimate T_P (e.g., P = 4)?
Lower bounds on T_P (ignoring memory, caching, ...):
1. T_∞
2. T_1 / P
Which one is the tighter (higher) lower bound?
The ForkJoin Java Framework achieves the following expected-time asymptotic bound:
T_P ∈ O(T_∞ + T_1 / P)
This bound is optimal.
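A quick worked example (the numbers are invented for illustration): suppose T_1 = 1000 and T_∞ = 10. On P = 4 processors, T_P >= max(T_∞, T_1/P) = max(10, 250) = 250, so T_1/P is the tighter bound; on P = 1000 processors, T_P >= max(10, 1) = 10, and the span bound dominates. The ForkJoin bound promises a running time within a constant factor of the sum of the two.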
Amdahl’s Law
Most programs have
1.
parts that parallelize well
2.
parts that don’t parallelize at all
The latter become bottlenecks
Amdahl's Law
Let T_1 = 1 unit of time
Let S = the proportion that can't be parallelized
1 = T_1 = S + (1 - S)
Suppose we get perfect linear speedup on the parallel portion:
T_P = S + (1 - S) / P
So the overall speed-up on P processors is (Amdahl's Law):
T_1 / T_P = 1 / (S + (1 - S) / P)
T_1 / T_∞ = 1 / S
If 1/3 of your program is parallelizable, max speedup is 1/S = 1/(2/3) = 1.5x.
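A quick numeric check of the formula (a sketch; the 25% figure anticipates the next slide):

class Amdahl {
    static double speedup(double s, int p) {   // Amdahl's Law: 1 / (S + (1-S)/P)
        return 1.0 / (s + (1.0 - s) / p);
    }
    public static void main(String[] args) {
        double s = 0.25;                        // 25% sequential
        for (int p : new int[]{1, 4, 16, 1000, 1_000_000_000})
            System.out.printf("P = %-10d speedup = %.3f%n", p, speedup(s, p));
        // output approaches 1/S = 4.0 no matter how large P gets
    }
}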
Pretty Bad News
Suppose 25% of your program is sequential. Then a billion processors won't give you more than a 4x speedup!
What portion of your program must be parallelizable to get 10x speedup on a 1000-core GPU?
10 <= 1 / (S + (1-S)/1000)
Solving: S + (1-S)/1000 <= 1/10, so 999*S <= 99, i.e., S <= 99/999 ≈ 0.099. About 90% of the program must be parallelizable.
Motivates minimizing the sequential portions of your programs.
Take Aways
Parallel algorithms can be a big win
Many fit standard patterns that are easy to implement
Can't just rely on more processors to make things faster (Amdahl's Law)
Parallelizable?
Fibonacci (N)
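The slide poses this as a question. A sketch of one answer (my example, not the course's): the doubly recursive definition forks naturally, but it does exponential work; the efficient O(n) loop is a single dependence chain, so it offers no parallelism to exploit.

import java.util.concurrent.RecursiveTask;

// Naive doubly recursive Fibonacci forks cleanly: exponential work, O(n) span.
// The linear-time loop (a, b -> a + b) also has span O(n): every step depends
// on the previous one, so extra processors don't help there.
class FibTask extends RecursiveTask<Long> {
    final int n;
    FibTask(int n) { this.n = n; }

    @Override protected Long compute() {
        if (n < 2) return (long) n;           // base case
        FibTask f1 = new FibTask(n - 1);
        FibTask f2 = new FibTask(n - 2);
        f1.fork();                            // fib(n-1) runs asynchronously
        long r2 = f2.compute();               // fib(n-2) runs in this thread
        return f1.join() + r2;
    }
}
// e.g., java.util.concurrent.ForkJoinPool.commonPool().invoke(new FibTask(30))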
Parallelizable? Prefix-sum
output[i] = input[0] + input[1] + ... + input[i]
input:  6  3  11  10  8  2  7  8
output: 6  9  20  30  38  40  47  55
First Pass: Sum
Build a tree of range sums over the input 6 3 11 10 8 2 7 8:
Sum[0,7] = 55
Sum[0,3] = 30        Sum[4,7] = 25
Sum[0,1] = 9   Sum[2,3] = 21   Sum[4,5] = 10   Sum[6,7] = 15
2nd Pass: Use Sum for Prefix-Sum
Annotate each node with sum<K, the sum of all elements to the left of the node's range:
Sum[0,7]: sum<0 = 0
Sum[0,3]: sum<0 = 0        Sum[4,7]: sum<4 = 30
Sum[0,1]: sum<0 = 0   Sum[2,3]: sum<2 = 9   Sum[4,5]: sum<4 = 30   Sum[6,7]: sum<6 = 40
Go from the root down to the leaves:
Root: sum<0 = 0
Left child: sum<K = parent's sum<K
Right child: sum<K = parent's sum<K + left sibling's sum
Prefix-Sum Analysis
First Pass (Sum):
span = O(log n)
Second Pass:
single pass from root down to leaves
update children's sum<K value based on parent and sibling
span = O(log n)
Total:
span = O(log n)
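A compact sketch of the two passes, written sequentially with comments marking where the parallel forks would go (class and method names are this example's choices):

class PrefixSum {
    static class Node {
        int lo, hi, sum;          // range [lo, hi) and its total
        Node left, right;
    }

    static Node buildSums(int[] in, int lo, int hi) {   // first pass, span O(log n)
        Node n = new Node();
        n.lo = lo; n.hi = hi;
        if (hi - lo == 1) {
            n.sum = in[lo];                  // leaf (cutoff of 1 for clarity)
        } else {
            int mid = (lo + hi) / 2;
            n.left = buildSums(in, lo, mid);   // in ForkJoin, these two calls
            n.right = buildSums(in, mid, hi);  // would run in parallel
            n.sum = n.left.sum + n.right.sum;
        }
        return n;
    }

    static void fillPrefix(Node n, int[] in, int[] out, int sumLeft) {  // second pass
        if (n.left == null) {
            out[n.lo] = sumLeft + in[n.lo];                   // inclusive prefix sum
        } else {
            fillPrefix(n.left, in, out, sumLeft);             // sum<K unchanged
            fillPrefix(n.right, in, out, sumLeft + n.left.sum); // add left sibling's sum
        }
    }

    static int[] prefixSum(int[] in) {
        if (in.length == 0) return new int[0];
        int[] out = new int[in.length];
        fillPrefix(buildSums(in, 0, in.length), in, out, 0);
        return out;
    }
}
// prefixSum(new int[]{6,3,11,10,8,2,7,8}) returns {6,9,20,30,38,40,47,55}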
Parallel Prefix, Generalized
Prefix-sum is an instance of a common pattern (prefix problems):
maximum element to the left of i
is there an element to the left of i satisfying some property?
count of elements to the left of i satisfying some property
We can solve all of these problems in the same way.
Pack
Pack: output the array of elements satisfying a test, in their original order.
input:  6  3  11  10  8  2  7  8      test: X < 8?
output: 6  3  2  7
Parallel Pack?
Pack: input 6 3 11 10 8 2 7 8, test X < 8, output 6 3 2 7.
Determining which elements to include is easy.
Determining where each element goes in the output is hard:
it seems to depend on previous results.
Parallel Pack
input: 6  3  11  10  8  2  7  8      test: X < 8?
1. map test across input, producing a [0,1] bit vector:
   test: 1  1  0  0  0  1  1  0
2. prefix-sum on the bit vector, transforming it into an array of indices into the result array:
   pos:  1  2  2  2  2  3  4  4
3. map input to the corresponding positions in the output:
   if (test[i] == 1) output[pos[i] - 1] = input[i]
   (pos is an inclusive count, so subtract 1 for 0-based indexing)
output: 6  3  2  7
Parallel Pack Analysis
Parallel Pack:
1. map:          O(log n) span
2. prefix-sum:   O(log n) span
3. map:          O(log n) span
Total:           O(log n) span
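The three steps in code, again written sequentially with the parallel steps marked in comments; the IntTest interface and the reuse of the PrefixSum sketch above are this example's assumptions:

class Pack {
    interface IntTest { boolean test(int x); }

    static int[] pack(int[] in, IntTest t) {
        int n = in.length;
        if (n == 0) return new int[0];
        int[] bits = new int[n];
        for (int i = 0; i < n; i++)             // step 1: a parallel map
            bits[i] = t.test(in[i]) ? 1 : 0;
        int[] pos = PrefixSum.prefixSum(bits);  // step 2: parallel prefix-sum
        int[] out = new int[pos[n - 1]];        // pos[n-1] = count of survivors
        for (int i = 0; i < n; i++)             // step 3: a parallel map
            if (bits[i] == 1) out[pos[i] - 1] = in[i];
        return out;
    }
}
// Pack.pack(new int[]{6,3,11,10,8,2,7,8}, x -> x < 8) returns {6, 3, 2, 7}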
Sequential Quicksort
Quicksort (review):
1. Pick a pivot                              O(1)
2. Partition into two sub-arrays             O(n)
   A. values less than pivot
   B. values greater than pivot
3. Recursively sort A and B                  2T(n/2), avg
Complexity (avg case):
T(n) = n + 2T(n/2),  T(0) = T(1) = 1
O(n log n)
How to parallelize?
Parallel Quicksort
Quicksort:
1. Pick a pivot                              O(1)
2. Partition into two sub-arrays             O(n)
   A. values less than pivot
   B. values greater than pivot
3. Recursively sort A and B in parallel      T(n/2), avg
Complexity (avg case):
T(n) = n + T(n/2),  T(0) = T(1) = 1
Span: O(n)
Parallelism (work/span) = O(log n)
Taking it to the next level...
O(log n) speed-up with infinite processors is okay, but a bit underwhelming:
sort 10^9 elements only about 30x faster.
Bottleneck: the O(n) sequential partition.
Parallel Partition
Partition into sub-arrays:
A. values less than pivot
B. values greater than pivot
What parallel operation can we use for this?
Parallel Partition
Pick pivot (6):              8  1  4  9  0  3  5  2  7  6
Pack (test: < 6):            1  4  0  3  5  2
Right pack (test: >= 6):     1  4  0  3  5  2  6  8  9  7
Parallel Quicksort
Quicksort:
1. Pick a pivot                              O(1)
2. Partition into two sub-arrays             O(log n) span (via pack)
   A. values less than pivot
   B. values greater than pivot
3. Recursively sort A and B in parallel      T(n/2), avg
Complexity (avg case):
T(n) = O(log n) + T(n/2),  T(0) = T(1) = 1
Span: O(log^2 n)
Parallelism (work/span) = O(n / log n)
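Putting pivot, pack, and parallel recursion together, a sketch (not the course's reference code; the naive pivot choice and the Pack helper above are assumptions, and the two recursive sorts are marked where ForkJoin would fork them):

import java.util.Arrays;

class ParallelQuicksort {
    static int[] sort(int[] a) {
        if (a.length <= 1) return a;
        int pivot = a[a.length - 1];                      // naive pivot: last element
        int[] rest = Arrays.copyOf(a, a.length - 1);
        int[] less = Pack.pack(rest, x -> x < pivot);     // O(log n) span
        int[] geq  = Pack.pack(rest, x -> x >= pivot);    // O(log n) span
        int[] sortedLess = sort(less);                    // in ForkJoin, these two
        int[] sortedGeq  = sort(geq);                     // calls would run in parallel
        int[] out = new int[a.length];
        System.arraycopy(sortedLess, 0, out, 0, sortedLess.length);
        out[sortedLess.length] = pivot;
        System.arraycopy(sortedGeq, 0, out, sortedLess.length + 1, sortedGeq.length);
        return out;
    }
}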
Sequential Mergesort
Mergesort (review):
1. Sort left and right halves                2T(n/2)
2. Merge results                             O(n)
Complexity (worst case):
T(n) = n + 2T(n/2),  T(0) = T(1) = 1
O(n log n)
How to parallelize?
Do left + right in parallel; span improves to O(n).
To do better, we need to parallelize the merge.
Parallel Merge
How to merge two sorted lists in parallel?
Example: 0 4 6 8 9 and 1 2 3 5 7
Parallel Merge
1. Choose median M of the left half          O(1)
2. Split both arrays into < M, >= M          O(log n)
   (how? binary search the other array)
Example: left 0 4 [6] 8 9 with M = 6; right 1 2 3 5 7 splits into 1 2 3 5 | 7.
Parallel Merge
1. Choose median M of the left half
2. Split both arrays into < M, >= M (binary search)
3. Do the two submerges in parallel:
   merge(0 4 | 1 2 3 5) and merge(6 8 9 | 7)
(Figure: the submerges recurse, each again splitting on the median of its bigger half, until base cases copy runs into the output array; the example ends as 0 1 2 3 4 5 6 7 8 9.)
When we do each merge in parallel:
we split the bigger array in half
we use binary search to split the smaller array
and in the base case we copy to the output array
Parallel Mergesort Pseudocode
Merge(arr[], left1, left2, right1, right2, out[], out1, out2)
  int leftSize = left2 - left1
  int rightSize = right2 - right1
  // Assert: out2 - out1 = leftSize + rightSize
  // We will assume leftSize > rightSize without loss of generality

  if (leftSize + rightSize < CUTOFF)
    sequential merge and copy into out[out1..out2]

  int mid = (left2 - left1)/2
  binarySearch arr[right1..right2] to find j such that
    arr[j] ≤ arr[mid] ≤ arr[j+1]

  Merge(arr[], left1, mid, right1, j, out[], out1, out1+mid+j)
  Merge(arr[], mid+1, left2, j+1, right2, out[], out1+mid+j+1, out2)
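For readers who want to run it, here is one way to flesh the pseudocode out into Java. The half-open index ranges, names, and CUTOFF value are this sketch's choices, and the binary-search and offset details are worked out explicitly rather than copied from the slide:

import java.util.concurrent.RecursiveAction;

class ParallelMerge extends RecursiveAction {
    static final int CUTOFF = 16;
    final int[] arr, out;
    final int l1, l2, r1, r2, o1;   // merge sorted arr[l1,l2) and arr[r1,r2) into out[o1..)

    ParallelMerge(int[] arr, int l1, int l2, int r1, int r2, int[] out, int o1) {
        this.arr = arr; this.l1 = l1; this.l2 = l2;
        this.r1 = r1; this.r2 = r2; this.out = out; this.o1 = o1;
    }

    @Override protected void compute() {
        if (l2 - l1 < r2 - r1) {               // WLOG make the left range the bigger one
            new ParallelMerge(arr, r1, r2, l1, l2, out, o1).compute();
            return;
        }
        if ((l2 - l1) + (r2 - r1) < CUTOFF) {  // base case: plain sequential merge
            int i = l1, j = r1, k = o1;
            while (i < l2 && j < r2) out[k++] = (arr[i] <= arr[j]) ? arr[i++] : arr[j++];
            while (i < l2) out[k++] = arr[i++];
            while (j < r2) out[k++] = arr[j++];
            return;
        }
        int mid = l1 + (l2 - l1) / 2;          // median index of the (bigger) left range
        int lo = r1, hi = r2;                  // binary search the smaller range:
        while (lo < hi) {                      // find first j with arr[j] > arr[mid]
            int m = (lo + hi) >>> 1;
            if (arr[m] <= arr[mid]) lo = m + 1; else hi = m;
        }
        int j = lo;
        ParallelMerge left  = new ParallelMerge(arr, l1, mid + 1, r1, j, out, o1);
        ParallelMerge right = new ParallelMerge(arr, mid + 1, l2, j, r2, out,
                                                o1 + (mid + 1 - l1) + (j - r1));
        invokeAll(left, right);                // the two submerges run in parallel
    }
}
// Usage sketch: merge a[0,5) = {0,4,6,8,9} with a[5,10) = {1,2,3,5,7}:
//   java.util.concurrent.ForkJoinPool.commonPool()
//       .invoke(new ParallelMerge(a, 0, 5, 5, 10, new int[10], 0));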
Analysis
Parallel Merge (worst case):
Height of partition call tree with n elements:  O(log n)
Complexity of each thread (ignoring recursive calls):  O(log n)
Span:  O(log^2 n)
Parallel Mergesort (worst case):
Span:  O(log^3 n)
Parallelism (work / span):  O(n / log^2 n)
Subtlety: uneven splits
but even in the worst case we get a 3/4 to 1/4 split
which still gives O(log n) height
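Filling in the recurrences behind those bounds (standard reasoning, consistent with the parallelism figures on the next slide; the 3/4 factor comes from the uneven-split subtlety above):

S_merge(n) = S_merge(3n/4) + O(log n)    // O(log n) tree levels, O(log n) binary search each
           = O(log^2 n)
S_sort(n)  = S_sort(n/2) + S_merge(n) = S_sort(n/2) + O(log^2 n)
           = O(log^3 n)
Parallelism = work / span = O(n log n) / O(log^3 n) = O(n / log^2 n)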
Parallel Quicksort vs. Mergesort
Parallelism (work / span):
quicksort:  O(n / log n)       avg case
mergesort:  O(n / log^2 n)     worst case