Algorithmic Complexity Measures and the Master Method

LECTURE 2
Recurrences (Review)
Matrix Multiplication
Merge Sort
Tableau Construction
Conclusion
Algorithmic Complexity Measures

TP = execution time on P processors
T1 = work
T∞ = span*

LOWER BOUNDS
TP ≥ T1/P
TP ≥ T∞

*Also called critical-path length or computational depth.
 
Speedup

Definition: T1/TP = speedup on P processors.

If T1/TP = Θ(P), we have linear speedup;
         = P, we have perfect linear speedup;
         > P, we have superlinear speedup, which is not possible in our model because of the lower bound TP ≥ T1/P.
 
July 13, 2006

Parallelism

Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is

  T1/T∞ = parallelism
        = the average amount of work per step along the span.
The Master Method

The Master Method for solving recurrences applies to recurrences of the form

  T(n) = a T(n/b) + f(n),

where a ≥ 1, b > 1, and f is asymptotically positive.*

IDEA: Compare n^(log_b a) with f(n).

*The unstated base case is T(n) = Θ(1) for sufficiently small n.
Master Method — CASE 1

  T(n) = a T(n/b) + f(n)

n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a − ε)) for some constant ε > 0.

Solution: T(n) = Θ(n^(log_b a)).
Master Method — CASE 2

  T(n) = a T(n/b) + f(n)

n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.

Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).
Master Method — CASE 3

  T(n) = a T(n/b) + f(n)

n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.

Solution: T(n) = Θ(f(n)).
 
Master Method Summary

  T(n) = a T(n/b) + f(n)

CASE 1: f(n) = O(n^(log_b a − ε)), constant ε > 0
  ⇒ T(n) = Θ(n^(log_b a)).

CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0
  ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).

CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and regularity condition
  ⇒ T(n) = Θ(f(n)).
Master Method Quiz

T(n) = 4 T(n/2) + n
  n^(log_b a) = n^2 ≫ n  ⇒  CASE 1: T(n) = Θ(n^2).

T(n) = 4 T(n/2) + n^2
  n^(log_b a) = n^2 = n^2 lg^0 n  ⇒  CASE 2: T(n) = Θ(n^2 lg n).

T(n) = 4 T(n/2) + n^3
  n^(log_b a) = n^2 ≪ n^3  ⇒  CASE 3: T(n) = Θ(n^3).

T(n) = 4 T(n/2) + n^2/lg n
  Master method does not apply!
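The quiz answers can be sanity-checked numerically. The sketch below (an illustration added here, not from the slides) evaluates each recurrence exactly for powers of 2 and divides by the claimed Θ-bound; the ratio settling to a constant confirms the growth rate.

```python
import math

def solve(n, a, b, f):
    """Evaluate T(n) = a*T(n/b) + f(n) with T(1) = 1, for n a power of b."""
    return 1 if n <= 1 else a * solve(n // b, a, b, f) + f(n)

# Quiz recurrences: a = 4, b = 2, so n^(log_b a) = n^2.
cases = [
    ("CASE 1", lambda n: n,      lambda n: n**2),                  # claimed Theta(n^2)
    ("CASE 2", lambda n: n**2,   lambda n: n**2 * math.log2(n)),   # claimed Theta(n^2 lg n)
    ("CASE 3", lambda n: n**3,   lambda n: n**3),                  # claimed Theta(n^3)
]
for label, f, bound in cases:
    r_small = solve(2**10, 4, 2, f) / bound(2**10)
    r_large = solve(2**16, 4, 2, f) / bound(2**16)
    # T(n)/bound(n) stabilizes, so the claimed bound has the right growth rate
    print(label, round(r_large / r_small, 2))
```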
 
 
Square-Matrix Multiplication

C = A × B, where c_ij = Σ_{k=1..n} a_ik b_kj.

Assume for simplicity that n = 2^k.
Recursive Matrix Multiplication

Divide and conquer:

  [C11 C12]   [A11 A12]   [B11 B12]
  [C21 C22] = [A21 A22] × [B21 B22]

8 multiplications of (n/2) × (n/2) matrices.
1 addition of n × n matrices.
Matrix Multiply in Pseudo-Cilk (C = A ⋅ B)

cilk void Mult(*C, *A, *B, n) {
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  spawn Mult(C12,A11,B12,n/2);
  spawn Mult(C22,A21,B12,n/2);
  spawn Mult(C21,A21,B11,n/2);
  spawn Mult(T11,A12,B21,n/2);
  spawn Mult(T12,A12,B22,n/2);
  spawn Mult(T22,A22,B22,n/2);
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}

Note the absence of type declarations.
Coarsen base cases for efficiency.

Submatrices are produced by pointer calculation, not copying of elements; this also means Mult needs a row-size argument for array indexing.
The addition helper (C = C + T):

cilk void Add(*C, *T, n) {
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);
  spawn Add(C12,T12,n/2);
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}
 
Work of Matrix Addition

Work: A1(n) = 4 A1(n/2) + Θ(1)
            = Θ(n^2)   — CASE 1, since n^(log_b a) = n^(log_2 4) = n^2 ≫ Θ(1).
Span of Matrix Addition

Span: A∞(n) = A∞(n/2) + Θ(1)
            = Θ(lg n)   — CASE 2, since n^(log_b a) = n^(log_2 1) = 1 and f(n) = Θ(n^(log_b a) lg^0 n).

(The four spawned quadrant additions run in parallel, so the span takes the maximum over them — hence only one A∞(n/2) term.)
 
Work of Matrix Multiplication

Work: M1(n) = 8 M1(n/2) + A1(n) + Θ(1)
            = 8 M1(n/2) + Θ(n^2)
            = Θ(n^3)   — CASE 1, since n^(log_b a) = n^(log_2 8) = n^3 ≫ Θ(n^2).
Span of Matrix Multiplication

Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)   (the 8 spawned multiplications run in parallel: maximum, not sum)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg^2 n)   — CASE 2, since n^(log_b a) = n^(log_2 1) = 1 and f(n) = Θ(n^(log_b a) lg^1 n).
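Both recurrences are easy to evaluate exactly; a short Python sketch (an illustration added here, with the Θ-terms replaced by exact values) shows the cubic work and roughly (lg n)²/2 span:

```python
import math

def work(n):
    """M1(n) = 8*M1(n/2) + n^2 with M1(1) = 1: the Theta(n^2) add cost made exact."""
    return 1 if n == 1 else 8 * work(n // 2) + n * n

def span(n):
    """Minf(n) = Minf(n/2) + lg n with Minf(1) = 1: the Theta(lg n) term made exact."""
    return 1 if n == 1 else span(n // 2) + math.log2(n)

n = 2**12
print(work(n) / n**3)   # approaches a constant: work is Theta(n^3)
print(span(n))          # 1 + (1 + 2 + ... + 12) = 79, about (lg n)^2 / 2
```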
Parallelism of Matrix Multiply

Work: M1(n) = Θ(n^3)
Span: M∞(n) = Θ(lg^2 n)
Parallelism: M1(n)/M∞(n) = Θ(n^3/lg^2 n)

For 1000 × 1000 matrices, parallelism ≈ (10^3)^3/10^2 = 10^7.
Stack Temporaries

The temporary in Mult:

  float *T = Cilk_alloca(n*n*sizeof(float));

In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance.

IDEA: Trade off parallelism for less storage.
No-Temp Matrix Multiplication

cilk void MultA(*C, *A, *B, n) {
  // C = C + A * B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);
  spawn MultA(C12,A11,B12,n/2);
  spawn MultA(C22,A21,B12,n/2);
  spawn MultA(C21,A21,B11,n/2);
  sync;
  spawn MultA(C21,A22,B21,n/2);
  spawn MultA(C22,A22,B22,n/2);
  spawn MultA(C12,A12,B22,n/2);
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}

Saves space, but at what expense?
 
  
Work of No-Temp Multiply

Work: M1(n) = 8 M1(n/2) + Θ(1)
            = Θ(n^3)   — CASE 1.
Span of No-Temp Multiply

Span: M∞(n) = 2 M∞(n/2) + Θ(1)
            = Θ(n)   — CASE 1, since n^(log_2 2) = n ≫ Θ(1).

(The factor is 2 because the two groups of four spawns, separated by the first sync, execute in series; within each group the spawns run in parallel, so we take the maximum.)
Parallelism of No-Temp Multiply

Work: M1(n) = Θ(n^3)
Span: M∞(n) = Θ(n)
Parallelism: M1(n)/M∞(n) = Θ(n^2)

For 1000 × 1000 matrices, parallelism ≈ (10^3)^3/10^3 = 10^6.

Faster in practice!
Testing Synchronization

Cilk language feature: A programmer can check whether a Cilk procedure is "synched" (without actually performing a sync) by testing the pseudovariable SYNCHED:

SYNCHED = 0  ⇒  some spawned children might not have returned.
SYNCHED = 1  ⇒  all spawned children have definitely returned.
Best of Both Worlds

cilk void Mult1(*C, *A, *B, n) {  // multiply & store
  ⟨base case & partition matrices⟩
  spawn Mult1(C11,A11,B11,n/2);   // multiply & store
  spawn Mult1(C12,A11,B12,n/2);
  spawn Mult1(C22,A21,B12,n/2);
  spawn Mult1(C21,A21,B11,n/2);
  if (SYNCHED) {
    spawn MultA1(C11,A12,B21,n/2);  // multiply & add
    spawn MultA1(C12,A12,B22,n/2);
    spawn MultA1(C22,A22,B22,n/2);
    spawn MultA1(C21,A22,B21,n/2);
  } else {
    float *T = Cilk_alloca(n*n*sizeof(float));
    spawn Mult1(T11,A12,B21,n/2);   // multiply & store
    spawn Mult1(T12,A12,B22,n/2);
    spawn Mult1(T22,A22,B22,n/2);
    spawn Mult1(T21,A22,B21,n/2);
    sync;
    spawn Add(C,T,n);               // C = C + T
  }
  sync;
  return;
}

This code is just as parallel as the original, but it only uses more space if runtime parallelism actually exists.
Ordinary Matrix Multiplication

c_ij = Σ_{k=1..n} a_ik b_kj

IDEA: Spawn n^2 inner products in parallel; compute each inner product in parallel.

Work: Θ(n^3)
Span: Θ(lg n)
Parallelism: Θ(n^3/lg n)

BUT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMPs.
 
Merging Two Sorted Arrays

Example: merging {3, 12, 19, 46} with {4, 14, 21, 23}.

void Merge(int *C, int *A, int *B, int na, int nb) {
  while (na>0 && nb>0) {
    if (*A <= *B) {
      *C++ = *A++; na--;
    } else {
      *C++ = *B++; nb--;
    }
  }
  while (na>0) {
    *C++ = *A++; na--;
  }
  while (nb>0) {
    *C++ = *B++; nb--;
  }
}

Time to merge n elements = Θ(n).
Merge Sort

cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

Figure: the merge tree for the input {19, 3, 12, 46, 33, 4, 21, 14} — halves are sorted recursively, then merged pairwise up the tree.

Work of Merge Sort

Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)   — CASE 2, since n^(log_b a) = n^(log_2 2) = n and f(n) = Θ(n^(log_b a) lg^0 n).
 
Span of Merge Sort

Span: T∞(n) = T∞(n/2) + Θ(n)
            = Θ(n)   — CASE 3, since n^(log_b a) = n^(log_2 1) = 1 ≪ Θ(n).
Parallelism of Merge Sort

Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T1(n)/T∞(n) = Θ(lg n)

We need to parallelize the merge!
Parallel Merge

Figure: arrays A (length na) and B (length nb), with na ≥ nb, each split for the recursive merges.

KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4) n.
Parallel Merge

cilk void P_Merge(int *C, int *A, int *B,
                  int na, int nb) {
  if (na < nb) {
    spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0];  /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0];  /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    spawn P_Merge(C, A, B, ma, mb);
    spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    sync;
  }
}

Coarsen base cases for efficiency.
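The divide step can be sketched serially in Python (an illustration; bisect_left plays the role of BinarySearch, and the two recursive calls are the ones the Cilk code spawns):

```python
import bisect

def p_merge(A, B):
    """Merge two sorted lists with the P_Merge split: recurse around the
    median of the longer list and its binary-searched position in the other."""
    if len(A) < len(B):
        A, B = B, A                     # ensure na >= nb
    if len(A) <= 1:
        return sorted(A + B)            # base case: at most two elements total
    ma = len(A) // 2
    mb = bisect.bisect_left(B, A[ma])   # BinarySearch(A[ma], B, nb)
    # In Cilk, the two halves below would be spawned, followed by a sync.
    return p_merge(A[:ma], B[:mb]) + p_merge(A[ma:], B[mb:])

print(p_merge([3, 12, 19, 46], [4, 14, 21, 23]))
# [3, 4, 12, 14, 19, 21, 23, 46]
```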
 
Span of P_Merge

Span: T∞(n) = T∞(3n/4) + Θ(lg n)
            = Θ(lg^2 n)   — CASE 2, since n^(log_b a) = n^(log_{4/3} 1) = 1 and f(n) = Θ(n^(log_b a) lg^1 n).
Work of P_Merge

Work: T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n),
where 1/4 ≤ α ≤ 3/4.

CLAIM: T1(n) = Θ(n).
Analysis of Work Recurrence

T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.

Substitution method: the inductive hypothesis is T1(k) ≤ c1 k − c2 lg k, where c1, c2 > 0. Prove that the relation holds, and solve for c1 and c2.

T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n)
      ≤ c1(αn) − c2 lg(αn) + c1(1−α)n − c2 lg((1−α)n) + Θ(lg n)
      ≤ c1 n − c2 (lg(αn) + lg((1−α)n)) + Θ(lg n)
      ≤ c1 n − c2 (lg(α(1−α)) + 2 lg n) + Θ(lg n)
      ≤ c1 n − c2 lg n − (c2 (lg n + lg(α(1−α))) − Θ(lg n))
      ≤ c1 n − c2 lg n,

by choosing c1 and c2 large enough.
 
Parallelism of P_Merge

Work: T1(n) = Θ(n)
Span: T∞(n) = Θ(lg^2 n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg^2 n)
Parallel Merge Sort

cilk void P_MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn P_MergeSort(C, A, n/2);
    spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  }
}
 
Work of Parallel Merge Sort

Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)   — CASE 2.
Span of Parallel Merge Sort

Span: T∞(n) = T∞(n/2) + Θ(lg^2 n)
            = Θ(lg^3 n)   — CASE 2, since n^(log_b a) = n^(log_2 1) = 1 and f(n) = Θ(n^(log_b a) lg^2 n).
 
Parallelism of Merge Sort

Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg^3 n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg^2 n)
 
Tableau Construction

Problem: Fill in an n × n tableau A, where

  A[i, j] = f( A[i, j−1], A[i−1, j], A[i−1, j−1] ).

Applications: dynamic programming — longest common subsequence, edit distance, time warping.

Work: Θ(n^2).
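The dependence pattern is easy to sketch in Python (illustrative only; the update rule f below — max of the three neighbors plus one — is a stand-in for whatever DP the application needs):

```python
def fill_tableau(n, f):
    """Fill an n-by-n tableau where A[i][j] depends on its west, north,
    and northwest neighbors; out-of-range neighbors are treated as 0."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            west  = A[i][j - 1] if j > 0 else 0
            north = A[i - 1][j] if i > 0 else 0
            nw    = A[i - 1][j - 1] if i > 0 and j > 0 else 0
            A[i][j] = f(west, north, nw)
    return A

# With this f, cell (i, j) holds i + j + 1: the length of the longest
# dependency chain ending at the cell.
A = fill_tableau(8, lambda w, n, d: max(w, n, d) + 1)
print(A[7][7])  # 15
```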
Recursive Construction

Split the n × n tableau into four (n/2) × (n/2) quadrants I, II, III, IV. Quadrant I is filled first; II and III can then be filled in parallel; IV is filled last.

Cilk code:

  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  sync;

Work: T1(n) = 4 T1(n/2) + Θ(1)
            = Θ(n^2)   — CASE 1.
Span: T∞(n) = 3 T∞(n/2) + Θ(1)
            = Θ(n^(lg 3))   — CASE 1.

(The coefficient is 3, not 4: II and III run in parallel, so the span is T∞(I) + max(T∞(II), T∞(III)) + T∞(IV).)
 
Analysis of Tableau Construction

Work: T1(n) = Θ(n^2)
Span: T∞(n) = Θ(n^(lg 3)) ≈ Θ(n^1.58)
Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.42)
A More-Parallel Construction

Split the n × n tableau into nine (n/3) × (n/3) regions I–IX, filled antidiagonal by antidiagonal:

  spawn I;
  sync;
  spawn II;
  spawn III;
  sync;
  spawn IV;
  spawn V;
  spawn VI;
  sync;
  spawn VII;
  spawn VIII;
  sync;
  spawn IX;
  sync;
Work: T1(n) = 9 T1(n/3) + Θ(1)
            = Θ(n^2)   — CASE 1.
Span: T∞(n) = 5 T∞(n/3) + Θ(1)
            = Θ(n^(log_3 5))   — CASE 1.

(The coefficient is 5: the schedule has five sequential stages — I; II‖III; IV‖V‖VI; VII‖VIII; IX.)
Analysis of Revised Construction

Work: T1(n) = Θ(n^2)
Span: T∞(n) = Θ(n^(log_3 5)) ≈ Θ(n^1.46)
Parallelism: T1(n)/T∞(n) ≈ Θ(n^0.54)

More parallel by a factor of Θ(n^0.54)/Θ(n^0.42) = Θ(n^0.12).
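The two span recurrences can be compared numerically; estimating the growth exponent of S(n) = a·S(n/b) + 1 from two sample points recovers lg 3 and log_3 5 (a Python sketch added for illustration):

```python
import math

def span(n, a, b):
    """S(n) = a*S(n/b) + 1 with S(1) = 1: the Theta(1) terms made exact."""
    return 1 if n <= 1 else a * span(n // b, a, b) + 1

def exponent(a, b, k=8):
    """Estimate p with S(n) ~ n^p from the points n = b^k and n = b^(2k)."""
    lo, hi = span(b**k, a, b), span(b**(2 * k), a, b)
    return math.log(hi / lo) / math.log(b**k)

print(round(exponent(3, 2), 2))   # 1.58 ~ lg 3      (divide into 4 quadrants)
print(round(exponent(5, 3), 2))   # 1.46 ~ log_3 5   (divide into 9 regions)
```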
 
Key Ideas

Cilk is simple: cilk, spawn, sync, SYNCHED.
Recurrences, recurrences, recurrences, …
Work & span. Work & span. Work & span. …
 
Palindrome

Propose a Cilk palindrome solver.
What is the key idea?
What is the algorithm?
What is the span?
What is the work?
What is the parallelism?
Palindrome

A divide-and-conquer solver patterned on MergeSort (the slides reuse the MergeSort code as the template):

cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

Work of Palindrome

Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)   — CASE 2, since n^(log_b a) = n^(log_2 2) = n and f(n) = Θ(n^(log_b a) lg^0 n).

Span of Palindrome

Span: T∞(n) = T∞(n/2) + Θ(n)
            = Θ(n)   — CASE 3, since n^(log_b a) = n^(log_2 1) = 1 ≪ Θ(n).
Presentation Transcript


  1. LECTURE 2 Recurrences (Review) Recurrences (Review) Matrix Multiplication Matrix Multiplication Merge Sort Merge Sort Tableau Construction Tableau Construction Conclusion Conclusion 1

  2. Algorithmic Complexity Measures TP = execution time on P processors T1 = work T1 = span* LOWER BOUNDS TP T1/P TP T1 * Also called critical-path length or computational depth.

  3. Speedup Definition: T1/TP = speedup on P processors. If T1/TP = (P) P, we have linear speedup; = P, we have perfect linear speedup; > P, we have superlinear speedup, which is not possible in our model, because of the lower bound TP T1/P.

  4. Parallelism Because we have the lower bound TP T1, the maximum possible speedup given T1 and T1 is T1/T1 = parallelism = the average amount of work per step along the span. July 13, 2006 4

  5. The Master Method The Master Method for solving recurrences applies to recurrences of the form T(n) = aT(n/b) + f(n) , where a 1, b > 1, and fis asymptotically positive. * IDEA: Compare nlogba with f(n). * The unstated base case is T(n) = (1) for sufficiently small n. 5

  6. Master Method CASE 1 T(n) = aT(n/b) + f(n) nlogba f (n) Specifically, f(n) = O(nlogba ) for some constant > 0. Solution: T(n) = (nlogba) . 6

  7. Master Method CASE 2 T(n) = aT(n/b) + f(n) nlogba f (n) Specifically, f(n) = (nlogba lgkn) for some constant k 0. Solution: T(n) = (nlogba lgk+1n) . 7

  8. Master Method CASE 3 T(n) = aT(n/b) + f(n) nlogba f (n) Specifically, f(n) = (nlogba + ) for some constant > 0 andf(n) satisfies the regularity condition that af(n/b) cf(n) for some constant c < 1. Solution: T(n) = (f(n)) . 8

  9. Master Method Summary T(n) = aT(n/b) + f(n) CASE 1: f(n) = O(nlogba ), constant > 0 T(n) = (nlogba) . CASE 2: f(n) = (nlogba lgkn), constant k 0 T(n) = (nlogba lgk+1n) . CASE 3: f(n) = (nlogba + ), constant > 0, and regularity condition T(n) = (f(n)) . 9

  10. Master Method Quiz T(n) = 4 T(n/2) + n nlogba = n2 n)CASE 1:T(n) = (n2). T(n) = 4 T(n/2) + n2 nlogba = n2 = n2 lg0n)CASE 2:T(n) = (n2lg n). T(n) = 4 T(n/2) + n3 nlogba = n2 n3)CASE 3:T(n) = (n3). T(n) = 4 T(n/2) + n2/lg n Master method does not apply! 10

  11. LECTURE 2 Recurrences (Review) Recurrences (Review) Matrix Multiplication Matrix Multiplication Merge Sort Merge Sort Tableau Construction Tableau Construction Conclusion Conclusion 11

  12. Square-Matrix Multiplication a11a12 a21a22 a1n a2n b11b12 b21b22 b1n b2n c11c12 c21c22 c1n c2n = cnn C ann A bnn B an1an2 bn1bn2 cn1cn2 n cij= aik bkj k = 1 Assume for simplicity that n = 2k. 12

  13. Recursive Matrix Multiplication Divide and conquer C11 C12 A11 A12 B11 B12 = C21 C22 A21 A22 B21 B22 A11B11A11B12 A12B21A12B22 = + A21B11A21B12 A22B21A22B22 8 multiplications of (n/2) (n/2) matrices. 1 addition of n n matrices. 13

  14. Matrix Multiply in Pseudo-Cilk cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } C = A B Absence of type declarations. 14

  15. Matrix Multiply in Pseudo-Cilk cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } C = A B Coarsen base cases for efficiency. 15

  16. Matrix Multiply in Pseudo-Cilk cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } C = A B Also need a row- size argument for array indexing. Submatrices are produced by pointer calculation, not copying of elements. 16

  17. Matrix Multiply in Pseudo-Cilk cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(C22,A21,B12,n/2); spawn Mult(C21,A21,B11,n/2); spawn Mult(T11,A12,B21,n/2); spawn Mult(T12,A12,B22,n/2); spawn Mult(T22,A22,B22,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } C = A B C = C + T cilk void Add(*C, *T, n) { h h base case & partition matrices i i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } 17

  18. Work of Matrix Addition cilk void Add(*C, *T, n) { h hbase case & partition matrices i i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } A1(n) = ? 4A1(n/2) + (1) = (n2) Work: CASE 1 nlogba = nlog24 = n2 (1) . 18

  19. Span of Matrix Addition cilk void Add(*C, *T, n) { h hbase case & partition matrices i i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; } } cilk void Add(*C, *T, n) { h hbase case & partition matrices i i spawn Add(C11,T11,n/2); spawn Add(C12,T12,n/2); spawn Add(C21,T21,n/2); spawn Add(C22,T22,n/2); sync; return; maximum A1(n/2) + (1) = (lg n) Span: A1(n) = ? CASE 2 nlogba = nlog21 = 1 )f(n) = (nlogba lg0n) . 19

  20. Work of Matrix Multiplication cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } 8 8M1(n/2) +A1(n) + (1) = 8M1(n/2) + (n2) = (n3) CASE 1 Work: M1(n) = ? nlogba = nlog28 = n3 (n2) . 20

  21. Span of Matrix Multiplication cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } } cilk void Mult(*C, *A, *B, n) { float *T = Cilk_alloca(n*n*sizeof(float)); h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; 8 M1(n/2) + A1(n) + (1) = M1(n/2) + (lg n) = (lg2 n) CASE 2 Span: M1(n) = ? nlogba = nlog21 = 1 )f(n) = (nlogba lg1n) . 21

  22. Parallelism of Matrix Multiply M1(n) = (n3) M1(n) = (lg2n) Work: Span: M1(n) M1(n) = (n3/lg2n) Parallelism: For 1000 1000 matrices, parallelism (103)3/102 = 107. 22

  23. Stack Temporaries cilk void Mult(*C, *A, *B, n) { h hbase case & partition matrices i i spawn Mult(C11,A11,B11,n/2); spawn Mult(C12,A11,B12,n/2); spawn Mult(T21,A22,B21,n/2); sync; spawn Add(C,T,n); sync; return; } float *T = Cilk_alloca(n*n*sizeof(float)); In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance. IDEA: Trade off parallelism for less storage. 23

  24. No-Temp Matrix Multiplication cilk void MultA(*C, *A, *B, n) { // C = C + A * B h hbase case & partition matrices i i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } Saves space, but at what expense? 24

  25. Work of No-Temp Multiply cilk void MultA(*C, *A, *B, n) { // C = C + A * B h hbase case & partition matrices i i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } 8 M1(n/2) + (1) CASE 1 Work: M1(n) = ? = (n3) 25

  26. Span of No-Temp Multiply cilk void MultA(*C, *A, *B, n) { // C = C + A * B h hbase case & partition matrices i i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; } } M1(n) = ? Span: cilk void MultA(*C, *A, *B, n) { // C = C + A * B h hbase case & partition matrices i i spawn MultA(C11,A11,B11,n/2); spawn MultA(C12,A11,B12,n/2); spawn MultA(C22,A21,B12,n/2); spawn MultA(C21,A21,B11,n/2); sync; spawn MultA(C21,A22,B21,n/2); spawn MultA(C22,A22,B22,n/2); spawn MultA(C12,A12,B22,n/2); spawn MultA(C11,A12,B21,n/2); sync; return; maximum maximum 2 M1(n/2) + (1) = (n) CASE 1 26

  27. Parallelism of No-Temp Multiply M1(n) = (n3) M1(n) = (n) Work: Span: M1(n) M1(n) = (n2) Parallelism: For 1000 1000 matrices, parallelism (103)3/103 = 106. Faster in practice! 27

  28. Testing Synchronization Cilk language feature: A programmer can check whether a Cilk procedure is synched (without actually performing a sync) by testing the pseudovariableSYNCHED: SYNCHED = 0 ) some spawned children might not have returned. SYNCHED = 1 ) all spawned children have definitely returned. 28

29. Best of Both Worlds

cilk void Mult1(*C, *A, *B, n) { // multiply & store
  ⟨⟨base case & partition matrices⟩⟩
  spawn Mult1(C11,A11,B11,n/2); // multiply & store
  spawn Mult1(C12,A11,B12,n/2);
  spawn Mult1(C22,A21,B12,n/2);
  spawn Mult1(C21,A21,B11,n/2);
  if (SYNCHED) {
    spawn MultA1(C11,A12,B21,n/2); // multiply & add
    spawn MultA1(C12,A12,B22,n/2);
    spawn MultA1(C22,A22,B22,n/2);
    spawn MultA1(C21,A22,B21,n/2);
  } else {
    float *T = Cilk_alloca(n*n*sizeof(float));
    spawn Mult1(T11,A12,B21,n/2); // multiply & store
    spawn Mult1(T12,A12,B22,n/2);
    spawn Mult1(T22,A22,B22,n/2);
    spawn Mult1(T21,A22,B21,n/2);
    sync;
    spawn Add(C,T,n); // C = C + T
  }
  sync;
  return;
}

This code is just as parallel as the original, but it only uses more space if runtime parallelism actually exists.

30. Ordinary Matrix Multiplication

c_ij = Σ_{k=1}^{n} a_ik b_kj

IDEA: Spawn n² inner products in parallel. Compute each inner product in parallel.

Work: Θ(n³)
Span: Θ(lg n)
Parallelism: Θ(n³/lg n)

BUT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMPs.
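In serial C, the idea is the classic triple loop (our sketch; in the Cilk version, the two outer loops would spawn the n² inner products, and each n-term sum would itself be computed as a balanced reduction tree, which is where the Θ(lg n) span comes from):

```c
#include <assert.h>

/* One inner product per output entry C[i][j]; matrices stored row-major. */
static void mat_mult(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;    /* inner product of row i of A, column j of B */
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```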

31. LECTURE 2

Recurrences (Review)
Matrix Multiplication
Merge Sort
Tableau Construction
Conclusion

32. Merging Two Sorted Arrays

void Merge(int *C, int *A, int *B, int na, int nb) {
  while (na>0 && nb>0) {
    if (*A <= *B) {
      *C++ = *A++; na--;
    } else {
      *C++ = *B++; nb--;
    }
  }
  while (na>0) { *C++ = *A++; na--; }
  while (nb>0) { *C++ = *B++; nb--; }
}

Time to merge n elements = Θ(n).

[Figure: merging the sorted arrays 3 12 19 46 and 4 14 21 23.]

33. Merge Sort

cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

[Figure: 19 3 12 46 33 4 21 14 merged pairwise, level by level, into 3 4 12 14 19 21 33 46.]
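A serial, runnable C version of the same structure (our sketch: spawn/sync dropped, Cilk_alloca replaced by malloc/free, and Merge copied from the previous slide):

```c
#include <stdlib.h>
#include <assert.h>

/* Merge from the previous slide, made static for this sketch. */
static void Merge(int *C, int *A, int *B, int na, int nb) {
    while (na > 0 && nb > 0) {
        if (*A <= *B) { *C++ = *A++; na--; }
        else          { *C++ = *B++; nb--; }
    }
    while (na > 0) { *C++ = *A++; na--; }
    while (nb > 0) { *C++ = *B++; nb--; }
}

/* Sort A[0..n) into B[0..n), using a temporary C as in the slide. */
static void merge_sort(int *B, int *A, int n) {
    if (n == 1) {
        B[0] = A[0];
    } else {
        int *C = malloc(n * sizeof(int));
        merge_sort(C, A, n/2);                  /* spawn in Cilk */
        merge_sort(C + n/2, A + n/2, n - n/2);  /* spawn in Cilk */
        /* sync here in Cilk, since Merge reads both halves */
        Merge(B, C, C + n/2, n/2, n - n/2);
        free(C);
    }
}
```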

34. Work of Merge Sort

(code of MergeSort as on slide 33)

Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)   [CASE 2]

n^(log_b a) = n^(log_2 2) = n ⇒ f(n) = Θ(n^(log_b a) lg⁰ n).

35. Span of Merge Sort

(code of MergeSort as on slide 33)

Span: T∞(n) = T∞(n/2) + Θ(n)
            = Θ(n)   [CASE 3]

n^(log_b a) = n^(log_2 1) = 1 ≪ Θ(n).

36. Parallelism of Merge Sort

Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T1(n)/T∞(n) = Θ(lg n)   PUNY!

We need to parallelize the merge!

37. Parallel Merge

[Figure: array A (length na) is split at na/2; a binary search in B (length nb) finds the boundary j with B[0..j] ≤ A[na/2]; the elements on each side of the boundary are merged recursively.]

KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n.
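Why 3/4 holds can be checked in a line (our sketch of the standard argument, ignoring rounding; the swap on the next slide guarantees na ≥ nb):

```latex
n_a \ge n_b \implies n_b \le \frac{n}{2}, \qquad
\text{so each recursive merge has size at most }
\frac{n_a}{2} + n_b
= \frac{n_a + n_b}{2} + \frac{n_b}{2}
= \frac{n}{2} + \frac{n_b}{2}
\le \frac{n}{2} + \frac{n}{4}
= \frac{3n}{4}.
```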

38. Parallel Merge

cilk void P_Merge(int *C, int *A, int *B, int na, int nb) {
  if (na < nb) {
    spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    spawn P_Merge(C, A, B, ma, mb);
    spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    sync;
  }
}

Coarsen base cases for efficiency.
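The same logic in serial, runnable C (our sketch: the spawns become calls, binary_search is an assumed stand-in for the slide's BinarySearch that returns how many elements of B are less than x, and the empty-array guard is ours):

```c
#include <assert.h>

/* Number of elements of B[0..nb) that are < x (B sorted ascending). */
static int binary_search(int x, const int *B, int nb) {
    int lo = 0, hi = nb;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (B[mid] < x) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Serial P_Merge: merge sorted A[0..na) and B[0..nb) into C. */
static void p_merge(int *C, const int *A, const int *B, int na, int nb) {
    if (na < nb) {
        p_merge(C, B, A, nb, na);          /* ensure A is the longer array */
    } else if (na == 0) {
        /* nothing to merge (guard added for the empty case) */
    } else if (na == 1) {
        if (nb == 0) {
            C[0] = A[0];
        } else {                            /* one element each: min, then max */
            C[0] = (A[0] < B[0]) ? A[0] : B[0];
            C[1] = (A[0] < B[0]) ? B[0] : A[0];
        }
    } else {
        int ma = na / 2;
        int mb = binary_search(A[ma], B, nb);
        /* the two halves are independent: in Cilk, spawn both, then sync */
        p_merge(C, A, B, ma, mb);
        p_merge(C + ma + mb, A + ma, B + mb, na - ma, nb - mb);
    }
}
```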

39. Span of P_Merge

(code of P_Merge as on slide 38)

Span: T∞(n) ≤ T∞(3n/4) + Θ(lg n)
            = Θ(lg² n)   [CASE 2]

n^(log_b a) = n^(log_{4/3} 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg¹ n).

40. Work of P_Merge

(code of P_Merge as on slide 38)

Work: T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.

CLAIM: T1(n) = Θ(n).   HAIRY!

41. Analysis of Work Recurrence

T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.

Substitution method: Inductive hypothesis is T1(k) ≤ c1 k − c2 lg k, where c1, c2 > 0. Prove that the relation holds, and solve for c1 and c2.

T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n)
      ≤ c1(αn) − c2 lg(αn) + c1((1−α)n) − c2 lg((1−α)n) + Θ(lg n)


43. Analysis of Work Recurrence

T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.

T1(n) = T1(αn) + T1((1−α)n) + Θ(lg n)
      ≤ c1(αn) − c2 lg(αn) + c1((1−α)n) − c2 lg((1−α)n) + Θ(lg n)
      = c1 n − c2 lg(αn) − c2 lg((1−α)n) + Θ(lg n)
      = c1 n − c2 (lg(α(1−α)) + 2 lg n) + Θ(lg n)
      = c1 n − c2 lg n − (c2 (lg n + lg(α(1−α))) − Θ(lg n))
      ≤ c1 n − c2 lg n

by choosing c1 and c2 large enough.

44. Parallelism of P_Merge

Work: T1(n) = Θ(n)
Span: T∞(n) = Θ(lg² n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg² n)

45. Parallel Merge Sort

cilk void P_MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn P_MergeSort(C, A, n/2);
    spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  }
}

46. Work of Parallel Merge Sort

(code of P_MergeSort as on slide 45)

Work: T1(n) = 2 T1(n/2) + Θ(n)
            = Θ(n lg n)   [CASE 2]

47. Span of Parallel Merge Sort

(code of P_MergeSort as on slide 45)

Span: T∞(n) = T∞(n/2) + Θ(lg² n)
            = Θ(lg³ n)   [CASE 2]

n^(log_b a) = n^(log_2 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg² n).

48. Parallelism of Merge Sort

Work: T1(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg³ n)
Parallelism: T1(n)/T∞(n) = Θ(n/lg² n)

49. LECTURE 2

Recurrences (Review)
Matrix Multiplication
Merge Sort
Tableau Construction
Conclusion

50. Tableau Construction

Problem: Fill in an n × n tableau A, where
A[i, j] = f(A[i, j−1], A[i−1, j], A[i−1, j−1]).

Dynamic programming:
  Longest common subsequence
  Edit distance
  Time warping

Work: Θ(n²).

[Figure: an 8 × 8 grid of tableau cells (i, j), from 00 to 77.]
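As one concrete instance from the slide's list, longest common subsequence fills exactly such a tableau: each cell depends on its left, upper, and upper-left neighbors, so all cells on an antidiagonal are independent and could be filled in parallel. A serial sketch (the 64-entry cap is our assumption, not part of the lecture):

```c
#include <string.h>
#include <assert.h>

/* Length of the longest common subsequence of x and y, via the tableau
 * recurrence: T[i][j] depends on T[i][j-1], T[i-1][j], T[i-1][j-1].
 * Assumes both strings are shorter than 64 characters. */
static int lcs_length(const char *x, const char *y) {
    int m = (int) strlen(x), n = (int) strlen(y);
    static int T[64][64];
    memset(T, 0, sizeof T);        /* row 0 and column 0 are all zero */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            if (x[i-1] == y[j-1])
                T[i][j] = T[i-1][j-1] + 1;         /* upper-left neighbor */
            else
                T[i][j] = (T[i][j-1] > T[i-1][j])  /* left vs. upper */
                        ? T[i][j-1] : T[i-1][j];
        }
    return T[m][n];
}
```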

