Overview of Sparse Linear Solvers and Gaussian Elimination

 
L
e
c
t
u
r
e
 
3
S
p
a
r
s
e
 
D
i
r
e
c
t
 
M
e
t
h
o
d
:
 
C
o
m
b
i
n
a
t
o
r
i
c
s
 
Xiaoye Sherry Li
Lawrence Berkeley National Laboratory
, USA
xsli@lbl.gov
 
crd-legacy.lbl.gov/~xiaoye/G2S3/
 
4
th
 Gene Golub SIAM Summer School, 7/22 – 8/7, 2013, Shanghai
 
L
e
c
t
u
r
e
 
o
u
t
l
i
n
e
 
Linear solvers: direct, iterative, hybrid
Gaussian elimination
Sparse Gaussian elimination: elimination graph, elimination tree
Symbolic factorization, ordering, graph traversal
only integers, no FP involved
 
2
 
Solving a system of linear equations Ax = b
Sparse:  many zeros in A;  worth special treatment
Iterative methods (CG, GMRES, …)
A is not changed (read-only)
Key kernel: sparse matrix-vector multiply
Easier to optimize and parallelize
Low algorithmic complexity, but may not converge
Direct methods
A is modified (factorized)
Harder to optimize and parallelize
Numerically robust, but higher algorithmic complexity
Often use direct method (factorization) to 
precondition
 iterative method
Solve an easy system: M
-1
Ax = M
-1
b
S
t
r
a
t
e
g
i
e
s
 
o
f
 
s
p
a
r
s
e
 
l
i
n
e
a
r
 
s
o
l
v
e
r
s
3
 
G
a
u
s
s
i
a
n
 
E
l
i
m
i
n
a
t
i
o
n
 
(
G
E
)
 
Solving a system of linear equations Ax = b
 
First step of GE
 
 
 
 
 
Repeat GE on C
Result in LU factorization (A = LU)
L lower triangular with unit diagonal, U upper triangular
 
Then, x is obtained by solving two triangular systems with L
and U
 
 
4
 
 
Numerical Stability: Need for Pivoting
 
One step of GE:
 
 
 
 
 
If 
α
 small, some entries in B may be lost from addition
 
Pivoting
: swap the current diagonal with a larger entry from
the other part of the matrix
 
Goal: control element growth (pivot growth) in L & U
 
 
5
 
 
 
S
p
a
r
s
e
 
G
E
 
Goal: Store only nonzeros and perform operations only on
nonzeros
Scalar algorithm: 3 nested loops
Can re-arrange loops to get different variants: left-looking, right-
looking, . . .
Fill:
 
new nonzeros in factor
 
 
 
 
 
 
 
 
Typical fill-ratio: 10x for 2D problems, 30-50x for 3D problems
 
6
for 
i = 1 to n
    A(:
,j) = A(:,j) / A(j,j)   % 
cdiv
(j)
 col_scale
    
for
 k = i+1 to n  s.t.  A(i,k) != 0
       
for
 j = i+1 to n  s.t.  A(j,i) != 0
           A(j,k) = A(j,k) - A(j,i) * A(i,k)
U
s
e
f
u
l
 
t
o
o
l
 
t
o
 
d
i
s
c
o
v
e
r
 
f
i
l
l
:
 
R
e
a
c
h
a
b
l
e
 
S
e
t
 
Given certain elimination order (x
1
, x
2
, . . ., x
n
), how do you determine
the fill-ins 
using original graph of A
 ?
A
n
 
i
m
p
l
i
c
i
t
 
e
l
i
m
i
n
a
t
i
o
n
 
m
o
d
e
l
Definition: Let S be a subset of the node set. The reachable set of y
through S is:
 
Reach(y, 
S
) = { x | there exists a directed path (y,
v
1
,…v
k
, x), v
i
 in 
S
}
 
Fill-p
ath theorem
 [Rose/Tarjan 
78] (general case):
 
Let G(A) = (V,E) be a directed graph of A, then an edge (v,w) exists in
the filled graph G
+
(A) if and only if
 
 
 
G
+
(
A
)
 
=
 
g
r
a
p
h
 
o
f
 
t
h
e
 
{
L
,
U
}
 
f
a
c
t
o
r
s
8
C
o
n
c
e
p
t
 
o
f
 
r
e
a
c
h
a
b
l
e
 
s
e
t
,
 
f
i
l
l
-
p
a
t
h
Edge (x,y) exists in filled graph G
+
 due to the path: 
x 
 7 
 3 
 9 
 y
Finding fill-ins 

  finding transitive closure of G(A)
 
S
p
a
r
s
e
 
C
o
l
u
m
n
 
C
h
o
l
e
s
k
y
 
F
a
c
t
o
r
i
z
a
t
i
o
n
 
L
L
T
 
for
 j = 1 : n
 
   L(j:n, j) = A(j:n, j);
   
for
 k < j 
with
 L(j, k) 
nonzero
      
% sparse cmod(j,k)
      L(j:n, j) = L(j:n, j) – L(j, k) * L(j:n, k);
   end
;
 
   
% sparse cdiv(j)
   L(j, j) = sqrt(L(j, j));
   L(j+1:n, j) = L(j+1:n, j) / L(j, j);
 
end
;
 
Column j of A becomes column j of L
 
Fill-path theorem 
[George 
80] (symmetric case)
     After x
1
, …, x
i
 are eliminated, the set of nodes adjacent to y in the
     elimination graph is given by 
Reach(y, {x
1
, …, x
i
})
, x
i
<min(x,y)
 
E
l
i
m
i
n
a
t
i
o
n
 
T
r
e
e
 
Cholesky factor L
 
G
+
(A)
 
T(A)
 
T
(
A
)
 
:
 
 
 
p
a
r
e
n
t
(
j
)
 
=
 
m
i
n
 
{
 
i
 
>
 
j
 
:
 
(
i
,
 
j
)
 
i
n
 
G
+
(
A
)
 
}
 
parent(col j) = first nonzero row below diagonal in L
 
T describes dependencies among columns of factor
C
a
n
 
c
o
m
p
u
t
e
 
G
+
(
A
)
 
e
a
s
i
l
y
 
f
r
o
m
 
T
Can compute T from G(A) in almost linear time
 
11
 
S
S
y
y
m
m
b
b
o
o
l
l
i
i
c
c
 
 
F
F
a
a
c
c
t
t
o
o
r
r
i
i
z
z
a
a
t
t
i
i
o
o
n
n
 
p
p
r
r
e
e
c
c
u
u
r
r
s
s
o
o
r
r
 
 
t
t
o
o
 
 
n
n
u
u
m
m
e
e
r
r
i
i
c
c
a
a
l
l
 
 
f
f
a
a
c
c
t
t
o
o
r
r
i
i
z
z
a
a
t
t
i
i
o
o
n
n
Elimination tree
Nonzero counts
Supernodes
Nonzero structure of {L, U}
 
Cholesky 
[Davis’06 book, George/Liu’81 book]
Use elimination graph of L and its transitive reduction (elimination tree)
Complexity linear in output: O(nnz(L))
LU
Use elimination graphs of L & U and their transitive reductions
(elimination DAGs) 
[Tarjan/Rose `78, Gilbert/Liu `93, Gilbert `94]
Improved by symmetric structure pruning 
[Eisenstat/Liu `92]
Improved by supernodes
Complexity greater than nnz(L+U), but much smaller than flops(LU)
 
 
C
a
n
 
w
e
 
r
e
d
u
c
e
 
f
i
l
l
?
 
Reordering, permutation
 
12
 
Fill-in in Sparse GE
 
Original zero entry A
ij
 
becomes nonzero in L or U
Red: fill-ins
 
Natural order: NNZ = 233
                 
Min. Degree order: NNZ = 207
 
13
 
Ordering : Minimum Degree (1/3)
 
Graph game
:
 
i     j     k     l
 
1
 
i
 
j
 
k
 
l
 
14
 
Maximum fill: 
all the edges between neighboring vertices (“clique”)
 
Minimum Degree Ordering (2/3)
 
Greedy approach: 
do the best locally
Best for modest size problems
Hard to parallelize
 
At each step
 Eliminate the vertex with the smallest degree
 Update degrees of the neighbors
 
Straightforward implementation is slow and requires too much memory
Newly added edges are more than eliminated vertices
 
15
 
Minimum Degree Ordering (3/3)
 
Use 
quotient graph (QG) 
as a compact representation
[George/Liu 
78]
Collection of cliques resulting from the eliminated vertices
affects the degree of an uneliminated vertex
Represent each connected component in the eliminated
subgraph by a single 
supervertex
Storage required to implement QG model is bounded by
size of A
Large body of literature on implementation variants
Tinney/Walker `67, George/Liu `79, Liu `85,
Amestoy/Davis/Duff  `94, Ashcraft `95, Duff/Reid `95, et al.,
. .
 
Extended the QG model to nonsymmetric using bipartite
graph [Amestoy/Li/Ng `07]
 
 
16
 
Ordering : Nested Dissection
 
Model problem: discretized system Ax = b from certain
PDEs, e.g., 5-point stencil on  k x k grid,  n = k
2
Recall fill-path theorem:
     After x
1
, …, x
i
 are eliminated, the set of nodes adjacent to y in the
     elimination graph is given by 
Reach(y, {x
1
, …, x
i
})
, x
i
<min(x,y)
 
17
 
N
D
 
o
r
d
e
r
i
n
g
:
 
r
e
c
u
r
s
i
v
e
 
a
p
p
l
i
c
a
t
i
o
n
 
o
f
 
b
i
s
e
c
t
i
o
n
 
18
 
ND gives a separator
tree (i.e elimination tree)
19
N
D
 
a
n
a
l
y
s
i
s
 
o
n
 
a
 
s
q
u
a
r
e
 
g
r
i
d
 
(
k
 
x
 
k
 
=
 
n
)
Theorem 
[George 
73, Hoffman/Martin/Ross]
: ND
 
ordering gave
optimal complexity in exact factorization.
    Proof:
A
p
p
l
y
 
N
D
 
b
y
 
a
 
s
e
q
u
e
n
c
e
 
o
f
 
 
+
 
s
e
p
a
r
a
t
o
r
s
B
y
 
r
e
a
c
h
a
b
l
e
 
s
e
t
 
a
r
g
u
m
e
n
t
,
 
a
l
l
 
t
h
e
 
s
e
p
a
r
a
t
o
r
s
 
a
r
e
 
e
s
s
e
n
t
i
a
l
l
y
 
d
e
n
s
e
s
u
b
m
a
t
r
i
c
e
s
F
i
l
l
-
i
n
 
e
s
t
i
m
a
t
i
o
n
:
 
a
d
d
 
u
p
 
t
h
e
 
n
o
n
z
e
r
o
s
 
i
n
 
t
h
e
 
s
e
p
a
r
a
t
o
r
s
 
C
o
m
p
l
e
x
i
t
y
 
o
f
 
d
i
r
e
c
t
 
m
e
t
h
o
d
s
 
Time and
space to solve
any problem
on any well-
shaped finite
element mesh
ND Ordering: generalization
 
Generalized nested dissection
 [Lipton/Rose/Tarjan 
79]
Global graph partitioning: top-down, divide-and-conqure
First level
 
 
 
Recurse on A and B
Goal: find the smallest possible separator S at each level
Multilevel schemes:
(Par)
Metis [Karypis/Kumar `95], Chaco
[Hendrickson/Leland `94], (PT-)Scotch [Pellegrini et
al.`07]
Spectral bisection [Simon et al. `90-`95]
Geometric and spectral bisection [Chan/Gilbert/Teng `94]
21
 
ND Ordering
 
 
22
 
 
C
M
 
/
 
R
C
M
 
O
r
d
e
r
i
n
g
 
Cuthill-McKee, Reverse Cuthill-McKee
Reduce bandwidth
Construct 
level sets 
via 
breadth-first search, 
start
 
from the vertex of
minimum degree
At any level, priority is given to a vertex with smaller number of
neighbors
 
 
 
 
 
 
 
RCM: Simply reverse the ordering found by CM
 
 
 
 
 
 
 
23
 
[Duff, Erisman, Reid]
 
RCM good for envelop (profile) Solver
(also good for SpMV)
 
Define bandwidth for each row  or column
Data structure a little more sophisticated than band solver, but simpler
than general sparse solver
 
Use Skyline storage (SKS)
Lower triangle stored row by row
 
Upper triangle stored column by column
In each row (column), first nonzero
 
defines a profile
All entries within the profile
 
(some may be zeros) are stored
All the fill is contained inside the profile
 
A good ordering would be based on bandwidth reduction
E.g., Reverse Cuthill-McKee
 
24
 
Envelop (profile) solver (2/2)
 
Lemma: env(L+U) = env(A)
No more fill-ins generated outside the envelop!
 
 
Inductive proof:  After N-1 steps,
 
 
25
 
Envelop vs. general solvers
 
Example: 3 orderings (natural, RCM, MD)
Envelop size = sum of bandwidths
 
26
 
Env = 31775
 
Env = 22320
 
Env = 61066
NNZ(L, MD) = 12259
 
27
 
O
r
d
e
r
i
n
g
 
f
o
r
 
u
n
s
y
m
m
e
t
r
i
c
 
L
U
 
 
s
y
m
m
e
t
r
i
z
a
t
i
o
n
 
Can use a symmetric ordering on a symmetrized matrix . . .
 
Case of partial pivoting (sequential SuperLU):
 
Use ordering based on 
A
T
A
If  R
T
R = A
T
A and PA = LU, then for any row permutation P,
      struct(L+U) 
 struct(R
T
+R)  [George/Ng `87]
Making R sparse tends to make L & U sparse . . .
 
Case of diagonal pivoting (static pivoting in SuperLU_DIST):
 
Use ordering based on 
A
T
+A
If  R
T
R = A
T
+A and A = LU, then struct(L+U) 
 struct(R
T
+R)
Making R sparse tends to make L & U sparse . . .
28
U
n
s
y
m
m
e
t
r
i
c
 
v
a
r
i
a
n
t
 
o
f
 
M
i
n
 
D
e
g
r
e
e
 
o
r
d
e
r
i
n
g
(
M
a
r
k
o
w
i
t
z
 
s
c
h
e
m
e
)
29
R
e
s
u
l
t
s
 
o
f
 
M
a
r
k
o
w
i
t
z
 
o
r
d
e
r
i
n
g
 
[
A
m
e
s
t
o
y
/
L
i
/
N
g
0
2
]
 
Extend the QG model to bipartite quotient graph
Same asymptotic complexity as symmetric MD
S
p
a
c
e
 
i
s
 
b
o
u
n
d
e
d
 
b
y
 
2
*
(
m
 
+
 
n
)
T
i
m
e
 
i
s
 
b
o
u
n
d
e
d
 
b
y
 
O
(
n
 
*
 
m
)
 
For 50+ unsym. matrices, compared with MD on A
+A:
R
e
d
u
c
t
i
o
n
 
i
n
 
f
i
l
l
:
 
a
v
e
r
a
g
e
 
0
.
8
8
,
 
b
e
s
t
 
0
.
3
8
R
e
d
u
c
t
i
o
n
 
i
n
 
F
P
 
o
p
e
r
a
t
i
o
n
s
:
 
a
v
e
r
a
g
e
 
0
.
7
7
,
 
b
e
s
t
 
0
.
0
1
 
How about graph partitioning for unsymmetric LU?
H
y
p
e
r
g
r
a
p
h
 
p
a
r
t
i
t
i
o
n
 
[
B
o
m
a
n
,
 
G
r
i
g
o
r
i
,
 
e
t
 
a
l
.
 
`
0
8
]
S
i
m
i
l
a
r
 
t
o
 
N
D
 
o
n
 
A
T
A
,
 
b
u
t
 
n
o
 
n
e
e
d
 
t
o
 
c
o
m
p
u
t
e
 
A
T
A
 
Remark: Dense vs. Sparse GE
 
Dense GE:    P
r
 A P
c
 = LU
P
r 
and P
c
 are permutations chosen to maintain stability
Partial pivoting suffices in most cases :  P
r 
A = LU
 
Sparse GE:   P
r
 A P
c
 = LU
P
r 
and P
c
 are chosen to maintain stability, 
preserve sparsity, increase
parallelism
Dynamic pivoting causes dynamic structural change
A
l
t
e
r
n
a
t
i
v
e
s
:
 
t
h
r
e
s
h
o
l
d
 
p
i
v
o
t
i
n
g
,
 
s
t
a
t
i
c
 
p
i
v
o
t
i
n
g
,
 
.
 
.
 
.
 
 
 
30
31
N
u
m
e
r
i
c
a
l
 
P
i
v
o
t
i
n
g
 
Goal of pivoting is to control element growth in L & U for stability
For sparse factorizations, often relax the pivoting rule to trade with better
sparsity and parallelism (e.g., threshold pivoting, static pivoting , . . .)
 
Partial pivoting 
used in sequential SuperLU  and SuperLU_MT (GEPP)
Can force diagonal pivoting (controlled by diagonal
 
threshold)
Hard to implement 
scalably
 for sparse factorization
 
Static pivoting 
used in SuperLU_DIST (GESP)
Before factor, scale and permute A to maximize diagonal: P
r 
D
r
 A D
c
 = A
During factor A
 = LU, replace tiny pivots by 
  
       , without changing data
structures for L & U
If needed, use a few steps of iterative refinement after the first solution
 quite stable in practice
 
U
s
e
 
m
a
n
y
 
h
e
u
r
i
s
t
i
c
s
 
Finding an optimal fill-reducing ordering is NP-complete 
 use
heuristics:
Local approach: Minimum degree
Global approach: Nested dissection 
(optimal in special case)
, RCM
Hybrid: First permute the matrix globally to confine the fill-in, then
reorder small parts using local heuristics
L
o
c
a
l
 
m
e
t
h
o
d
s
 
e
f
f
e
c
t
i
v
e
 
f
o
r
 
s
m
a
l
l
e
r
 
g
r
a
p
h
,
 
g
l
o
b
a
l
 
m
e
t
h
o
d
s
e
f
f
e
c
t
i
v
e
 
f
o
r
 
l
a
r
g
e
r
 
g
r
a
p
h
 
Numerical pivoting: trade-off stability with sparsity and parallelism
Partial pivoting too restrictive
Threshold pivoting
Static pivoting
 
 
32
 
Algorithmic phases in sparse GE
 
1.
Minimize number of fill-ins, maximize parallelism
Sparsity structure of L & U depends on that of A, which can be changed by
row/column permutations (vertex re-labeling of the underlying graph)
Ordering
 (combinatorial algorithms; 
NP-complete
 to find optimum
[Yannakis 
83]; use heuristics)
 
2.
Predict the fill-in positions in L & U
Symbolic factorization
 (combinatorial algorithms)
 
3.
Design efficient data structure for storage and quick retrieval of the
nonzeros
Compressed storage schemes
 
4.
Perform factorization and triangular solutions
Numerical algorithms
 (F.P. operations only on nonzeros)
Usually dominate the total runtime
 
For sparse Cholesky and QR, the steps can be separate;
     for sparse LU with pivoting, steps 2 and 4 my be interleaved.
 
33
 
R
e
f
e
r
e
n
c
e
s
 
T. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006. (book)
A. George and J. Liu, 
Computer Solution of Large Sparse Positive Definite
Systems
, Prentice Hall, 1981. (book)
I. Duff, I. Erisman and J. Reid, 
Direct Methods for Sparse Matrices
, Oxford
University Press, 1986. (book)
C. Chevalier, F. Pellegrini, “PT-Scotch”, Parallel Computing, 34(6-8), 318-331,
2008. http://www.labri.fr/perso/pelegrin/scotch/
G. Karypis, K. Schloegel, V. Kumar, ParMetis: Parallel Graph Partitioning and
Sparse Matrix Ordering Library”, University of Minnesota. http://www-
users.cs.umn.edu/~karypis/metis/parmetis/
E. Boman, K. Devine, et al., “Zoltan, Parallel Partitioning, Load Balancing and
Data-Management Services”, Sandia National Laboratories.
      http://www.cs.sandia.gov/Zoltan/
J. Gilbert, “Predicting structures in sparse matrix computations”, SIAM. J.
Matrix Anal. & App, 
15(1), 62–79, 1994.
J.W.H. Liu, “Modification of the minimum degree algorithm by multiple
elimination”, ACM Trans. Math. Software, Vol. 11, 141-153, 1985.
T.A. Davis, J.R. Gilbert, S. Larimore, E. Ng, “A column approximate minimum
degree ordering algorithm”, ACM Trans. Math. Software, 30 (3), 353-376, 2004
P. Amestoy, X.S. Li and E. Ng, “Diagonal Markowitz Scheme with Local
Symmetrization”, SIAM J. Matrix Anal. Appl., Vol. 29, No. 1, pp. 228-244,
2007.
 
34
 
E
x
e
r
c
i
s
e
s
 
Homework3 in Hands-On-Exercises/ directory
Show that:
     If  R
T
R = A
T
+A and A = LU, then struct(L+U)  struct(R
T
+R)
Show that: 
[George/Ng `87]
     If  R
T
R = A
T
A and PA = LU, then for any row permutation P,
     struct(L+U)  struct(R
T
+R)
 
35
Slide Note

July 13, TNLIST, Tsinghua Univ.

Embed
Share

Exploring Sparse Linear Solvers and Gaussian Elimination methods in solving systems of linear equations, emphasizing strategies, numerical stability considerations, and the unique approach of Sparse Gaussian Elimination. Topics include iterative and direct methods, factorization, matrix-vector multiplication, and pivoting for numerical stability. The lecture outlines key concepts and techniques for efficient and accurate computation of sparse matrices in linear algebra applications.

  • Sparse Linear Solvers
  • Gaussian Elimination
  • Iterative Methods
  • Direct Methods
  • Numerical Stability
  • Combinatorics
  • Sparse Gaussian Elimination

Uploaded on Sep 09, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. 4th Gene Golub SIAM Summer School, 7/22 8/7, 2013, Shanghai Lecture 3 Sparse Direct Method: Combinatorics Xiaoye Sherry Li Lawrence Berkeley National Laboratory, USA xsli@lbl.gov crd-legacy.lbl.gov/~xiaoye/G2S3/

  2. Lecture outline Linear solvers: direct, iterative, hybrid Gaussian elimination Sparse Gaussian elimination: elimination graph, elimination tree Symbolic factorization, ordering, graph traversal only integers, no FP involved 2

  3. Strategies of sparse linear solvers Solving a system of linear equations Ax = b Sparse: many zeros in A; worth special treatment Iterative methods (CG, GMRES, ) A is not changed (read-only) Key kernel: sparse matrix-vector multiply Easier to optimize and parallelize Low algorithmic complexity, but may not converge Direct methods A is modified (factorized) Harder to optimize and parallelize Numerically robust, but higher algorithmic complexity Often use direct method (factorization) to precondition iterative method Solve an easy system: M-1Ax = M-1b 3

  4. Gaussian Elimination (GE) Solving a system of linear equations Ax = b First step of GE 1 0 w w T T = = A / v I 0 v B C v w T = C B Repeat GE on C Result in LU factorization (A = LU) L lower triangular with unit diagonal, U upper triangular Then, x is obtained by solving two triangular systems with L and U 4

  5. Numerical Stability: Need for Pivoting One step of GE: 1 0 w w T T = = A / v I 0 v B C v w T B C = If small, some entries in B may be lost from addition Pivoting: swap the current diagonal with a larger entry from the other part of the matrix Goal: control element growth (pivot growth) in L & U 5

  6. Sparse GE Goal: Store only nonzeros and perform operations only on nonzeros Scalar algorithm: 3 nested loops Can re-arrange loops to get different variants: left-looking, right- looking, . . . Fill: new nonzeros in factor U for i = 1 to n A(:,j) = A(:,j) / A(j,j) % cdiv(j) col_scale for k = i+1 to n s.t. A(i,k) != 0 for j = i+1 to n s.t. A(j,i) != 0 A(j,k) = A(j,k) - A(j,i) * A(i,k) 1 2 3 4 L 5 6 7 Typical fill-ratio: 10x for 2D problems, 30-50x for 3D problems 6

  7. Useful tool to discover fill: Reachable Set Given certain elimination order (x1, x2, . . ., xn), how do you determine the fill-ins using original graph of A ? An implicit elimination model Definition: Let S be a subset of the node set. The reachable set of y through S is: Reach(y, S) = { x | there exists a directed path (y,v1, vk, x), vi in S} Fill-path theorem [Rose/Tarjan 78] (general case): Let G(A) = (V,E) be a directed graph of A, then an edge (v,w) exists in the filled graph G+(A) if and only if w Reach(v, {v1, vk}), where, vi<min(v,w), 1 i k G+(A) = graph of the {L,U} factors

  8. Concept of reachable set, fill-path 3 + + 7 + o + 9 + + y x + o o Edge (x,y) exists in filled graph G+ due to the path: x 7 3 9 y Finding fill-ins finding transitive closure of G(A) 8

  9. Sparse Column Cholesky Factorization LLT for j = 1 : n Column j of A becomes column j of L j L(j:n, j) = A(j:n, j); for k < j with L(j, k) nonzero % sparse cmod(j,k) L(j:n, j) = L(j:n, j) L(j, k) * L(j:n, k); end; LT L A % sparse cdiv(j) L(j, j) = sqrt(L(j, j)); L(j+1:n, j) = L(j+1:n, j) / L(j, j); L end; Fill-path theorem [George 80] (symmetric case) After x1, , xi are eliminated, the set of nodes adjacent to y in the elimination graph is given by Reach(y, {x1, , xi}), xi<min(x,y)

  10. Elimination Tree 3 10 7 1 9 5 6 8 4 8 2 4 10 7 3 6 9 2 1 5 G+(A) T(A) Cholesky factor L T(A) : parent(j) = min { i > j : (i, j) in G+(A) } parent(col j) = first nonzero row below diagonal in L T describes dependencies among columns of factor Can compute G+(A) easily from T Can compute T from G(A) in almost linear time

  11. Symbolic Factorization precursor to numerical factorization Elimination tree Nonzero counts Supernodes Nonzero structure of {L, U} Cholesky [Davis 06 book, George/Liu 81 book] Use elimination graph of L and its transitive reduction (elimination tree) Complexity linear in output: O(nnz(L)) LU Use elimination graphs of L & U and their transitive reductions (elimination DAGs) [Tarjan/Rose `78, Gilbert/Liu `93, Gilbert `94] Improved by symmetric structure pruning [Eisenstat/Liu `92] Improved by supernodes Complexity greater than nnz(L+U), but much smaller than flops(LU) 11

  12. Can we reduce fill? Reordering, permutation 1 2 3 4 5 2 2 3 4 5 3 (all filled after elimination) 4 5 1 2 3 4 5 2 2 3 4 5 5 5 4 3 2 1 1 1 1 1 4 = 1 3 1 3 1 1 4 2 1 1 5 5 4 3 2 (no fill after elimination) 12

  13. Fill-in in Sparse GE Original zero entry Aij becomes nonzero in L or U Red: fill-ins Natural order: NNZ = 233 Min. Degree order: NNZ = 207 13

  14. Ordering : Minimum Degree (1/3) Graph game: i j k l i j k l 1 1 x x x x x x x x x x i i x x Eliminate 1 j j x x k k x x l l x x i j i Eliminate 1 j 1 k l k l Maximum fill: all the edges between neighboring vertices ( clique ) 14

  15. Minimum Degree Ordering (2/3) Greedy approach: do the best locally Best for modest size problems Hard to parallelize At each step Eliminate the vertex with the smallest degree Update degrees of the neighbors Straightforward implementation is slow and requires too much memory Newly added edges are more than eliminated vertices 15

  16. Minimum Degree Ordering (3/3) Use quotient graph (QG) as a compact representation [George/Liu 78] Collection of cliques resulting from the eliminated vertices affects the degree of an uneliminated vertex Represent each connected component in the eliminated subgraph by a single supervertex Storage required to implement QG model is bounded by size of A Large body of literature on implementation variants Tinney/Walker `67, George/Liu `79, Liu `85, Amestoy/Davis/Duff `94, Ashcraft `95, Duff/Reid `95, et al., . . Extended the QG model to nonsymmetric using bipartite graph [Amestoy/Li/Ng `07] 16

  17. Ordering : Nested Dissection Model problem: discretized system Ax = b from certain PDEs, e.g., 5-point stencil on k x k grid, n = k2 Recall fill-path theorem: After x1, , xi are eliminated, the set of nodes adjacent to y in the elimination graph is given by Reach(y, {x1, , xi}), xi<min(x,y) 17

  18. ND ordering: recursive application of bisection ND gives a separator tree (i.e elimination tree) 43-49 19-21 40-42 7-9 16-18 28-30 37-39 18

  19. ND analysis on a square grid (k x k = n) Theorem [George 73, Hoffman/Martin/Ross]: ND ordering gave optimal complexity in exact factorization. Proof: Apply ND by a sequence of + separators By reachable set argument, all the separators are essentially dense submatrices Fill-in estimation: add up the nonzeros in the separators k2+4(k/2)2+42(k /4)2+ =O(k2log2k)=O(nlog2n) ( more precisely: 31/4(k2log2k)+O(k2) ) 1 2 6 7 5 10 4 3 8 9 Similarly: Operation count: O(k3)=O(n3/2) 21 16 17 11 12 15 20 18 19 13 14 19

  20. Complexity of direct methods Time and space to solve any problem on any well- shaped finite element mesh n1/2 n1/3 2D 3D Space (fill): O(n log n) O(n 4/3 ) Time (flops): O(n 3/2 ) O(n 2 )

  21. ND Ordering: generalization Generalized nested dissection [Lipton/Rose/Tarjan 79] Global graph partitioning: top-down, divide-and-conqure First level 0 A 0 x B x A S B x x S Recurse on A and B Goal: find the smallest possible separator S at each level Multilevel schemes: (Par)Metis [Karypis/Kumar `95], Chaco [Hendrickson/Leland `94], (PT-)Scotch [Pellegrini et al.`07] Spectral bisection [Simon et al. `90-`95] Geometric and spectral bisection [Chan/Gilbert/Teng `94] 21

  22. ND Ordering 22

  23. CM / RCM Ordering Cuthill-McKee, Reverse Cuthill-McKee Reduce bandwidth Construct level sets via breadth-first search, start from the vertex of minimum degree At any level, priority is given to a vertex with smaller number of neighbors [Duff, Erisman, Reid] RCM: Simply reverse the ordering found by CM 23

  24. RCM good for envelop (profile) Solver (also good for SpMV) Define bandwidth for each row or column Data structure a little more sophisticated than band solver, but simpler than general sparse solver Use Skyline storage (SKS) Lower triangle stored row by row Upper triangle stored column by column In each row (column), first nonzero defines a profile All entries within the profile (some may be zeros) are stored All the fill is contained inside the profile A good ordering would be based on bandwidth reduction E.g., Reverse Cuthill-McKee 24

  25. Envelop (profile) solver (2/2) Lemma: env(L+U) = env(A) No more fill-ins generated outside the envelop! Inductive proof: After N-1 steps, L A w U w 1 1 1 1 = = = , t s . . A A L U 1 1 1 T 1 v v s t T 1 Then, = solve first , nonzero position of w is 1 the same as w L w w 1 1 = solve U first , nonzero position of v the is same as v v v T 1 1 1 25

  26. Envelop vs. general solvers Example: 3 orderings (natural, RCM, MD) Envelop size = sum of bandwidths Env = 61066 NNZ(L, MD) = 12259 Env = 31775 Env = 22320 26

  27. Ordering for unsymmetric LU symmetrization Can use a symmetric ordering on a symmetrized matrix . . . Case of partial pivoting (sequential SuperLU): Use ordering based on ATA If RTR = ATA and PA = LU, then for any row permutation P, struct(L+U) struct(RT+R) [George/Ng `87] Making R sparse tends to make L & U sparse . . . Case of diagonal pivoting (static pivoting in SuperLU_DIST): Use ordering based on AT+A If RTR = AT+A and A = LU, then struct(L+U) struct(RT+R) Making R sparse tends to make L & U sparse . . . 27

  28. Unsymmetric variant of Min Degree ordering (Markowitz scheme) c1 c2 c3 c1 c2 c3 1 1 r1 r1 Eliminate 1 r2 r2 Bipartite graph After a vertex is eliminated, all the row & column vertices adjacent to it become fully connected bi-clique (assuming diagonal pivot) The edges of the bi-clique are the potential fill-ins (upper bound !) 1 1 r1 c1 c1 Eliminate 1 r1 c2 c2 r2 r2 c3 c3 28

  29. Results of Markowitz ordering [Amestoy/Li/Ng02] Extend the QG model to bipartite quotient graph Same asymptotic complexity as symmetric MD Space is bounded by 2*(m + n) Time is bounded by O(n * m) For 50+ unsym. matrices, compared with MD on A +A: Reduction in fill: average 0.88, best 0.38 Reduction in FP operations: average 0.77, best 0.01 How about graph partitioning for unsymmetric LU? Hypergraph partition [Boman, Grigori, et al. `08] Similar to ND on ATA, but no need to compute ATA 29

  30. Remark: Dense vs. Sparse GE Dense GE: Pr A Pc = LU Pr and Pc are permutations chosen to maintain stability Partial pivoting suffices in most cases : Pr A = LU Sparse GE: Pr A Pc = LU Pr and Pc are chosen to maintain stability, preserve sparsity, increase parallelism Dynamic pivoting causes dynamic structural change Alternatives: threshold pivoting, static pivoting, . . . 30

  31. Numerical Pivoting Goal of pivoting is to control element growth in L & U for stability For sparse factorizations, often relax the pivoting rule to trade with better sparsity and parallelism (e.g., threshold pivoting, static pivoting , . . .) Partial pivoting used in sequential SuperLU and SuperLU_MT (GEPP) Can force diagonal pivoting (controlled by diagonal threshold) Hard to implement scalably for sparse factorization x s x x b x x x Static pivoting used in SuperLU_DIST (GESP) Before factor, scale and permute A to maximize diagonal: Pr Dr A Dc = A During factor A = LU, replace tiny pivots by structures for L & U If needed, use a few steps of iterative refinement after the first solution quite stable in practice , without changing data A 31

  32. Use many heuristics Finding an optimal fill-reducing ordering is NP-complete use heuristics: Local approach: Minimum degree Global approach: Nested dissection (optimal in special case), RCM Hybrid: First permute the matrix globally to confine the fill-in, then reorder small parts using local heuristics Local methods effective for smaller graph, global methods effective for larger graph Numerical pivoting: trade-off stability with sparsity and parallelism Partial pivoting too restrictive Threshold pivoting Static pivoting 32

  33. Algorithmic phases in sparse GE 1. Minimize number of fill-ins, maximize parallelism Sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex re-labeling of the underlying graph) Ordering (combinatorial algorithms; NP-complete to find optimum [Yannakis 83]; use heuristics) 2. Predict the fill-in positions in L & U Symbolic factorization (combinatorial algorithms) 3. Design efficient data structure for storage and quick retrieval of the nonzeros Compressed storage schemes 4. Perform factorization and triangular solutions Numerical algorithms (F.P. operations only on nonzeros) Usually dominate the total runtime For sparse Cholesky and QR, the steps can be separate; for sparse LU with pivoting, steps 2 and 4 my be interleaved. 33

  34. References T. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006. (book) A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall, 1981. (book) I. Duff, I. Erisman and J. Reid, Direct Methods for Sparse Matrices, Oxford University Press, 1986. (book) C. Chevalier, F. Pellegrini, PT-Scotch , Parallel Computing, 34(6-8), 318-331, 2008. http://www.labri.fr/perso/pelegrin/scotch/ G. Karypis, K. Schloegel, V. Kumar, ParMetis: Parallel Graph Partitioning and Sparse Matrix Ordering Library , University of Minnesota. http://www- users.cs.umn.edu/~karypis/metis/parmetis/ E. Boman, K. Devine, et al., Zoltan, Parallel Partitioning, Load Balancing and Data-Management Services , Sandia National Laboratories. http://www.cs.sandia.gov/Zoltan/ J. Gilbert, Predicting structures in sparse matrix computations , SIAM. J. Matrix Anal. & App, 15(1), 62 79, 1994. J.W.H. Liu, Modification of the minimum degree algorithm by multiple elimination , ACM Trans. Math. Software, Vol. 11, 141-153, 1985. T.A. Davis, J.R. Gilbert, S. Larimore, E. Ng, A column approximate minimum degree ordering algorithm , ACM Trans. Math. Software, 30 (3), 353-376, 2004 P. Amestoy, X.S. Li and E. Ng, Diagonal Markowitz Scheme with Local Symmetrization , SIAM J. Matrix Anal. Appl., Vol. 29, No. 1, pp. 228-244, 2007. 34

  35. Exercises Homework3 in Hands-On-Exercises/ directory Show that: If RTR = AT+A and A = LU, then struct(L+U) struct(RT+R) Show that: [George/Ng `87] If RTR = ATA and PA = LU, then for any row permutation P, struct(L+U) struct(RT+R) 35

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#