Sketching Techniques for Efficient Numerical Linear Algebra on Massive Data Sets

Sketching as a Tool for Numerical Linear Algebra
All Lectures
David Woodruff
IBM Almaden
Massive data sets
 
Examples
Internet traffic logs
Financial data
etc.
 
 
Algorithms
Want nearly linear time or less
Usually at the cost of a randomized approximation
Regression analysis

Regression
Statistical method to study dependencies between variables in the presence of noise.

Linear Regression
Statistical method to study linear dependencies between variables in the presence of noise.

Example
Ohm's law V = R ∙ I
Find the linear function that best fits the data

Standard Setting
One measured variable b
A set of predictor variables a_1, …, a_d
Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε
ε is assumed to be noise and the x_i are model parameters we want to learn
Can assume x_0 = 0
Now consider n observations of b
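To ground this setup, here is a tiny worked example (synthetic Ohm's-law data with my own numbers, not from the slides) that recovers R in V = R ∙ I by least squares:

```python
# Tiny worked example of this setup (synthetic Ohm's-law data, my own numbers,
# not from the slides): recover R in V = R * I by least squares.
import numpy as np

rng = np.random.default_rng(10)
I = np.linspace(0.1, 2.0, 50)                  # measured currents (predictor)
V = 4.7 * I + 0.05 * rng.standard_normal(50)   # noisy voltages, true R = 4.7

A = I[:, None]                                 # n x 1 design matrix
R_hat, *_ = np.linalg.lstsq(A, V, rcond=None)
print(R_hat)                                   # close to 4.7
```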
Regression analysis
 
Matrix form
Input: an n x d matrix A and a vector b = (b_1, …, b_n)
n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close

Consider the over-constrained case, when n >> d
Can assume that A has full column rank
Regression analysis
 
Least Squares Method
 
Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i – <A_{i*}, x>)²

A_{i*} is the i-th row of A
 
Certain desirable statistical properties
 
Regression analysis
 
Geometry of regression
We want to find an x that minimizes |Ax-b|_2
The product Ax can be written as A_{*1} x_1 + A_{*2} x_2 + ... + A_{*d} x_d, where A_{*i} is the i-th column of A
This is a linear d-dimensional subspace
The problem is equivalent to computing the point of the column space of A nearest to b in l_2-norm
Regression analysis
 
Solving least squares regression via the normal equations
 
How to find the solution x to min_x |Ax-b|_2 ?

Equivalent problem: min_x |Ax-b|_2^2
Write b = Ax’ + b’, where b’ is orthogonal to the columns of A
Cost is |A(x-x’)|_2^2 + |b’|_2^2 by the Pythagorean theorem
Optimal solution x if and only if A^T(Ax-b) = A^T(Ax-Ax’) = 0
Normal Equation: A^T A x = A^T b for any optimal x
x = (A^T A)^{-1} A^T b

If the columns of A are not linearly independent, the Moore-Penrose pseudoinverse gives a minimum norm solution x
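A minimal NumPy illustration of the normal equations on synthetic data (my own example, assuming A has full column rank; not the lecture's code):

```python
# A minimal illustration of the normal equations on synthetic data (my own
# example, assuming A has full column rank; not the lecture's code).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)        # x = (A^T A)^{-1} A^T b
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)       # standard least-squares solver

print(np.allclose(x_normal, x_ref))                 # True up to numerical error
```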
 
Moore-Penrose Pseudoinverse
Time Complexity
Sketching to solve least squares regression
 
How to find an approximate solution x to min_x |Ax-b|_2 ?

Goal: output x‘ for which |Ax‘-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability

Draw S from a k x n random family of matrices, for a value k << n

Compute S*A and S*b

Output the solution x‘ to min_x |(SA)x-(Sb)|_2
x’ = (SA)^- Sb
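A hedged sketch-and-solve example of this recipe (my own toy sizes, a Gaussian choice of S, and the heuristic k = d/ε²; not the lecture's code):

```python
# Sketch-and-solve least squares: solve the sketched problem
# min_x |(SA)x - (Sb)|_2 and compare its cost to the exact least squares cost.
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 10000, 20, 0.2
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

k = int(d / eps**2)                                   # heuristic sketch size
S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))    # i.i.d. N(0, 1/k) entries

x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

cost = lambda x: np.linalg.norm(A @ x - b)
print(cost(x_sketch) / cost(x_exact))                 # typically about 1 + O(eps)
```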
How to choose the right sketching matrix S?
 
Recall: output the solution x‘ to min_x |(SA)x-(Sb)|_2

Lots of matrices work

S is a d/ε^2 x n matrix of i.i.d. Normal random variables

To see why this works, we introduce the notion of a subspace embedding
Subspace Embeddings
 
Let k = O(d/ε^2)
Let S be a k x n matrix of i.i.d. normal N(0,1/k) random variables
For any fixed d-dimensional subspace, i.e., the column space of an n x d matrix A:
W.h.p., for all x in R^d, |SAx|_2 = (1±ε)|Ax|_2
Entire column space of A is preserved
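An illustrative numerical check of this property (my own sketch, with heuristic sizes and constants): the condition |SAx|_2 = (1±ε)|Ax|_2 for all x is equivalent to all singular values of S·U lying in [1−ε, 1+ε], where U is an orthonormal basis for the column space of A.

```python
# Numerical check of the subspace embedding property via singular values of S @ U.
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 5000, 10, 0.25
A = rng.standard_normal((n, d))
U, _ = np.linalg.qr(A)                    # orthonormal basis for col(A)

k = int(4 * d / eps**2)                   # O(d/eps^2) rows; the constant 4 is a heuristic
S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))   # i.i.d. N(0, 1/k) entries

sv = np.linalg.svd(S @ U, compute_uv=False)
print(sv.min(), sv.max())                 # typically well within [1-eps, 1+eps]
```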
 
Why is this true?
Subspace Embeddings – A Proof
 
Rotational Invariance
Orthogonal Implies Independent
Where were we?
Back to Subspace Embeddings
Johnson-Lindenstrauss Theorem
Net for Sphere
Net for Subspace
Net Argument
Finishing the Net Argument
Finishing the Net Argument
We showed that S is a subspace embedding, that is, simultaneously for all x, |SAx|_2 = (1±ε)|Ax|_2
What does this have to do with regression?
Back to Regression
Subspace Embeddings for Regression

Want x so that |Ax-b|_2 ≤ (1+ε) min_y |Ay-b|_2
Consider the subspace L spanned by the columns of A together with b
Then for all y in L, |Sy|_2 = (1±ε) |y|_2
Hence, |S(Ax-b)|_2 = (1±ε) |Ax-b|_2 for all x
Solve argmin_y |(SA)y – (Sb)|_2
Given SA, Sb, can solve in poly(d/ε) time

Only problem is computing SA takes O(nd^2) time
How to choose the right sketching matrix S? [S]
 
S is a Subsampled Randomized Hadamard Transform
S = P*H*D
 
D is a diagonal matrix with +1, -1 on diagonals
 
H is the Hadamard transform
 
P just chooses a random (small) subset of rows of H*D
 
S*A can be computed in O(nd log n) time
 
Why does it work?
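A small illustrative SRHT example follows (my own code, not the lecture's). For clarity it builds the n x n Hadamard matrix explicitly, which costs O(n² d); the O(nd log n) bound needs a fast Walsh-Hadamard transform instead. The sizes and the sqrt(n/k) rescaling are assumptions for the sketch.

```python
# Subsampled Randomized Hadamard Transform: S = P*H*D applied to A.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
n, d, k = 1024, 8, 128                    # n must be a power of two here
A = rng.standard_normal((n, d))

D = rng.choice([-1.0, 1.0], size=n)       # random signs (diagonal of D)
H = hadamard(n) / np.sqrt(n)              # normalized Hadamard transform
rows = rng.choice(n, size=k, replace=False)   # P: random (small) subset of rows

SA = np.sqrt(n / k) * (H @ (D[:, None] * A))[rows]   # S*A = P*H*D*A, rescaled
print(SA.shape)                           # (k, d)
```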
Why does this work?
Proving the Flattening Lemma
Consequence of the Flattening Lemma
Matrix Chernoff Bound
Matrix Chernoff Bound
Matrix Chernoff Bound
SRHT Wrapup
Faster Subspace Embeddings S [CW,MM,NN]
 
 
CountSketch matrix
 
Define k x n matrix S, for k = O(d^2/ε^2)

S is really sparse: single randomly chosen non-zero entry per column

Can compute S ∙ A in nnz(A) time!
nnz(A) is number of non-zero entries of A
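An illustrative CountSketch application (the dense NumPy input and my own helper name are assumptions): each row of A, i.e. each column of S, gets a single hash bucket and a random sign, so forming S·A takes one pass over the entries of A.

```python
# CountSketch applied to a matrix: one non-zero (+/-1) per column of S.
import numpy as np

def countsketch_apply(A, k, rng):
    n = A.shape[0]
    h = rng.integers(0, k, size=n)          # bucket for each of the n rows of A
    s = rng.choice([-1.0, 1.0], size=n)     # random sign for each row
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, h, s[:, None] * A)        # scatter-add signed rows into buckets
    return SA

rng = np.random.default_rng(4)
A = rng.standard_normal((10000, 5))
SA = countsketch_apply(A, k=250, rng=rng)
print(SA.shape)                             # (250, 5)
```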
Simple Proof [Nguyen]
Matrix Product Result [Kane, Nelson]
From Vectors to Matrices

(From vectors to matrices) For ε, δ ∈ (0, 1/2), let D be a distribution on matrices S with k rows and n columns that satisfies the (ε, δ, ℓ)-JL moment property for some ℓ ≥ 2. Then for A, B matrices with n rows,

    Pr_S[ |A^T S^T S B − A^T B|_F ≥ 3ε |A|_F |B|_F ] ≤ δ

Proof: For a random scalar X, let |X|_p = (E[|X|^p])^{1/p}
Sometimes consider X = |T|_F for a random matrix T: | |T|_F |_p = (E[|T|_F^p])^{1/p}
Can show |·|_p is a norm
Minkowski's Inequality: |X+Y|_p ≤ |X|_p + |Y|_p
For unit vectors x, y, need to bound | <Sx, Sy> − <x, y> |_ℓ

From Vectors to Matrices

For unit vectors x, y,
    | <Sx, Sy> − <x, y> |_ℓ
    = (1/2) | (|Sx|_2^2 − 1) + (|Sy|_2^2 − 1) − (|S(x−y)|_2^2 − |x−y|_2^2) |_ℓ
    ≤ (1/2) ( | |Sx|_2^2 − 1 |_ℓ + | |Sy|_2^2 − 1 |_ℓ + | |S(x−y)|_2^2 − |x−y|_2^2 |_ℓ )
    ≤ (1/2) ( ε δ^{1/ℓ} + ε δ^{1/ℓ} + |x−y|_2^2 ε δ^{1/ℓ} )
    ≤ 3 ε δ^{1/ℓ}          (using |x−y|_2^2 ≤ 4 for unit x, y)

By linearity, for arbitrary x, y, | <Sx, Sy> − <x, y> |_ℓ ≤ 3 ε δ^{1/ℓ} |x|_2 |y|_2

Suppose A has d columns and B has e columns. Let the columns of A be A_1, …, A_d and the columns of B be B_1, …, B_e
Define X_{i,j} = ( <S A_i, S B_j> − <A_i, B_j> ) / ( |A_i|_2 |B_j|_2 )
Then |A^T S^T S B − A^T B|_F^2 = Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 X_{i,j}^2

From Vectors to Matrices

Have shown: for arbitrary x, y, | <Sx, Sy> − <x, y> |_ℓ ≤ 3 ε δ^{1/ℓ} |x|_2 |y|_2
For X_{i,j} = ( <S A_i, S B_j> − <A_i, B_j> ) / ( |A_i|_2 |B_j|_2 ):

    | |A^T S^T S B − A^T B|_F^2 |_{ℓ/2} = | Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 X_{i,j}^2 |_{ℓ/2}
    ≤ Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 | X_{i,j}^2 |_{ℓ/2}          (Minkowski's inequality)
    = Σ_i Σ_j |A_i|_2^2 |B_j|_2^2 | X_{i,j} |_ℓ^2
    ≤ (3 ε δ^{1/ℓ})^2 Σ_i Σ_j |A_i|_2^2 |B_j|_2^2
    = (3 ε δ^{1/ℓ})^2 |A|_F^2 |B|_F^2

Since E[ |A^T S^T S B − A^T B|_F^ℓ ] = | |A^T S^T S B − A^T B|_F^2 |_{ℓ/2}^{ℓ/2}, by Markov's inequality,

    Pr[ |A^T S^T S B − A^T B|_F > 3ε |A|_F |B|_F ] ≤ ( 1 / (3ε |A|_F |B|_F) )^ℓ ∙ E[ |A^T S^T S B − A^T B|_F^ℓ ] ≤ δ
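An illustrative numerical check of this approximate matrix product bound (my own toy example, not the lecture's code): with a k x n Gaussian sketch S and k on the order of 1/ε², the error |A^T S^T S B − A^T B|_F typically falls below 3ε|A|_F|B|_F.

```python
# Empirical check of the approximate matrix product guarantee.
import numpy as np

rng = np.random.default_rng(5)
n, d, e, eps = 5000, 6, 4, 0.2
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, e))

k = int(3 / eps**2)                                  # heuristic sketch size
S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))

err = np.linalg.norm(A.T @ S.T @ S @ B - A.T @ B)    # Frobenius norm by default
bound = 3 * eps * np.linalg.norm(A) * np.linalg.norm(B)
print(err <= bound)                                  # typically True
```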
Result for Vectors
CountSketch Satisfies the JL Property
CountSketch Satisfies the JL Property
Where are we?
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
Affine Embeddings
Affine Embeddings
Affine Embeddings
 
Claim: |A+B|_F^2 = |A|_F^2 + |B|_F^2 + 2 Tr(A^T B)

Proof: |A+B|_F^2 = Σ_i |A_i + B_i|_2^2
     = Σ_i ( |A_i|_2^2 + |B_i|_2^2 + 2<A_i, B_i> )
     = |A|_F^2 + |B|_F^2 + 2 Tr(A^T B)
Affine Embeddings: Missing Proofs
Affine Embeddings: Missing Proofs
 
Claim: Tr(AB) ≤ |A|_F |B|_F

Proof: Tr(AB) = Σ_i <A_i, B_i> for rows A_i of A and columns B_i of B
     ≤ Σ_i |A_i|_2 |B_i|_2          (Cauchy-Schwarz for each i)
     ≤ ( Σ_i |A_i|_2^2 )^{1/2} ( Σ_i |B_i|_2^2 )^{1/2}          (another Cauchy-Schwarz)
     = |A|_F |B|_F
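A quick numerical sanity check of these two Frobenius-norm facts (illustrative only; random matrices, and the trace inequality is checked in the form Tr(A^T B) ≤ |A|_F |B|_F):

```python
# Verify |A+B|_F^2 = |A|_F^2 + |B|_F^2 + 2 Tr(A^T B) and the trace inequality.
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 30))
B = rng.standard_normal((50, 30))

lhs = np.linalg.norm(A + B) ** 2
rhs = np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2 + 2 * np.trace(A.T @ B)
print(np.isclose(lhs, rhs))                                              # True

print(abs(np.trace(A.T @ B)) <= np.linalg.norm(A) * np.linalg.norm(B))   # True
```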
Affine Embeddings: Homework Proof
Low rank approximation
 
A is an n x d matrix
Think of n points in R^d

E.g., A is a customer-product matrix
A_{i,j} = how many times customer i purchased item j
 
A is typically well-approximated by low rank matrix
E.g., high rank because of noise
 
Goal:
 find a low rank matrix approximating A
Easy to store, data more interpretable
What is a good low rank approximation?
 
 
Singular Value Decomposition (SVD)
Any matrix A = U ∙ Σ ∙ V
 U has orthonormal columns
 Σ is diagonal with non-increasing positive entries down the diagonal
 V has orthonormal rows

Rank-k approximation: A_k = U_k ∙ Σ_k ∙ V_k
A_k = argmin_{rank-k matrices B} |A-B|_F
(|C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2})

Computing A_k exactly is expensive
The rows of V_k are the top k principal components
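An illustrative rank-k approximation via truncated SVD in NumPy (my own toy data; the fact used is that A_k = U_k Σ_k V_k is the best rank-k approximation in Frobenius norm):

```python
# Rank-k approximation of a noisy low-rank matrix by truncated SVD.
import numpy as np

rng = np.random.default_rng(7)
k = 10
A = rng.standard_normal((200, k)) @ rng.standard_normal((k, 60))   # low rank signal
A += 0.01 * rng.standard_normal(A.shape)                           # plus noise

U, sing, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * sing[:k]) @ Vt[:k]          # keep the top k singular triples

print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # small relative error
```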
Low rank approximation
 
 
Goal: output a rank-k matrix A’, so that |A-A’|_F ≤ (1+ε) |A-A_k|_F

Can do this in nnz(A) + (n+d)*poly(k/ε) time [S,CW]
nnz(A) is number of non-zero entries of A
 
 
 
 
 
Solution to low-rank approximation [S]
Given n x d input matrix A
Compute S*A using a random matrix S with k/ε << n rows. S*A takes random linear combinations of rows of A

Project rows of A onto SA, then find best rank-k approximation to points inside of SA.
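A hedged sketch of this recipe (illustrative sizes and an ad hoc sketch dimension, not tuned parameters): sketch with a Gaussian S, project the rows of A onto the row space of SA, and take the best rank-k approximation inside that subspace.

```python
# Sketch-based low rank approximation: project rows of A onto rowspace(SA).
import numpy as np

rng = np.random.default_rng(8)
n, d, k = 2000, 300, 5
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A += 0.01 * rng.standard_normal((n, d))

m = 4 * k                                   # rows of S, a heuristic stand-in for k/eps
S = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

Q, _ = np.linalg.qr((S @ A).T)              # orthonormal basis of rowspace(SA), shape d x m
P = A @ Q                                   # coordinates of each row of A in that subspace
U, sing, Vt = np.linalg.svd(P, full_matrices=False)
A_approx = (U[:, :k] * sing[:k]) @ Vt[:k] @ Q.T    # best rank-k approximation within rowspace(SA)

U2, s2, V2t = np.linalg.svd(A, full_matrices=False)
A_best = (U2[:, :k] * s2[:k]) @ V2t[:k]
print(np.linalg.norm(A - A_approx), np.linalg.norm(A - A_best))   # comparable costs
```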
What is the matrix S?
 
S can be a k/ε x n matrix of i.i.d. normal random variables

[S] S can be a k/ε x n Fast Johnson Lindenstrauss Matrix
Uses Fast Fourier Transform

[CW] S can be a poly(k/ε) x n CountSketch matrix

[ 0 0 1 0  0 1  0 0 ]
[ 1 0 0 0  0 0  0 0 ]
[ 0 0 0 -1 1 0 -1 0 ]
[ 0 -1 0 0 0 0  0 1 ]

S ∙ A can be computed in nnz(A) time
Why do these Matrices Work?
Why do these Matrices Work?
Caveat: projecting the points onto SA is slow
 
 
Current algorithm:
1. Compute S*A
2. Project each of the rows onto S*A
3. Find best rank-k approximation of projected points inside of rowspace of S*A

Bottleneck is step 2

[CW] Approximate the projection
Fast algorithm for approximate regression: min_{rank-k X} |X(SA)-A|_F^2
Want nnz(A) + (n+d)*poly(k/ε) time
Instead solve min_{rank-k X} |X(SA)R-AR|_F^2, sketching on the right with R
Can solve with affine embeddings
Using Affine Embeddings
Low Rank Approximation Summary
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
High Precision Regression
 
Small QR Decomposition
Finding a Constant Factor Solution
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
Leverage Score Sampling
Leverage Score Sampling
Leverage Score Sampling gives a Subspace Embedding
Leverage Score Sampling gives a Subspace Embedding
Applying the Matrix Chernoff Bound
Fast Computation of Leverage Scores
Fast Computation of Leverage Scores
Course Outline
Subspace embeddings and least squares regression
Gaussian matrices
Subsampled Randomized Hadamard Transform
CountSketch
Affine embeddings
Application to low rank approximation
High precision regression
Leverage score sampling
Distributed Low Rank Approximation
Distributed low rank approximation
 
We have fast algorithms for low rank approximation, but
can they be made to work in a distributed setting?
 
Matrix A distributed among s servers
 
For t = 1, …, s, we get a customer-product matrix from the t-th shop stored in server t. Server t’s matrix = A^t

Customer-product matrix A = A^1 + A^2 + … + A^s
Model is called the arbitrary partition model

More general than the row-partition model, in which each customer shops in only one shop
The Communication Model
[Diagram: Server 1, Server 2, …, Server s, each connected to the Coordinator]

Each player talks only to a Coordinator via 2-way communication

Can simulate arbitrary point-to-point communication up to a factor of 2 (and an additive O(log s) factor per message)
Communication cost of low rank approximation
Work on Distributed Low Rank Approximation
 
[FSS]: First protocol for the row-partition model.
O(sdk/ε) real numbers of communication
Don’t analyze bit complexity (can be large)
SVD running time, see also [BKLW]

[KVW]: O(skd/ε) communication in arbitrary partition model

[BWZ]: O(skd) + poly(sk/ε) words of communication in arbitrary partition model. Input sparsity time
Matching Ω(skd) words of communication lower bound

Variants: kernel low rank approximation [BLSWX], low rank approximation of an implicit matrix [WZ], sparsity [BWZ]
Outline of Distributed Protocols
 
[FSS] protocol
 
[KVW] protocol
 
[BWZ] protocol
Constructing a Coreset [FSS]
Constructing a Coreset
Unions of Coresets
[FSS] Row-Partition Protocol
Problems:
1. sdk/ε real numbers of communication
2. bit complexity can be large
3. running time for SVDs [BLKW]
4. doesn’t work in arbitrary partition model

This is an SVD-based protocol. Maybe our random matrix techniques can improve communication just like they improved computation? The [KVW] protocol will handle 2, 3, and 4
[KVW] Arbitrary Partition Model Protocol
 
Inspired by the sketching algorithm presented earlier
 
Let S be one of the k/ε x n random matrices discussed
S can be generated pseudorandomly from a small seed
Coordinator sends the small seed for S to all servers

Server t computes SA^t and sends it to Coordinator

Coordinator sends Σ_{t=1}^{s} SA^t = SA to all servers

There is a good k-dimensional subspace inside of SA. If we knew it, the t-th server could output the projection of A^t onto it
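A toy simulation of this communication pattern (my own code, not the [KVW] implementation): every server derives the same S from a shared seed, sketches its local A^t, and the coordinator sums the sketches.

```python
# Simulate: shared seed -> identical S everywhere; coordinator sums SA^t to get SA.
import numpy as np

s, n, d, m, seed = 4, 1000, 50, 20, 123
rng = np.random.default_rng(9)
A_parts = [rng.standard_normal((n, d)) for _ in range(s)]   # server t holds A^t

S = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
sketches = [S @ At for At in A_parts]       # each server sends an m x d sketch
SA = sum(sketches)                          # coordinator aggregates

print(np.allclose(SA, S @ sum(A_parts)))    # True: sum of sketches = sketch of the sum
```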
[KVW] protocol
Phase 1: Learn the row space of SA
The optimal k-dimensional space in SA has cost ≤ (1+ε)|A-A_k|_F

[KVW] protocol
Phase 2: Find an approximately optimal space W inside of SA
The approximate space W in SA has cost ≤ (1+ε)^2 |A-A_k|_F
[BWZ] Protocol
[BWZ] Protocol
Intuitively, U looks like the top k left singular vectors of SA
Top k right singular vectors of SA work because S is a projection-cost preserving sketch!
[BWZ] Analysis
Conclusions for Distributed Low Rank Approximation
 
[BWZ] Optimal O(sdk) + poly(sk/ε) communication protocol for low rank approximation in arbitrary partition model
Handle bit complexity by adding Tao/Vu noise
Input sparsity time
2 rounds, which is optimal [W]
Optimal data stream algorithms, improves [CW, L, GP]
 
Communication of other optimization problems?
Computing the rank of an n x n matrix over the reals
Linear Programming
Graph problems: Matching
etc.
Additional Time-Permitting Topics
Will cover some recent topics at a research level (many details omitted)
Weighted Low Rank Approximation
Regression and Low Rank Approximation with M-Estimator Loss Functions
Finding Heavy Hitters in a Data Stream optimally
Presentation Transcript


  1. Sketching as a Tool for Numerical Linear Algebra All Lectures David Woodruff IBM Almaden

  2. Massive data sets Examples Internet traffic logs Financial data etc. Algorithms Want nearly linear time or less Usually at the cost of a randomized approximation 2

  3. Regression analysis Regression Statistical method to study dependencies between variables in the presence of noise. 3

  4. Regression analysis Linear Regression Statistical method to study linear dependencies between variables in the presence of noise. 4

  5. Regression analysis Linear Regression Statistical method to study linear dependencies between variables in the presence of noise. Example Regression 250 Example Ohm's law V = R I 200 150 Example Regression 100 50 0 0 50 100 150 5

  6. Regression analysis Linear Regression Statistical method to study linear dependencies between variables in the presence of noise. Example Regression 250 Example Ohm's law V = R I Find linear function that best fits the data 200 150 Example Regression 100 50 0 0 50 100 150 6

  7. Regression analysis Linear Regression Statistical method to study linear dependencies between variables in the presence of noise. Standard Setting One measured variable b A set of predictor variables a , , a Assumption: b = x + a x + + a x + is assumed to be noise and the xiare model parameters we want to learn Can assume x0= 0 Now consider n observations of b 1 d 1 1 0 d d 7

  8. Regression analysis Matrix form Input: n d-matrix A and a vector b=(b1, , bn) n is the number of observations; d is the number of predictor variables Output: x*so that Ax* and b are close Consider the over-constrained case, when n d Can assume that A has full column rank 8

  9. Regression analysis Least Squares Method Find x* that minimizes |Ax-b|22= (bi <Ai*, x>) Ai*is i-th row of A Certain desirable statistical properties 9

  10. Regression analysis Geometry of regression We want to find an x that minimizes |Ax-b|2 The product Ax can be written as A*1x1+ A*2x2+ ... + A*dxd where A*iis the i-th column of A This is a linear d-dimensional subspace The problem is equivalent to computing the point of the column space of A nearest to b in l2-norm 10

  11. Regression analysis Solving least squares regression via the normal equations How to find the solution x to minx |Ax-b|2 ? Equivalent problem: minx |Ax-b |22 Write b = Ax + b , where b orthogonal to columns of A Cost is |A(x-x )|22+ |b |22 by Pythagorean theorem Optimal solution x if and only if AT(Ax-b) = AT(Ax-Ax ) = 0 Normal Equation: ATAx = ATb for any optimal x x = (ATA)-1 AT b If the columns of A are not linearly independent, the Moore- Penrose pseudoinverse gives a minimum norm solution x 11

  12. Moore-Penrose Pseudoinverse Any optimal solution x has the form A b + I V V Tz, where V corresponds to the rows i of VT for which ?,? > 0 Why? Because A I V V Tz = 0, so A b + I V V Tz is a solution. This is a d-rank(A) dimensional affine space so it spans all optimal solutions Since A bis in column span of V , by Pythagorean theorem, |A b + I V V Tz|2 2= A b2 2+ |(I V V T)z|2 2 A b2 2 12

  13. Time Complexity Solving least squares regression via the normal equations Need to compute x = A-b Naively this takes nd2time Can do nd1.376using fast matrix multiplication But we want much better running time! 13

  14. Sketching to solve least squares regression How to find an approximate solution x to minx |Ax-b|2 ? Goal: output x for which |Ax -b|2 (1+ ) minx |Ax-b|2 with high probability Draw S from a k x n random family of matrices, for a value k << n Compute S*A and S*b Output the solution x to minx |(SA)x-(Sb)|2 x = (SA)-Sb 14

  15. How to choose the right sketching matrix S? Recall: output the solution x to minx |(SA)x-(Sb)|2 Lots of matrices work S is d/ 2 x n matrix of i.i.d. Normal random variables To see why this works, we introduce the notion of a subspace embedding 15

  16. Subspace Embeddings Let k = O(d/ 2) Let S be a k x n matrix of i.i.d. normal N(0,1/k) random variables For any fixed d-dimensional subspace, i.e., the column space of an n x d matrix A W.h.p., for all x in Rd, |SAx|2 = (1 )|Ax|2 Entire column space of A is preserved Why is this true?

  17. Subspace Embeddings A Proof Want to show |SAx|2 = (1 )|Ax|2 for all x Can assume columns of A are orthonormal (since we prove this for all x) Claim: SA is a k x d matrix of i.i.d. N(0,1/k) random variables First property: for two independent random variables X and Y, with X drawn from N(0,a2) and Y drawn from N(0,b2), we have X+Y is drawn from N(0, a2+ b2)

  18. X+Y is drawn from N(0, ?2+ ?2) Probability density function ?? of Z = X+Y is convolution of probability density functions ?? and ?? Calculation: ? ? ?2 2 ?2? ?2+?2 ? ?2 2 ?2+?2 ??2 ?2+?2 ?2 2?2 = ? 2 2?2 ??? = ??? ? ??? ?? 1 1 ? 2?.5? ?2/2?2 , ??? = ? 2?.5? ?2/2?2 ??? = 2 ?2? ?2+?2 ? ??2 ?2+?2 ?2+?2.5 2?.5??? 2 1 1 ? 2?.5? (? ?)2/2?2 ? 2?.5? ?2/2?2?? dx = 1 ??? = Density of Gaussian distribution: 2 ?2? ?2+?2 ? ??2 ?2+?2 ?2+?2.5 2?.5??? 2 1 2?.5?2+?2 .5? ?2/2 ?2+?2 dx =

  19. Rotational Invariance Second property: if u, v are vectors with <u, v> = 0, then <g,u> and <g,v> are independent, where g is a vector of i.i.d. N(0,1/k) random variables Why? If g is an n-dimensional vector of i.i.d. N(0,1) random variables, and R is a fixed matrix, then the probability density function of Rg is ??RRT 1? det(RRT) 2??/2? 1 ? ? = 2 RRT is the covariance matrix For a rotation matrix R, the distribution of Rg and of g are the same

  20. Orthogonal Implies Independent Want to show: if u, v are vectors with <u, v> = 0, then <g,u> and <g,v> are independent, where g is a vector of i.i.d. N(0,1/k) random variables Choose a rotation R which sends u to e1, and sends v to e2 < g,u > = < gR,RTu > = < h, e1> = h1 < g,v > = < gR,RTv > = < h, e2> = h2 where h is a vector of i.i.d. N(0, 1/k) random variables Then h1 and h2are independent by definition

  21. Where were we? Claim: SA is a k x d matrix of i.i.d. N(0,1/k) random variables Proof: The rows of SA are independent Each row is: < g,A1>,< g,A2> , ,< g,Ad> First property implies the entries in each row are N(0,1/k) since the columns Ai have unit norm Since the columns Ai are orthonormal, the entries in a row are independent by our second property

  22. Back to Subspace Embeddings Want to show |SAx|2 = (1 )|Ax|2 for all x Can assume columns of A are orthonormal Can also assume x is a unit vector SA is a k x d matrix of i.i.d. N(0,1/k) random variables Consider any fixed unit vector ? ?? SAx2 2= i k< gi,x >2, where gi is i-th row of SA 2 Each < gi,x >2 is distributed as N 0,1 E[< gi,x >2] = 1/k, and so E[ SAx2 How concentrated is SAx2 k 2] = 1 2 about its expectation?

  23. Johnson-Lindenstrauss Theorem Suppose h1, ,hkare i.i.d. N(0,1) random variables Then G = ihi Apply known tail bounds to G: (Upper) Pr G k + 2 kx.5+ 2x e x (Lower) Pr G k 2 kx.5 e x If x = 2k 2 is a ?2-random variable 16, then Pr G k(1 ) 1 2e 2k/16 If k = ( 2log(1 )), this probability is 1- 2 1 1 2 d Pr SAx2 This only holds for a fixed x, how to argue for all x?

  24. Net for Sphere Consider the sphere Sd 1 Subset N is a -net if for all x Sd 1, there is a y N, such that x y2 Greedy construction of N While there is a point x Sd 1 of distance larger than from every point in N, include x in N The sphere of radius /2 around ever point in N is contained in the sphere of radius 1+ around 0d Further, all such spheres are disjoint Ratio of volume of d-dimensional sphere of radius 1+ to dimensional sphere of radius ? is 1 + d/( /2)d, so N 1 + d/( /2)d

  25. Net for Subspace Let M = {Ax | x in N}, so M 1 + d/( /2)d Claim: For every x in Sd 1, there is a y in M for which Ax y2 Proof: Let x in Sd 1 be such that x x Then Ax Ax columns of A are orthonormal. Set y = Ax 2 2 , using that the 2= x x

  26. Net Argument For a fixed unit x, Pr SAx2 For a fixed pair of unit x, x , SAx2 are all 1 with probability 1 2 d SA x x 2 A x x 2 So Pr < Ax,Ax > = < SAx,SAx > O Choose a -net M = {Ax | x in N} of size 5? By a union bound, for all pairs y, y in M, < y,y > = < Sy,Sy > O Condition on this event By linearity, if this holds for y, y in M, for y, y we have < y, y > = < Sy,Sy > O 2 1 1 2 d 2, SAx 2 2 2, SA x x 2 2= SAx2 2= Ax2 2+ SAx 2+ Ax 2 2 < SAx,SAx > 2 2 < Ax,Ax > 2 2 = 1 2 (d)

  27. Finishing the Net Argument Let y = Ax for an arbitrary x Sd 1 Let y1 M be such that y y1 2 Let be such that (y y1)2= 1 1/ (could be infinite) Let y2 Then y y1 y2 M be such that y y1 y2 2 2 2 Set y2=y2 all integers i, . Repeat, obtaining y1,y2,y3, such that for y y1 y2 yi 2 i Implies yi 2 i 1+ i 2 i 1

  28. Finishing the Net Argument Have y1,y2,y3, such that y = iyi and yi 2 2 i 1 2= |S iyi|2 2 = iSyi 2 = iyi 2 = | iyi|2 = y2 = 1 O Sy2 2+ 2 i,j< Syi,Syj> 2+ 2 i,j < yi,yj> O i,jyi 2yj2 2 O 2 O Since this held for an arbitrary y = Ax for unit x, by linearity it follows that for all x, |SAx|2 = (1 )|Ax|2

  29. Back to Regression We showed that S is a subspace embedding, that is, simultaneously for all x, |SAx|2 = (1 )|Ax|2 What does this have to do with regression?

  30. Subspace Embeddings for Regression Want x so that |Ax-b|2 (1+ ) miny |Ay-b|2 Consider subspace L spanned by columns of A together with b Then for all y in L, |Sy|2 = (1 ) |y|2 Hence, |S(Ax-b)|2 = (1 ) |Ax-b|2 for all x Solve argminy |(SA)y (Sb)|2 Given SA, Sb, can solve in poly(d/ ) time Only problem is computing SA takes O(nd2) time

  31. How to choose the right sketching matrix S? [S] S is a Subsampled Randomized Hadamard Transform S = P*H*D D is a diagonal matrix with +1, -1 on diagonals H is the Hadamard transform P just chooses a random (small) subset of rows of H*D S*A can be computed in O(nd log n) time Why does it work? 31

  32. Why does this work? We can again assume columns of A are orthonormal It suffices to show SAx2 2= 1 for all x 2= PHDAx2 HD is a rotation matrix, so HDAx2 Notation: let y = Ax 2= 1 for any x 2= Ax2 Flattening Lemma: For any fixed y, log.5nd/ n.5 Pr [ HDy C ] 2d 32

  33. Proving the Flattening Lemma log.5nd/ n.5 Flattening Lemma: Pr [ HDy C Let C > 0 be a constant. We will show for a fixed i in [n], ] 2d log.5nd/ n.5 Pr [ HDyi C ] 2nd If we show this, we can apply a union bound over all i HDyi = jHi,jDj,jyj t2 ( 2) 2 j j (Azuma-Hoeffding) Pr | jZj| > t 2e probability 1 Zj= Hi,jDj,jyj has 0 mean , where |Zj| j with |yj| n.5= j with probability 1 2=1 n |Zj| j j C2log C log.5nd nd 2e Pr | jZj| > 33 2 n.5 2nd

  34. Consequence of the Flattening Lemma Recall columns of A are orthonormal HDA has orthonormal columns log.5nd/ n.5 Flattening Lemma implies HDAei C probability 1 with 2d for a fixed i d log.5nd/ n.5 With probability 1 2, ejHDAei for all i,j C d.5log.5nd/ n.5 Given this, ejHDA2 C for all j (Can be optimized further) 34

  35. Matrix Chernoff Bound Let X1, ,Xs be independent copies of a symmetric random matrix X Rdxd with E X = 0, X2 , and E XTX Pr W2> 2d e s 2/(?2+ (here W2= sup Wx2/ x2) 2 ?2. Let W =1 s i [s]Xi. For any > 0, 3) Let V = HDA, and recall V has orthonormal columns Suppose P in the S = PHD definition samples uniformly with replacement. If row i is sampled in the j-th sample, then Pj,i= n, and is 0 otherwise Let Yi be the i-th sampled row of V = HDA TYi Let Xi= Id n Yi 1 nVi TVi= Id VTV = 0d E Xi = Id n j 2= 1 + n C2 log nd d nd ) Xi 2 Id 2+ n max ejHDA2 n= (dlog 35

  36. Matrix Chernoff Bound Recall: let Yi be the i-th sampled row of V = HDA Let Xi= Id n Yi E XTX + Id = Id+ Id 2n E Yi TYi TYi+ n2E[Yi TYiYi TYi] 1 n vi nd Tvivi Tvi= n ivi Tvi ?? 2 = 2Id 2Id+ n2 i 2 d nd ? TviC2 log Define Z = n ivi Note that ??? + ?? and Z are real symmetric, with non-negative eigenvalues Claim: for all vectors y, we have: ?????? + ??? ???? Proof: ?????? + ??? = ? ????? ?? n= C2dlog ?? ???? ?? 2 ? ?= ? 2 and ?? ? 2= ? ?< ??,? >2?? 2 ???? ?2log ???? = ? ???? < ??,? >2?2log ? ? ? Hence, ? ??? 2 ? ??? + ?? 2+ ?? 2= |? ??? + ??|2+ 1 ?? ? ?2+ 1 ?2?log + 1 nd Hence, E XTX 2= O d log 36

  37. Matrix Chernoff Bound nd Hence, E XTX 2= O d log Recall: (Matrix Chernoff) Let X1, ,Xs be independent copies of a symmetric random matrix X Rdxd with E X = 0, X2 , and E XTX ?2. Let W =1 2 s i [s]Xi. For any > 0,Pr W2> 2d e s 2/(?2+ 3) 2> 2d e s 2/( (d lognd ) Pr |Id PHDATPHDA d log nd , to make this probability less than Set s = d log 2 2 37

  38. SRHT Wrapup Have shown |Id PHDATPHDA |2< using Matrix log 2 samples d nd Chernoff Bound and with s = d log Implies for every unit vector x, |1 PHDAx2 so PHDAx2 2| = xTx x PHDATPHDA x < , 2 1 for all unit vectors x Considering the column span of A adjoined with b, we can again solve the regression problem The time for regression is now only O(nd log n) + poly(? log ? ? 38 ). Nearly optimal in matrix dimensions (n >> d)

  39. Faster Subspace Embeddings S [CW,MM,NN] CountSketch matrix Define k x n matrix S, for k = O(d2/ 2) S is really sparse: single randomly chosen non-zero entry per column [ [ 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 -1 1 0 -1 0 0-1 0 0 0 0 0 1 Can compute S A in nnz(A) time! nnz(A) is number of non-zero entries of A 39

  40. Simple Proof [Nguyen] Can assume columns of A are orthonormal Suffices to show |SAx|2 = 1 for all unit x For regression, apply S to [A, b] SA is a 2d2/ 2 x d matrix Suffices to show | ATST SA I|2 |ATST SA I|F Matrix product result shown below: Pr[|CSTSD CD|F2 [3/(?(# rows of S))] * |C|F2 |D|F2] 1 ? Set C = AT and D = A. Then |A|2F = d and (# rows of S) = 3 d2/( 2) 40

  41. Matrix Product Result [Kane, Nelson] Show: Pr[|CSTSD CD|F2 [3/(?(# rows of S))] * |C|F2 |D|F2] 1 ? (JL Property) A distribution on matrices S Rkx n has the ( , , )-JL moment property if for all x Rn with x2= 1, ES Sx2 2 1 (From vectors to matrices) For , 0,1 matrices S with k rows and n columns that satisfies the ( , , )-JL moment property for some 2. Then for A, B matrices with n rows, 2, let D be a distribution on ATSTSB ATBF 3 AFBF Pr S 41

  42. From Vectors to Matrices (From vectors to matrices) For , 0,1 S with k rows and n columns that satisfies the ( , , )-JL moment property for some 2. Then for A, B matrices with n rows, 2, let D be a distribution on matrices ATSTSB ATBF 3 AFBF Pr S Proof: For a random scalar X, let Xp= E Xp 1/p Sometimes consider X = TF for a random matrix T 1/p p | TF |p= E TF Can show |.|p is a norm Minkowski s Inequality: X + Yp Xp+ Yp For unit vectors x, y, need to bound | Sx,Sy - x,y | 42

  43. From Vectors to Matrices For unit vectors x, y, | Sx,Sy - x,y | = 2( Sx2 1 2 ( Sx2 1 2 ( 1 1 2 1 + 2 1 2 x y2 2| Sy2 S x y 2 2 1 + Sy2 1 + 2 1 + S x y 1 ) 2 x y2 2 ) 2 1 + x y2 2 3 1 Sx,Sy x,y x2y2 By linearity, for arbitrary x, y, 3 Suppose A has d columns and B has e columns. Let the columns of A be A1, ,Ad and the columns of B be B1, ,Be 1 Define Xi,j= ( SAi,SBj Ai,Bj) Ai 2Bj2 43 2= i jAi 2 2Xi,j 2 2 Bj2 ATSTSB ATBF

  44. From Vectors to Matrices 1 Sx,Sy x,y x2y2 Have shown: for arbitrary x, y, 3 2= i jAi 2 2Xi,j 1 2 For Xi,j= 2 Bj2 SAi,SBj Ai,Bj : ATSTSB ATBF Ai 2Bj2 2| /2= i jAi 2 2Xi,j 2 2 Bj2 | ATSTSB ATBF /2 2|Xi,j 2| /2 i jAi 2 2 Bj2 2Xi,j 2 = i jAi 2 2 Bj2 2 1 2 2Bj2 3 i jAi 2 2 1 2BF 2 = 3 AF /2 2 Since E ATSTSB ATBF ATSTSB ATBF , by Markov s inequality, = 2 1 Pr ATSTSB ATBF> 3 AFBF E |ATSTSB ATB|F 3 AFBF 44

  45. Result for Vectors Show: Pr[|CSTSD CD|F2 [3/(?(# rows of S))] * |C|F2 |D|F2] 1 ? (JL Property) A distribution on matrices S Rkx n has the ( , , )-JL moment property if for all x Rn with x2= 1, ES Sx2 2 1 (From vectors to matrices) For , 0,1 S with k rows and n columns that satisfies the ( , , )-JL moment property for some 2. Then for A, B matrices with n rows, 2, let D be a distribution on matrices ATSTSB ATBF 3 AFBF Pr S Just need to show that the CountSketch matrix S satisfies JL property and bound the number k of rows 45

  46. CountSketch Satisfies the JL Property (JL Property) A distribution on matrices S Rkx n has the ( , , )-JL moment property if for all x Rn with x2= 1, ES Sx2 2 1 We show this property holds with = 2. First, let us consider = 1 For CountSketch matrix S, let h:[n] -> [k] be a 2-wise independent hash function : n { 1,1} be a 4-wise independent hash function Let E = 1 if event E holds, and E = 0 otherwise 2 ] 2] = j kE[ i n h i = j ixi = j k i1,i2 [n]E[ h i1 = j h i2 = j i1 i2]xi1xi2 = j k i [n]E[ h i = j2]xi E[ Sx2 2 1 k j [k[ i nxi 2= x2 2 = 46

  47. CountSketch Satisfies the JL Property 2 2] = 4] = ?[ j k ? ? i n h i = j ixi i n h i = j i xi E[ Sx2 ?1,?2,?1.?2,?3,?4?[??1??2??3??4? ?1 = ?1? ?2 = ?1? ?3 = ?2? ?4= ?2 ]??1 ??2??3??4 We must be able to partition {?1,?2,?3,?4} into equal pairs 1 ? ??? 4= ?4 Suppose ?1= ?2= ?3= ?4. Then necessarily ?1= ?2. Obtain ? 4 1 2??3 2= ?2 Suppose ?1= ?2 and ?3= ?4 but ?1 ?3. Then get ?1,?2,?1,?3 4 ?4 4 ?2??1 Suppose ?1= ?3 and ?2= ?4 but ?1 ?2. Then necessarily ?1= ?2. Obtain ? 1 1 ??2 2??2 2 4. Obtain same bound if ?1= ?4 and ?2= ?3. ?2 ?1,?2??1 4(1 +2 ?)] = [1,1 +2 2 ?. Setting k = Hence, E[ Sx2 4] [ x2 4, ?2 k] 2 1 +2 47 1 So, ES Sx2 2?2?finishes the proof 2 1 ? 2 + 1 =

  48. Where are we? (JL Property) A distribution on matrices S Rkx n has the ( , , )-JL moment property if for all x Rn with x2= 1, ES Sx2 2 1 (From vectors to matrices) For , 0,1 matrices S with k rows and n columns that satisfies the ( , , )-JL moment property for some 2. Then for A, B matrices with n rows, 2, let D be a distribution on 2 2BF 2 ATSTSB ATB F 3 2AF Pr S 2 We showed CountSketch has the JL property with = 2, ??? ? = ?2? Matrix product result we wanted was: Pr[|CSTSD CD|F2 (3/(?k)) * |C|F2 |D|F2] 1 ? We are now down with the proof CountSketch is a subspace embedding 48

  49. Course Outline Subspace embeddings and least squares regression Gaussian matrices Subsampled Randomized Hadamard Transform CountSketch Affine embeddings Application to low rank approximation High precision regression Leverage score sampling 49

  50. Affine Embeddings Want to solve min number of columns 2, A is tall and thin with d columns, but B has a large ?? ?? ? Can t directly apply subspace embeddings Let s try to show SAX SBF= 1 AX BF for all X and see what properties we need of S Can assume A has orthonormal columns Let ? = ?? ?, where ? is the optimum 2 ?? 2= ?? ? ? + ? ?? ? 2 2??[ ? ? ??????? ] 2 2 ? ? 2 2? ? ? 2 ?( ? ? ? ? 2 ?? 2 = ?? ? ? ?? ? ? ?? ? ? ? ? ? ? ?? ? ? ? ? ? ? ? (use ?? ?? ????) F (if we have approx. matrix product) ?? ) (subspace embedding for A) ??????? ?B 2+ 2 ? ? ? ? 50 ?
