Demystifying Kernels: A Simplified Approach without Complicated Math

 
Kernels for dummies
 
Tea talk
September 22, 2014
 
I find kernel talks confusing.
 
- Equations seem to appear out of nowhere.
- It’s hard for me to extract the main message.
- I still don’t know what an RKHS is.
- Which is a little strange, considering I’ve known about Hilbert spaces since before most of you were born.
 
So today we’re going to attempt to demystify kernels.
 
My claim: if you understand linear algebra, you’ll understand kernels.
 
There will be no RKHSs in this talk.
And no Polish spaces.
And no mean embeddings.
But also no proofs; basically I’m going to avoid the hard
stuff.
 
Instead, just the intuition!
 
We’ll start with a pretty standard problem:
 
- You have samples from some distribution, p(x).
- You want to minimize

      F( ∫dx f(x) p(x) )

  with respect to f(x).
 
I’ll give examples later, but just about every kernel talk
you have ever heard considers this problem, or a slight
generalization.
Our problem:

      f̃(x) = arg min over f(x) of F( ∫dx f(x) p̃(x) ),      x ~ p,

where the empirical distribution is

      p̃(x) = (1/n) Σ_i δ(x - x_i).

For example,

      f(x) = +δ(x - x_1)   =>   ∫dx f(x) p̃(x) = +∞
      f(x) = -δ(x - x_1)   =>   ∫dx f(x) p̃(x) = -∞.

By suitably adjusting f(x), ∫dx f(x) p̃(x) can range from -∞ to +∞.

Therefore, we have to regularize f(x): we need a smoothness constraint.
 
If you’re Bayesian, you put a prior over f(x).

If you’re a kernel person you demand that

      ∫dxdy f(x) K⁻¹(x, y) f(y)

is in some sense small.

K(x, y) is a kernel. For example, K(x, y) = exp(-(x - y)²/2).
 
An aside: this defines an inner product,

      <f, g> = ∫dxdy f(x) K⁻¹(x, y) g(y).
 
 
This raises two questions:

  1. How do we make sense of K⁻¹(x, y)?
  2. What does ∫dxdy f(x) K⁻¹(x, y) f(y) have to do with smoothness?
 
1. Making sense of K⁻¹(x, y):

      ∫dy K⁻¹(x, y) K(y, z) = δ(x - z)

defines K⁻¹(x, y).

Think of K as a matrix. K⁻¹ is its inverse.

K has an uncountably infinite number of indices.
But otherwise it’s a very standard matrix.

K⁻¹ exists if all the eigenvalues of K are positive.

An aside: K⁻¹ doesn’t really exist. But that’s irrelevant.
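To make “K as a matrix” concrete, here is a minimal numerical sketch (an added illustration, not part of the original slides; it assumes NumPy and the Gaussian kernel above, evaluated on a small, coarse grid):

```python
import numpy as np

# Evaluate K(x, y) = exp(-(x - y)^2 / 2) on a coarse grid; the result is an
# ordinary symmetric matrix with one row and one column per grid point.
x = np.arange(-6.0, 7.0, 2.0)                        # 7 grid points
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # K[i, j] = K(x_i, x_j)

print(np.linalg.eigvalsh(K))                         # all eigenvalues positive

K_inv = np.linalg.inv(K)
print(np.abs(K @ K_inv - np.eye(len(x))).max())      # ~1e-16: the discrete delta(x - z)

# On a finer and finer grid the smallest eigenvalues sink toward zero, which
# is the numerical face of the aside that K^-1 doesn't really exist.
```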
 
2. What does

      ∫dxdy f(x) K⁻¹(x, y) f(y)

have to do with smoothness?

I’ll answer for a specific case: translation invariant kernels,

      K(x, y) = K(x - y).
 
 
 
 
      ∫dxdy f(x) K⁻¹(x, y) f(y)

            = ∫dxdy f(x) K⁻¹(x - y) f(y)        (translation invariance)

            = ∫dk |f̃(k)|² / K̃(k)                (Fourier transform)

where f̃(k) is the Fourier transform of f(x) and K̃(k) is the Fourier transform of K(x).
 
 
 
 
- For smooth kernels, K̃(k) falls off rapidly with k.
- For the above integral to be small, f̃(k) must fall off rapidly with k.
- In other words, f(x) must be smooth.
 
 
 
 
Example:

      K(x) = exp(-x²/2)   =>   K̃(k) ∝ exp(-k²/2),

so the penalty becomes

      ∫dk |f̃(k)|² / K̃(k) ∝ ∫dk |f̃(k)|² exp(+k²/2).
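To see how sharply this penalizes roughness, here is a quick worked example (an addition, not from the slides), for a Gaussian bump f of width σ, with constants and factors of 2π dropped:

```latex
f(x) = e^{-x^2/2\sigma^2}
  \;\Rightarrow\; |\tilde f(k)|^2 \propto \sigma^2 e^{-\sigma^2 k^2}
  \;\Rightarrow\;
  \int dk\, |\tilde f(k)|^2 e^{+k^2/2}
  \propto \sigma^2 \int dk\, e^{-(\sigma^2 - 1/2)k^2}
  = \sigma^2 \sqrt{\frac{\pi}{\sigma^2 - 1/2}} .
```

The penalty is finite only for σ² > 1/2, and it blows up as σ² approaches 1/2 from above: bumps much narrower (rougher) than the kernel width pay an enormous, eventually infinite, price.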
 
More generally, let the g_k(x) be the eigenfunctions of the kernel,

      ∫dy K(x, y) g_k(y) = λ(k) g_k(x)

=>

      ∫dxdy f(x) K⁻¹(x, y) f(y) = ∫dk [ ∫dx f(x) g_k(x) ]² / λ(k).

Typically,

- λ(k) falls off with k
- the g_k(x) become increasingly rough with k.
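A minimal numerical sketch of this (an added illustration, not in the original slides; assumes NumPy): discretize the Gaussian kernel on a grid and look at the eigenvectors of the resulting matrix, which play the role of the g_k(x).

```python
import numpy as np

# Eigen-decompose a discretized Gaussian kernel.  Eigenvectors attached to
# large eigenvalues are smooth; those attached to small eigenvalues oscillate
# rapidly -- the discrete analogue of "lambda(k) falls off with k while
# g_k(x) becomes increasingly rough".
x = np.linspace(-5.0, 5.0, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # K[i, j] = K(x_i, x_j)

eigvals, eigvecs = np.linalg.eigh(K)                 # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first

def sign_changes(v):
    """Count zero crossings: a crude roughness measure."""
    return int(np.sum(np.diff(np.sign(v)) != 0))

for k in (0, 1, 2, 5, 10, 20):
    print(f"lambda_{k} = {eigvals[k]:.2e},  sign changes in g_{k}: {sign_changes(eigvecs[:, k])}")
```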
 
Finally, let’s link this to linear algebra:

      ∫dx f(x) g(x)                   ↔   f · g
      ∫dx f(x) A(x, y)                ↔   (f · A)(y)

=>    ∫dxdy f(x) K⁻¹(x, y) f(y)   =   f · K⁻¹ · f

Compare to:

      Σ_i x_i y_i             =  x · y
      Σ_j A_ij x_j            =  (A x)_i
      Σ_ij x_i A⁻¹_ij x_j     =  x · A⁻¹ · x

Integrals are glorified sums!
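Taking “integrals are glorified sums” literally, here is a sketch (an addition, not from the slides; assumes NumPy) that replaces the integrals by sums on a grid and compares the penalty f · K⁻¹ · f for a smooth f and a rough f. The tiny jitter added to K stands in for the fact, noted above, that K⁻¹ doesn’t really exist.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 120)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # kernel matrix on the grid

def penalty(f, jitter=1e-8):
    """Discrete stand-in for the integral f.K^-1.f (jitter regularizes K)."""
    return f @ np.linalg.solve(K + jitter * np.eye(len(x)), f)

envelope = np.exp(-x**2 / 8.0)
f_smooth = envelope                      # slowly varying
f_rough = np.sin(8.0 * x) * envelope     # same envelope, rapid oscillation

print(penalty(f_smooth))   # modest
print(penalty(f_rough))    # orders of magnitude larger (limited only by the jitter)
```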
Our problem:

      f̃ = arg min over f of F( f · p̃ ),      with f · K⁻¹ · f small.

Two notions of small, both handled by solving

      d/df [ F( f · p̃ ) + λ f · K⁻¹ · f ] = 0:

  1. Lagrange multipliers: f · K⁻¹ · f = constant.
  2. λ fixed.
     An aside: λ f · K⁻¹ · f can often be thought of as coming from a prior.
 
      d/df [ F( f · p̃ ) + λ f · K⁻¹ · f ] = 0

is easy to solve:

      F′( f · p̃ ) p̃ + 2λ K⁻¹ f = 0

=>    f = - F′( f · p̃ ) K p̃ / 2λ.

Remember:

      p̃(x) = (1/n) Σ_i δ(x - x_i)

=>    K p̃ (x) = (1/n) Σ_i K(x, x_i).
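Since p̃ is a sum of deltas, the object K p̃ that appears in the solution is just an average of kernel bumps centred on the samples. A minimal sketch (an addition, not from the slides; assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.7, size=50)    # x_i ~ p

def K(a, b):
    return np.exp(-0.5 * (a - b) ** 2)

def K_p(x):
    """(K p~)(x) = (1/n) sum_i K(x, x_i)."""
    return K(x[:, None], samples[None, :]).mean(axis=1)

grid = np.linspace(-2.0, 4.0, 7)
print(np.round(K_p(grid), 3))    # a smooth bump-shaped function, largest near the bulk of the samples
```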
The more general problem:

      {f̃_1, f̃_2, …} = arg min over f_1, f_2, … of F( f_1 · p̃_1, f_2 · p̃_2, … ),

      with the f_i · K⁻¹ · f_i small.

Almost all kernel related problems fall into this class.
Those problems are fully specified by:

- the functional, F( f_1 · p̃_1, f_2 · p̃_2, … ), to be minimized
- what one means by small (e.g., f_i · K⁻¹ · f_i = c_i)

The rest is (typically very straightforward) algebra.
 
Three examples:
 
 
1. A “witness” function.
 
2. Ridge regression.
 
3. Kernel PCA (which is a little bit different).
 
1. A “witness” function.

Maximize

      [ f · (p - q) ]²  =  [ ∫dx f(x) ( p(x) - q(x) ) ]²

subject to the constraint

      f · K⁻¹ · f  =  ∫dxdy f(x) K⁻¹(x, y) f(y)  =  1.

p and q are sums of delta functions.
 
1. A “witness” function.

Maximize [ f · (p - q) ]² subject to the constraint f · K⁻¹ · f = 1.

Lagrange multipliers:

      d/df ( [ f · (p - q) ]²  -  λ f · K⁻¹ · f )  =  0

=>    f  =  K (p - q) / [ (p - q) · K · (p - q) ]^(1/2)

=>    [ f · (p - q) ]²  =  (p - q) · K · (p - q).
1. A “witness” function.

With p(x) = (1/n) Σ_j δ(x - x_j), terms like p · K · p are plain sums:

      p · K · p  =  ∫dxdy p(x) K(x, y) p(y)  =  (1/n²) Σ_ij K(x_i, x_j).

We didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
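Here is a sketch of the witness-function result in code (an addition, not from the slides; assumes NumPy), with p and q given by two sample sets; both the maximized value (p - q) · K · (p - q) and the unnormalized witness K (p - q) are plain sums over samples:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(0.0, 1.0, size=200)    # samples defining p
ys = rng.normal(0.5, 1.0, size=200)    # samples defining q

def K(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def quad(a, b):
    """a.K.b for two sums of deltas: the double average of K over the samples."""
    return K(a, b).mean()

# (p - q).K.(p - q) = p.K.p - 2 p.K.q + q.K.q  -- the maximized value of [f.(p - q)]^2
value = quad(xs, xs) - 2.0 * quad(xs, ys) + quad(ys, ys)
print(value)

def witness(x):
    """Unnormalized witness, (K.(p - q))(x) = mean_i K(x, x_i) - mean_j K(x, y_j)."""
    x = np.atleast_1d(x)
    return K(x, xs).mean(axis=1) - K(x, ys).mean(axis=1)

print(witness(np.array([-1.5, 0.25, 2.0])))   # positive where p has more mass, negative where q does
```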
1. A “witness” function.

That derivation was ~50% of Arthur’s Gatsby job talk.
I do not mean to trivialize the work of kernel people.
But I do want to point out that the setup is almost always straightforward.
 
2. Ridge regression.

Minimize

      Σ_i ( y_i - f · p_i )²  +  λ f · K⁻¹ · f

with respect to f, where

- i labels observations
- the y_i are observed (they’re scalars)
- we have samples from the distributions p_i(x)
- λ is fixed.

Ridge regression (with a kernel twist).
 
2. Ridge regression.

Solution (very straightforward algebra):

      f*  =  Σ_i α_i K p_i

      α_i  =  Σ_j [ (B + λ I)⁻¹ ]_ij y_j            (I is the identity matrix)

      B_ij  =  p_i · K · p_j  =  (1/(n_i n_j)) Σ_mn K(x_m^(i), x_n^(j)),

where the x_m^(i) are the samples from p_i.
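A sketch of this solution in code (an addition, not from the slides; assumes NumPy), in the common special case where each p_i is a single delta at a training input x_i, so that B reduces to the kernel matrix on the inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(-3.0, 3.0, size=30)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=30)
lam = 0.1                                               # the fixed lambda

def K(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

B = K(x_train, x_train)                                 # B_ij = p_i.K.p_j = K(x_i, x_j)
alpha = np.linalg.solve(B + lam * np.eye(len(x_train)), y_train)

def f_star(x):
    """f*(x) = sum_i alpha_i (K p_i)(x) = sum_i alpha_i K(x, x_i)."""
    return K(np.atleast_1d(x), x_train) @ alpha

x_test = np.linspace(-3.0, 3.0, 7)
print(np.round(f_star(x_test), 2))    # should roughly track sin(x_test)
```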
 
2. Ridge regression.

Again: we didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
2. Ridge regression.

This was ~50% of Zoltan’s second to last research talk.
I do not mean to trivialize the work of kernel people.
But I do want to point out that the setup is almost always straightforward.
 
3. Kernel PCA (which is a little bit different).

We have a set of points (in, for instance, Euclidean space),

      z_i ,   i = 1, …, n.

We want to project them into a higher dimensional space, and do PCA in that space.

Why not go to the extreme, and project them into an infinite dimensional space?

      f_i(x) = K(z_i, x)
 
3. Kernel PCA (which is a little bit different).

Now we have a set of points (in function space),

      f_i ,   i = 1, …, n.

We want to find a lower dimensional manifold that captures as much variance as possible. If this were standard PCA, we would minimize

      Σ_i ( f_i - Σ_j A_ij v_j ) · ( f_i - Σ_j A_ij v_j )

with respect to A_ij and v_j.
 
3. Kernel PCA (which is a little bit different).

Remember,

      ( f_i - Σ_k A_ik v_k ) · ( f_i - Σ_k A_ik v_k )

is shorthand for

      ∫dx ( f_i(x) - Σ_j A_ij v_j(x) ) ( f_i(x) - Σ_j A_ij v_j(x) ).

But we can mess with the norm to emphasize smoothness,

      ∫dxdy ( f_i(x) - Σ_j A_ij v_j(x) ) Q⁻¹(x, y) ( f_i(y) - Σ_j A_ij v_j(y) ).
 
3. Kernel PCA (which is a little bit different).

and minimize

      Σ_i ( f_i - Σ_j A_ij v_j ) · Q⁻¹ · ( f_i - Σ_j A_ij v_j )

with respect to A_ij and v_j.

If we set Q = K, we get standard kernel PCA.

That’s the most convenient choice, because it makes it easy to compute A_ij and v_j.

I don’t know if there are any other justifications.
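A sketch of the Q = K case in code (an addition, not from the slides; assumes NumPy): with f_i(x) = K(z_i, x) and the K-weighted norm, everything reduces to an eigendecomposition of the matrix G_ij = K(z_i, z_j), centred so that variance is measured around the mean (a detail the slides don’t dwell on):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=(100, 2))                         # points in Euclidean space

sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-0.5 * sq_dist)                            # G_ij = K(z_i, z_j)

n = len(z)
H = np.eye(n) - np.ones((n, n)) / n                   # centring matrix
Gc = H @ G @ H

eigvals, eigvecs = np.linalg.eigh(Gc)                 # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # largest first

# Coordinates of each point along the top two principal directions.
coords = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0.0))
print(coords.shape)                                   # (100, 2)
```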
 
Summary

Most (almost all?) kernel problems are of the form

      {f̃_1, f̃_2, …} = arg min over f_1, f_2, … of F( f_1 · p̃_1, f_2 · p̃_2, … ),

      with the f_i · K⁻¹ · f_i small.

Specify

- the functional, F( f_1 · p̃_1, f_2 · p̃_2, … ), to be minimized
- what one means by small (e.g., f_i · K⁻¹ · f_i = c_i),

and the rest is (typically very straightforward) algebra.
 
The typical problem:

      d/df [ F( f · p̃ ) + λ f · K⁻¹ · f ] = 0

The solution (two lines of algebra):

      f = - F′( f · p̃ ) K p̃ / 2λ

There is no reason (I can find) to mention RKHSs or mean embeddings.

All quantities one needs arise very naturally as the solution to the problem one has proposed.