Demystifying Kernels: A Simplified Approach without Complicated Math

 
Kernels for dummies
 
Tea talk
September 22, 2014
 
I find kernel talks confusing.
 
- Equations seem to appear out of nowhere.
- It’s hard for me to extract the main message.
- I still don’t know what an RKHS is.
- Which is a little strange, considering I’ve known about Hilbert spaces since before most of you were born.
 
So today we’re going to attempt to demystify kernels.
 
My claim: if you understand linear algebra, you’ll understand kernels.
 
There will be no RKHSs in this talk.
And no Polish spaces.
And no mean embeddings.
But also no proofs; basically I’m going to avoid the hard
stuff.
 
Instead, just the intuition!
 
We’ll start with a pretty standard problem:
 
- You have samples from some distribution, p(x).
- You want to minimize

      F( ∫dx f(x) p(x) )

  with respect to f(x).
 
I’ll give examples later, but just about every kernel talk
you have ever heard considers this problem, or a slight
generalization.
Our problem:

      f̃(x) = arg min over f(x) of F( ∫dx f(x) p̃(x) ),      x ~ p,

where the empirical distribution is

      p̃(x) = (1/n) Σ_i δ(x - x_i).

For example,

      f(x) = +δ(x - x_1)   =>   ∫dx f(x) p̃(x) = +∞
      f(x) = -δ(x - x_1)   =>   ∫dx f(x) p̃(x) = -∞.

By suitably adjusting f(x), ∫dx f(x) p̃(x) can range from -∞ to +∞.

Therefore, we have to regularize f(x): we need a smoothness constraint.
 
If you’re Bayesian, you put a prior over f(x).

If you’re a kernel person you demand that

      ∫dxdy f(x) K⁻¹(x, y) f(y)

is in some sense small.

K(x, y) is a kernel. For example, K(x, y) = exp(-(x - y)²/2).
 
An aside: this defines an inner product,

      <f, g> = ∫dxdy f(x) K⁻¹(x, y) g(y).
 
 
This raises two questions:

  1. How do we make sense of K⁻¹(x, y)?
  2. What does ∫dxdy f(x) K⁻¹(x, y) f(y) have to do with smoothness?
 
1. Making sense of K⁻¹(x, y):

      ∫dy K⁻¹(x, y) K(y, z) = δ(x - z)

defines K⁻¹(x, y).

Think of K as a matrix. K⁻¹ is its inverse.

K has an uncountably infinite number of indices.
But otherwise it’s a very standard matrix.

K⁻¹ exists if all the eigenvalues of K are positive.

An aside: K⁻¹ doesn’t really exist. But that’s irrelevant.
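To make “K as a matrix” concrete, here is a minimal numerical sketch (an added illustration, not part of the original slides; it assumes NumPy and the Gaussian kernel above, evaluated on a small, coarse grid):

```python
import numpy as np

# Evaluate K(x, y) = exp(-(x - y)^2 / 2) on a coarse grid; the result is an
# ordinary symmetric matrix with one row and one column per grid point.
x = np.arange(-6.0, 7.0, 2.0)                        # 7 grid points
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # K[i, j] = K(x_i, x_j)

print(np.linalg.eigvalsh(K))                         # all eigenvalues positive

K_inv = np.linalg.inv(K)
print(np.abs(K @ K_inv - np.eye(len(x))).max())      # ~1e-16: the discrete delta(x - z)

# On a finer and finer grid the smallest eigenvalues sink toward zero, which
# is the numerical face of the aside that K^-1 doesn't really exist.
```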
 
2. What does

      ∫dxdy f(x) K⁻¹(x, y) f(y)

have to do with smoothness?

I’ll answer for a specific case: translation invariant kernels,

      K(x, y) = K(x - y).
 
 
 
 
      ∫dxdy f(x) K⁻¹(x, y) f(y)

            = ∫dxdy f(x) K⁻¹(x - y) f(y)        (translation invariance)

            = ∫dk |f̃(k)|² / K̃(k)                (Fourier transform)

where f̃(k) is the Fourier transform of f(x) and K̃(k) is the Fourier transform of K(x).
 
 
 
 
- For smooth kernels, K̃(k) falls off rapidly with k.
- For the above integral to be small, f̃(k) must fall off rapidly with k.
- In other words, f(x) must be smooth.
 
 
 
 
Example:

      K(x) = exp(-x²/2)   =>   K̃(k) ∝ exp(-k²/2),

so the penalty becomes

      ∫dk |f̃(k)|² / K̃(k) ∝ ∫dk |f̃(k)|² exp(+k²/2).
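To see how sharply this penalizes roughness, here is a quick worked example (an addition, not from the slides), for a Gaussian bump f of width σ, with constants and factors of 2π dropped:

```latex
f(x) = e^{-x^2/2\sigma^2}
  \;\Rightarrow\; |\tilde f(k)|^2 \propto \sigma^2 e^{-\sigma^2 k^2}
  \;\Rightarrow\;
  \int dk\, |\tilde f(k)|^2 e^{+k^2/2}
  \propto \sigma^2 \int dk\, e^{-(\sigma^2 - 1/2)k^2}
  = \sigma^2 \sqrt{\frac{\pi}{\sigma^2 - 1/2}} .
```

The penalty is finite only for σ² > 1/2, and it blows up as σ² approaches 1/2 from above: bumps much narrower (rougher) than the kernel width pay an enormous, eventually infinite, price.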
 
More generally, let the g_k(x) be the eigenfunctions of the kernel,

      ∫dy K(x, y) g_k(y) = λ(k) g_k(x)

=>

      ∫dxdy f(x) K⁻¹(x, y) f(y) = ∫dk [ ∫dx f(x) g_k(x) ]² / λ(k).

Typically,

- λ(k) falls off with k
- the g_k(x) become increasingly rough with k.
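A minimal numerical sketch of this (an added illustration, not in the original slides; assumes NumPy): discretize the Gaussian kernel on a grid and look at the eigenvectors of the resulting matrix, which play the role of the g_k(x).

```python
import numpy as np

# Eigen-decompose a discretized Gaussian kernel.  Eigenvectors attached to
# large eigenvalues are smooth; those attached to small eigenvalues oscillate
# rapidly -- the discrete analogue of "lambda(k) falls off with k while
# g_k(x) becomes increasingly rough".
x = np.linspace(-5.0, 5.0, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # K[i, j] = K(x_i, x_j)

eigvals, eigvecs = np.linalg.eigh(K)                 # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first

def sign_changes(v):
    """Count zero crossings: a crude roughness measure."""
    return int(np.sum(np.diff(np.sign(v)) != 0))

for k in (0, 1, 2, 5, 10, 20):
    print(f"lambda_{k} = {eigvals[k]:.2e},  sign changes in g_{k}: {sign_changes(eigvecs[:, k])}")
```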
 
Finally, let’s link this to linear algebra:

      ∫dx f(x) g(x)                   ↔   f · g
      ∫dx f(x) A(x, y)                ↔   (f · A)(y)

=>    ∫dxdy f(x) K⁻¹(x, y) f(y)   =   f · K⁻¹ · f

Compare to:

      Σ_i x_i y_i             =  x · y
      Σ_j A_ij x_j            =  (A x)_i
      Σ_ij x_i A⁻¹_ij x_j     =  x · A⁻¹ · x

Integrals are glorified sums!
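Taking “integrals are glorified sums” literally, here is a sketch (an addition, not from the slides; assumes NumPy) that replaces the integrals by sums on a grid and compares the penalty f · K⁻¹ · f for a smooth f and a rough f. The tiny jitter added to K stands in for the fact, noted above, that K⁻¹ doesn’t really exist.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 120)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # kernel matrix on the grid

def penalty(f, jitter=1e-8):
    """Discrete stand-in for the integral f.K^-1.f (jitter regularizes K)."""
    return f @ np.linalg.solve(K + jitter * np.eye(len(x)), f)

envelope = np.exp(-x**2 / 8.0)
f_smooth = envelope                      # slowly varying
f_rough = np.sin(8.0 * x) * envelope     # same envelope, rapid oscillation

print(penalty(f_smooth))   # modest
print(penalty(f_rough))    # orders of magnitude larger (limited only by the jitter)
```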
Our problem:

      f̃ = arg min over f of F( f · p̃ ),      with f · K⁻¹ · f small.

Two notions of small, both handled by solving

      d/df [ F( f · p̃ ) + λ f · K⁻¹ · f ] = 0:

  1. Lagrange multipliers: f · K⁻¹ · f = constant.
  2. λ fixed.
     An aside: λ f · K⁻¹ · f can often be thought of as coming from a prior.
 
      d/df [ F( f · p̃ ) + λ f · K⁻¹ · f ] = 0

is easy to solve:

      F′( f · p̃ ) p̃ + 2λ K⁻¹ f = 0

=>    f = - F′( f · p̃ ) K p̃ / 2λ.

Remember:

      p̃(x) = (1/n) Σ_i δ(x - x_i)

=>    K p̃ (x) = (1/n) Σ_i K(x, x_i).
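Since p̃ is a sum of deltas, the object K p̃ that appears in the solution is just an average of kernel bumps centred on the samples. A minimal sketch (an addition, not from the slides; assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.7, size=50)    # x_i ~ p

def K(a, b):
    return np.exp(-0.5 * (a - b) ** 2)

def K_p(x):
    """(K p~)(x) = (1/n) sum_i K(x, x_i)."""
    return K(x[:, None], samples[None, :]).mean(axis=1)

grid = np.linspace(-2.0, 4.0, 7)
print(np.round(K_p(grid), 3))    # a smooth bump-shaped function, largest near the bulk of the samples
```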
The more general problem:

      {f̃_1, f̃_2, …} = arg min over f_1, f_2, … of F( f_1 · p̃_1, f_2 · p̃_2, … ),

      with the f_i · K⁻¹ · f_i small.

Almost all kernel related problems fall into this class.
Those problems are fully specified by:

- the functional, F( f_1 · p̃_1, f_2 · p̃_2, … ), to be minimized
- what one means by small (e.g., f_i · K⁻¹ · f_i = c_i)

The rest is (typically very straightforward) algebra.
 
Three examples:
 
 
1. A “witness” function.
 
2. Ridge regression.
 
3. Kernel PCA (which is a little bit different).
 
1. A “witness” function.

Maximize

      [ f · (p - q) ]²  =  [ ∫dx f(x) ( p(x) - q(x) ) ]²

subject to the constraint

      f · K⁻¹ · f  =  ∫dxdy f(x) K⁻¹(x, y) f(y)  =  1.

p and q are sums of delta functions.
 
1. A “witness” function.

Maximize [ f · (p - q) ]² subject to the constraint f · K⁻¹ · f = 1.

Lagrange multipliers:

      d/df ( [ f · (p - q) ]²  -  λ f · K⁻¹ · f )  =  0

=>    f  =  K (p - q) / [ (p - q) · K · (p - q) ]^(1/2)

=>    [ f · (p - q) ]²  =  (p - q) · K · (p - q).
1. A “witness” function.

With p(x) = (1/n) Σ_j δ(x - x_j), terms like p · K · p are plain sums:

      p · K · p  =  ∫dxdy p(x) K(x, y) p(y)  =  (1/n²) Σ_ij K(x_i, x_j).

We didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
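Here is a sketch of the witness-function result in code (an addition, not from the slides; assumes NumPy), with p and q given by two sample sets; both the maximized value (p - q) · K · (p - q) and the unnormalized witness K (p - q) are plain sums over samples:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(0.0, 1.0, size=200)    # samples defining p
ys = rng.normal(0.5, 1.0, size=200)    # samples defining q

def K(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def quad(a, b):
    """a.K.b for two sums of deltas: the double average of K over the samples."""
    return K(a, b).mean()

# (p - q).K.(p - q) = p.K.p - 2 p.K.q + q.K.q  -- the maximized value of [f.(p - q)]^2
value = quad(xs, xs) - 2.0 * quad(xs, ys) + quad(ys, ys)
print(value)

def witness(x):
    """Unnormalized witness, (K.(p - q))(x) = mean_i K(x, x_i) - mean_j K(x, y_j)."""
    x = np.atleast_1d(x)
    return K(x, xs).mean(axis=1) - K(x, ys).mean(axis=1)

print(witness(np.array([-1.5, 0.25, 2.0])))   # positive where p has more mass, negative where q does
```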
1. A “witness” function.

That derivation was ~50% of Arthur’s Gatsby job talk.
I do not mean to trivialize the work of kernel people.
But I do want to point out that the setup is almost always straightforward.
 
2. Ridge regression.

Minimize

      Σ_i ( y_i - f · p_i )²  +  λ f · K⁻¹ · f

with respect to f, where

- i labels observations
- the y_i are observed (they’re scalars)
- we have samples from the distributions p_i(x)
- λ is fixed.

Ridge regression (with a kernel twist).
 
2. Ridge regression.

Solution (very straightforward algebra):

      f*  =  Σ_i α_i K p_i

      α_i  =  Σ_j [ (B + λ I)⁻¹ ]_ij y_j            (I is the identity matrix)

      B_ij  =  p_i · K · p_j  =  (1/(n_i n_j)) Σ_mn K(x_m^(i), x_n^(j)),

where the x_m^(i) are the samples from p_i.
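A sketch of this solution in code (an addition, not from the slides; assumes NumPy), in the common special case where each p_i is a single delta at a training input x_i, so that B reduces to the kernel matrix on the inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(-3.0, 3.0, size=30)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=30)
lam = 0.1                                               # the fixed lambda

def K(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

B = K(x_train, x_train)                                 # B_ij = p_i.K.p_j = K(x_i, x_j)
alpha = np.linalg.solve(B + lam * np.eye(len(x_train)), y_train)

def f_star(x):
    """f*(x) = sum_i alpha_i (K p_i)(x) = sum_i alpha_i K(x, x_i)."""
    return K(np.atleast_1d(x), x_train) @ alpha

x_test = np.linspace(-3.0, 3.0, 7)
print(np.round(f_star(x_test), 2))    # should roughly track sin(x_test)
```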
 
2. Ridge regression.

Again: we didn’t mention RKHSs.
We didn’t mention mean embeddings.
All we did was linear algebra.
2. Ridge regression.

This was ~50% of Zoltan’s second to last research talk.
I do not mean to trivialize the work of kernel people.
But I do want to point out that the setup is almost always straightforward.
 
3. Kernel PCA (which is a little bit different).

We have a set of points (in, for instance, Euclidean space),

      z_i ,   i = 1, …, n.

We want to project them into a higher dimensional space, and do PCA in that space.

Why not go to the extreme, and project them into an infinite dimensional space?

      f_i(x) = K(z_i, x)
 
3. Kernel PCA (which is a little bit different).

Now we have a set of points (in function space),

      f_i ,   i = 1, …, n.

We want to find a lower dimensional manifold that captures as much variance as possible. If this were standard PCA, we would minimize

      Σ_i ( f_i - Σ_j A_ij v_j ) · ( f_i - Σ_j A_ij v_j )

with respect to A_ij and v_j.
 
3. Kernel PCA (which is a little bit different).

Remember,

      ( f_i - Σ_k A_ik v_k ) · ( f_i - Σ_k A_ik v_k )

is shorthand for

      ∫dx ( f_i(x) - Σ_j A_ij v_j(x) ) ( f_i(x) - Σ_j A_ij v_j(x) ).

But we can mess with the norm to emphasize smoothness,

      ∫dxdy ( f_i(x) - Σ_j A_ij v_j(x) ) Q⁻¹(x, y) ( f_i(y) - Σ_j A_ij v_j(y) ).
 
3. Kernel PCA (which is a little bit different).

and minimize

      Σ_i ( f_i - Σ_j A_ij v_j ) · Q⁻¹ · ( f_i - Σ_j A_ij v_j )

with respect to A_ij and v_j.

If we set Q = K, we get standard kernel PCA.

That’s the most convenient choice, because it makes it easy to compute A_ij and v_j.

I don’t know if there are any other justifications.
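A sketch of the Q = K case in code (an addition, not from the slides; assumes NumPy): with f_i(x) = K(z_i, x) and the K-weighted norm, everything reduces to an eigendecomposition of the matrix G_ij = K(z_i, z_j), centred so that variance is measured around the mean (a detail the slides don’t dwell on):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=(100, 2))                         # points in Euclidean space

sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-0.5 * sq_dist)                            # G_ij = K(z_i, z_j)

n = len(z)
H = np.eye(n) - np.ones((n, n)) / n                   # centring matrix
Gc = H @ G @ H

eigvals, eigvecs = np.linalg.eigh(Gc)                 # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # largest first

# Coordinates of each point along the top two principal directions.
coords = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0.0))
print(coords.shape)                                   # (100, 2)
```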
 
Summary

Most (almost all?) kernel problems are of the form

      {f̃_1, f̃_2, …} = arg min over f_1, f_2, … of F( f_1 · p̃_1, f_2 · p̃_2, … ),

      with the f_i · K⁻¹ · f_i small.

Specify

- the functional, F( f_1 · p̃_1, f_2 · p̃_2, … ), to be minimized
- what one means by small (e.g., f_i · K⁻¹ · f_i = c_i),

and the rest is (typically very straightforward) algebra.
 
The typical problem:

      d/df [ F( f · p̃ ) + λ f · K⁻¹ · f ] = 0

The solution (two lines of algebra):

      f = - F′( f · p̃ ) K p̃ / 2λ

There is no reason (I can find) to mention RKHSs or mean embeddings.

All quantities one needs arise very naturally as the solution to the problem one has proposed.