Demystifying Kernels: A Simplified Approach without Complicated Math

Kernels are often confusing, but this talk aims to make them easy to understand. By focusing on intuition rather than complex equations, the speaker explains how kernels relate to familiar linear algebra. The talk covers the basic problem of minimizing a functional of a function, given only samples from a distribution, and introduces regularization and kernels in a straightforward manner, making it accessible even to those new to the topic.


Presentation Transcript


  1. Kernels for dummies. Tea talk, September 22, 2014.

  2. I find kernel talks confusing. Equations seem to appear out of nowhere. It's hard for me to extract the main message. I still don't know what an RKHS is. Which is a little strange, considering I've known about Hilbert spaces since before most of you were born.

  3. So today we're going to attempt to demystify kernels. My claim: if you understand linear algebra, you'll understand kernels. There will be no RKHSs in this talk. And no Polish spaces. And no mean embeddings. But also no proofs; basically I'm going to avoid the hard stuff. Instead, just the intuition!

  4. We'll start with a pretty standard problem: you have samples from some distribution, p(x), and you want to minimize F(∫dx f(x) p(x)) with respect to f(x). I'll give examples later, but just about every kernel talk you have ever heard considers this problem, or a slight generalization.

  5. Our problem: f̃(x) = arg min over f(x) of F(∫dx f(x) p̃(x)), where p̃(x) is built from samples x ~ p.

  6. Our problem: f̃(x) = arg min over f(x) of F(∫dx f(x) p̃(x)), where p̃(x) = (1/n) Σᵢ δ(x − xᵢ) is built from samples xᵢ ~ p.

  7. Our problem: f̃(x) = arg min over f(x) of F(∫dx f(x) p̃(x)), with ∫dx f(x) p̃(x) = (1/n) Σᵢ ∫dx f(x) δ(x − xᵢ). But note: f(x) = +δ(x − x₁) ⇒ ∫dx f(x) p̃(x) = +∞, and f(x) = −δ(x − x₁) ⇒ ∫dx f(x) p̃(x) = −∞. By suitably adjusting f(x), ∫dx f(x) p̃(x) can range from −∞ to +∞. Therefore, we have to regularize f(x): we need a smoothness constraint.

  8. If you're Bayesian, you put a prior over f(x). If you're a kernel person, you demand that ∫dx dy f(x) K⁻¹(x, y) f(y) is in some sense small. K(x, y) is a kernel. For example, K(x, y) = exp(−(x − y)²/2).

  9. An aside: ⟨f, g⟩ = ∫dx dy f(x) K⁻¹(x, y) g(y).

  10. If you're Bayesian, you put a prior over f(x). If you're a kernel person, you demand that ∫dx dy f(x) K⁻¹(x, y) f(y) is in some sense small. K(x, y) is a kernel. For example, K(x, y) = exp(−(x − y)²/2).

  11. This raises two questions: 1. How do we make sense of K⁻¹(x, y)? 2. What does ∫dx dy f(x) K⁻¹(x, y) f(y) have to do with smoothness?

  12. 1. Making sense of K⁻¹(x, y): ∫dy K⁻¹(x, y) K(y, z) = δ(x − z) defines K⁻¹(x, y). Think of K as a matrix; K⁻¹ is its inverse. K has an uncountably infinite number of indices, but otherwise it's a very standard matrix. K⁻¹ exists if all the eigenvalues of K are positive. An aside: K⁻¹ doesn't really exist. But that's irrelevant.
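
To make the matrix picture concrete, here is a minimal numpy sketch (not from the talk): put x on a grid, treat K(x, y) = exp(−(x − y)²/2) as an ordinary matrix, and compare f · K⁻¹ · f for a smooth function and a wiggly one. The grid, the jitter added before inverting (a nod to "K⁻¹ doesn't really exist"), and the two test functions are all made up for illustration.

    import numpy as np

    # Discretize x, so K(x, y) becomes a plain matrix and integrals become sums.
    x = np.linspace(-5.0, 5.0, 200)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # K(x, y) = exp(-(x - y)^2 / 2)

    # K^{-1} barely exists (eigenvalues decay toward 0), so add a small jitter before inverting.
    K_inv = np.linalg.inv(K + 1e-6 * np.eye(len(x)))

    def roughness(f):
        # Discretized version of  integral dx dy f(x) K^{-1}(x, y) f(y); on a grid the
        # factors of the spacing dx cancel, so it is just the matrix quadratic form.
        return f @ K_inv @ f

    f_smooth = np.exp(-0.5 * x ** 2)                    # slowly varying
    f_rough = np.sin(20 * x) * np.exp(-0.5 * x ** 2)    # same envelope, fast wiggles
    print(roughness(f_smooth), roughness(f_rough))      # the wiggly f pays a far larger penalty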

  13. 2. What does ∫dx dy f(x) K⁻¹(x, y) f(y) have to do with smoothness? I'll answer for a specific case: translation-invariant kernels, K(x, y) = K(x − y).

  14. ∫dx dy f(x) K⁻¹(x, y) f(y) = ∫dx dy f(x) K⁻¹(x − y) f(y)   [translation invariance]
      = ∫dk |f(k)|² / K(k)   [Fourier transform],
      where f(k) is the Fourier transform of f(x) and K(k) is the Fourier transform of K(x).

  15. ∫dx dy f(x) K⁻¹(x, y) f(y) = ∫dx dy f(x) K⁻¹(x − y) f(y)   [translation invariance]
      = ∫dk |f(k)|² / K(k)   [Fourier transform].
      For smooth kernels, K(k) falls off rapidly with k. For the above integral to be small, f(k) must fall off rapidly with k. In other words, f(x) must be smooth.

  16. Example: K(x) = exp(−x²/2) ⇒ K(k) ∝ exp(−k²/2), so
      ∫dx dy f(x) K⁻¹(x, y) f(y) = ∫dx dy f(x) K⁻¹(x − y) f(y) = ∫dk |f(k)|² / K(k) ∝ ∫dk |f(k)|² exp(+k²/2).
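
The Fourier version of the same statement can be checked numerically. A small sketch (not from the talk), assuming a Gaussian kernel and an FFT on a finite grid; the frequency cutoff is only there to keep exp(+k²/2) finite in floating point, and the test functions are made up:

    import numpy as np

    x = np.linspace(-20.0, 20.0, 2048, endpoint=False)
    dx = x[1] - x[0]
    k = 2 * np.pi * np.fft.fftfreq(len(x), d=dx)     # angular frequencies on the grid

    def penalty(f, k_cut=10.0):
        # Discretized version of  integral dk |f(k)|^2 exp(+k^2/2), truncated at |k| < k_cut
        # so exp(+k^2/2) stays finite; overall constants are dropped since only the
        # comparison between functions matters.
        f_k = np.fft.fft(f) * dx                     # approximate Fourier transform of f
        mask = np.abs(k) < k_cut
        return np.sum(np.abs(f_k[mask]) ** 2 * np.exp(0.5 * k[mask] ** 2))

    f_smooth = np.exp(-0.5 * x ** 2)                 # slowly varying
    f_rough = np.sin(8 * x) * np.exp(-0.5 * x ** 2)  # same envelope, fast wiggles
    print(penalty(f_smooth), penalty(f_rough))       # the rough one is many orders of magnitude larger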

  17. More generally, ∫dy K(x, y) gₖ(y) = λ(k) gₖ(x) ⇒ ∫dx dy f(x) K⁻¹(x, y) f(y) = ∫dk [∫dx f(x) gₖ(x)]² / λ(k). Typically, λ(k) falls off with k, and the eigenfunctions gₖ(x) become increasingly rough with k.

  18. Finally, let's link this to linear algebra:
      ∫dx f(x) g(x) = f · g           compare to: Σᵢ xᵢ yᵢ = x · y
      ∫dx f(x) A(x, y) = (f · A)(y)   compare to: Σⱼ Aᵢⱼ xⱼ = (A · x)ᵢ
      ∫dx dy f(x) K⁻¹(x, y) f(y) = f · K⁻¹ · f   compare to: Σᵢⱼ xᵢ (A⁻¹)ᵢⱼ xⱼ = x · A⁻¹ · x
      Integrals are glorified sums!
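
"Integrals are glorified sums" can be taken literally: on a grid, ∫dx f(x) g(x) is just a dot product. A tiny sketch (the functions are made up; the closed-form value −1/(2π) is standard calculus):

    import numpy as np

    x = np.linspace(0.0, 1.0, 100001)
    dx = x[1] - x[0]
    f = np.sin(2 * np.pi * x)
    g = x ** 2

    # integral over [0, 1] of f(x) g(x) dx, written as a dot product times the grid spacing
    riemann = np.dot(f, g) * dx
    exact = -1.0 / (2.0 * np.pi)     # closed form for this particular f and g
    print(riemann, exact)            # the two agree to many decimal places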

  19. Our problem: f̃ = arg min over f of F(f · p̃), subject to f · K⁻¹ · f being small. Two notions of small, both handled by solving (d/df)[ F(f · p̃) + λ f · K⁻¹ · f ] = 0: 1. λ is a Lagrange multiplier, chosen so that f · K⁻¹ · f = constant. 2. λ is fixed (an aside: in that case λ f · K⁻¹ · f can often be thought of as coming from a prior).

  20. (d/df)[ F(f · p̃) + λ f · K⁻¹ · f ] = 0 is easy to solve: F′(f · p̃) p̃ + 2λ K⁻¹ · f = 0 ⇒ f = −[F′(f · p̃) / (2λ)] K · p̃. Remember: p̃(x) = (1/n) Σᵢ δ(x − xᵢ) ⇒ (K · p̃)(x) = (1/n) Σᵢ K(x, xᵢ).
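
The object (K · p̃)(x) = (1/n) Σᵢ K(x, xᵢ) is just a sum of kernel bumps centered on the samples. A minimal sketch (the Gaussian kernel and the samples are made up for illustration):

    import numpy as np

    def gauss_kernel(a, b):
        return np.exp(-0.5 * (a - b) ** 2)

    def K_dot_p(x_eval, samples):
        # (K . p~)(x) = (1/n) sum_i K(x, x_i), evaluated at each point in x_eval
        return gauss_kernel(x_eval[:, None], samples[None, :]).mean(axis=1)

    rng = np.random.default_rng(0)
    samples = rng.normal(size=50)            # x_i ~ p, here p is a standard normal
    x_eval = np.linspace(-4.0, 4.0, 9)
    print(K_dot_p(x_eval, samples))          # a smooth bump where the samples concentrate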

  21. The more general problem: {f̃₁, f̃₂, …} = arg min over f₁, f₂, … of F(f₁ · p̃₁, f₂ · p̃₂, …), subject to the fᵢ · K⁻¹ · fᵢ being small. Almost all kernel-related problems fall into this class. Those problems are fully specified by: the functional F(f₁ · p̃₁, f₂ · p̃₂, …) to be minimized, and what one means by small (e.g., fᵢ · K⁻¹ · fᵢ = cᵢ). The rest is (typically very straightforward) algebra.

  22. Three examples: 1. A witness function. 2. Ridge regression. 3. Kernel PCA (which is a little bit different).

  23. 1. A witness function. Maximize [∫dx f(x) (p̃(x) − q̃(x))]² = [f · (p̃ − q̃)]² subject to the constraint ∫dx dy f(x) K⁻¹(x, y) f(y) = f · K⁻¹ · f = 1. Here p̃ and q̃ are sums of delta functions.

  24. 1. A witness function. Maximize [f · (p̃ − q̃)]² subject to the constraint f · K⁻¹ · f = 1. Lagrange multipliers: (d/df)( [f · (p̃ − q̃)]² − λ f · K⁻¹ · f ) = 0 ⇒ f = K · (p̃ − q̃) / [(p̃ − q̃) · K · (p̃ − q̃)]^(1/2) ⇒ [f · (p̃ − q̃)]² = (p̃ − q̃) · K · (p̃ − q̃).

  25. 1. A witness function. p̃ · K · p̃ = ∫dx dy p̃(x) K(x, y) p̃(y), with p̃(x) = (1/n) Σᵢ δ(x − xᵢ) and p̃(y) = (1/n) Σⱼ δ(y − xⱼ), so p̃ · K · p̃ = (1/n²) Σᵢⱼ K(xᵢ, xⱼ). We didn't mention RKHSs. We didn't mention mean embeddings. All we did was linear algebra.

  26. 1. A witness function. (d/df)( [f · (p̃ − q̃)]² − λ f · K⁻¹ · f ) = 0 ⇒ f = K · (p̃ − q̃) / [(p̃ − q̃) · K · (p̃ − q̃)]^(1/2) ⇒ [f · (p̃ − q̃)]² = (p̃ − q̃) · K · (p̃ − q̃). That's ~50% of Arthur's Gatsby job talk. I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.
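
Everything in the witness-function example reduces to sums of kernel evaluations over the two sample sets. A sketch under made-up data (Gaussian kernel, Gaussian samples); the normalization by [(p̃ − q̃) · K · (p̃ − q̃)]^(1/2) is omitted since it only rescales f:

    import numpy as np

    def gauss_kernel(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

    rng = np.random.default_rng(1)
    xs = rng.normal(0.0, 1.0, size=100)      # samples defining p~
    ys = rng.normal(0.5, 1.0, size=100)      # samples defining q~

    def mean_k(a, b):
        # a~ . K . b~ = (1 / (n_a n_b)) sum_{ij} K(a_i, b_j)
        return gauss_kernel(a, b).mean()

    # (p~ - q~) . K . (p~ - q~), expanded into three double sums
    gap = mean_k(xs, xs) - 2 * mean_k(xs, ys) + mean_k(ys, ys)

    # the unnormalized witness function K . (p~ - q~), evaluated on a grid
    grid = np.linspace(-4.0, 4.0, 9)
    witness = gauss_kernel(grid, xs).mean(axis=1) - gauss_kernel(grid, ys).mean(axis=1)
    print(gap)
    print(witness)   # positive where p puts more mass than q, negative where q dominates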

  27. 2. Ridge regression. Minimize Σᵢ (yᵢ − f · p̃ᵢ)² + λ f · K⁻¹ · f with respect to f. Here i labels observations, the yᵢ are observed (they're scalars), we have samples from the distributions pᵢ(x), and λ is fixed. Ridge regression (with a kernel twist).

  28. 2. Ridge regression. Solution (very straightforward algebra): f* = Σᵢ αᵢ K · p̃ᵢ, with αᵢ = Σⱼ [(B + λI)⁻¹]ᵢⱼ yⱼ, where I is the identity matrix and Bᵢⱼ = p̃ᵢ · K · p̃ⱼ = (1/(nᵢ nⱼ)) Σₘₙ K(xₘ, xₙ), the xₘ being samples from pᵢ and the xₙ samples from pⱼ.

  29. 2. Ridge regression. Solution (very straightforward algebra): f* = Σᵢ αᵢ K · p̃ᵢ, with αᵢ = Σⱼ [(B + λI)⁻¹]ᵢⱼ yⱼ. We didn't mention RKHSs. We didn't mention mean embeddings. All we did was linear algebra.

  30. 2. Ridge regression. Minimize Σᵢ (yᵢ − f · p̃ᵢ)² + λ f · K⁻¹ · f with respect to f: f* = Σᵢ αᵢ K · p̃ᵢ, with αᵢ = Σⱼ [(B + λI)⁻¹]ᵢⱼ yⱼ. That's ~50% of Zoltan's second-to-last research talk. I do not mean to trivialize the work of kernel people. But I do want to point out that the setup is almost always straightforward.
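
The ridge-regression recipe is a few lines of numpy. A sketch of the common special case where each p̃ᵢ is a single point xᵢ, so Bᵢⱼ = K(xᵢ, xⱼ); the kernel, the data, and λ = 0.1 are made up for illustration:

    import numpy as np

    def gauss_kernel(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

    rng = np.random.default_rng(2)
    x_train = rng.uniform(-3.0, 3.0, size=30)
    y_train = np.sin(x_train) + 0.1 * rng.normal(size=30)

    lam = 0.1
    B = gauss_kernel(x_train, x_train)                                  # B_ij = p~_i . K . p~_j
    alpha = np.linalg.solve(B + lam * np.eye(len(x_train)), y_train)    # alpha = (B + lam I)^{-1} y

    def f_star(x):
        # f*(x) = sum_i alpha_i (K . p~_i)(x) = sum_i alpha_i K(x, x_i)
        return gauss_kernel(x, x_train) @ alpha

    x_test = np.linspace(-3.0, 3.0, 7)
    print(f_star(x_test))     # should roughly track sin(x_test)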

  31. 3. Kernel PCA (which is a little bit different). We have a set of points (in, for instance, Euclidean space), zᵢ, i = 1, …, n. We want to project them into a higher-dimensional space, and do PCA in that space. Why not go to the extreme, and project them into an infinite-dimensional space? fᵢ(x) = K(zᵢ, x).

  32. 3. Kernel PCA (which is a little bit different). Now we have a set of points (in function space), fᵢ, i = 1, …, n. We want to find a lower-dimensional manifold that captures as much variance as possible. If this were standard PCA, we would minimize Σᵢ (fᵢ − Σⱼ Aᵢⱼ vⱼ) · (fᵢ − Σⱼ Aᵢⱼ vⱼ) with respect to Aᵢⱼ and vⱼ.

  33. 3. Kernel PCA (which is a little bit different). Remember, (fᵢ − Σⱼ Aᵢⱼ vⱼ) · (fᵢ − Σⱼ Aᵢⱼ vⱼ) is shorthand for ∫dx (fᵢ(x) − Σⱼ Aᵢⱼ vⱼ(x)) (fᵢ(x) − Σⱼ Aᵢⱼ vⱼ(x)). But we can mess with the norm to emphasize smoothness: ∫dx dy (fᵢ(x) − Σⱼ Aᵢⱼ vⱼ(x)) Q⁻¹(x, y) (fᵢ(y) − Σⱼ Aᵢⱼ vⱼ(y)).

  34. 3. Kernel PCA (which is a little bit different). ...and minimize Σᵢ (fᵢ − Σⱼ Aᵢⱼ vⱼ) · Q⁻¹ · (fᵢ − Σⱼ Aᵢⱼ vⱼ) with respect to Aᵢⱼ and vⱼ. If we set Q = K, we get standard kernel PCA. That's the most convenient choice, because it makes it easy to compute Aᵢⱼ and vⱼ. I don't know if there are any other justifications.
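
For completeness, a sketch of the Q = K case. In practice everything reduces to the Gram matrix K(zᵢ, zⱼ): center it, take its top eigenvectors, and read off the projections. The data, the kernel, and the choice of two components are made up; the centering and square-root-of-eigenvalue scaling are the standard kernel-PCA bookkeeping rather than anything derived on the slide:

    import numpy as np

    def gauss_kernel(a, b):
        # a, b: arrays of points in R^d, shapes (n, d) and (m, d)
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2)

    rng = np.random.default_rng(3)
    z = rng.normal(size=(100, 2))             # the points z_i

    G = gauss_kernel(z, z)                    # Gram matrix K(z_i, z_j)
    n = len(z)
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Gc = H @ G @ H                            # centered Gram matrix

    evals, evecs = np.linalg.eigh(Gc)         # eigenvalues in ascending order
    top_vecs = evecs[:, -2:][:, ::-1]         # top two eigenvectors (largest first)
    top_vals = np.clip(evals[-2:][::-1], 0.0, None)
    proj = top_vecs * np.sqrt(top_vals)       # projection of each z_i onto the top two components
    print(proj[:5])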

  35. Summary. Most (almost all?) kernel problems are of the form {f̃₁, f̃₂, …} = arg min over f₁, f₂, … of F(f₁ · p̃₁, f₂ · p̃₂, …), subject to the fᵢ · K⁻¹ · fᵢ being small. Specify the functional F(f₁ · p̃₁, f₂ · p̃₂, …) to be minimized and what one means by small (e.g., fᵢ · K⁻¹ · fᵢ = cᵢ), and the rest is (typically very straightforward) algebra.

  36. The typical problem: (d/df)[ F(f · p̃) + λ f · K⁻¹ · f ] = 0. The solution (two lines of algebra): f = −[F′(f · p̃) / (2λ)] K · p̃.

  37. There is no reason (I can find) to mention RKHSs or mean embeddings. All quantities one needs arise very naturally as the solution to the problem one has proposed.
