Kernels and Perceptrons: A Comprehensive Overview

 
Kernels and perceptrons are fundamental tools in machine learning. This overview covers the perceptron algorithm, the kernel perceptron, and common kernels, along with the duality and computational properties of kernels (RKHS, Gram matrices), hash kernels and the "hash trick", adaptive gradient methods (Adagrad), and parallel learning with AllReduce.

Presentation Transcript


  1. KERNELS AND PERCEPTRONS

  2. The perceptron. An instance xi arrives; compute ŷi = sign(vk . xi). If mistake: vk+1 = vk + yi xi. Here x is a vector and y is -1 or +1. Mistake bound: k <= R^2/γ^2, which depends on how easy the learning problem is (the margin γ and radius R), not on the dimension of the vectors x. Fairly intuitive: the "similarity" of v to u looks like (v.u)/|v.v|; (v.u) grows by >= γ after each mistake, while (v.v) grows by <= R^2.
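A minimal sketch of this mistake-driven update in Python/NumPy. The data arrays, epoch loop, and function name are illustrative assumptions, not part of the deck:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Mistake-driven perceptron: v_{k+1} = v_k + y_i * x_i on each mistake."""
    v = np.zeros(X.shape[1])          # current hypothesis v_k
    mistakes = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):    # y_i is -1 or +1, x_i is a vector
            y_hat = np.sign(v @ x_i)  # compute: yhat_i = sign(v_k . x_i)
            if y_hat != y_i:          # mistake: nudge v toward the correct label
                v = v + y_i * x_i
                mistakes += 1
    return v, mistakes
```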

  3. The kernel perceptron. Keep two sets of stored mistakes, FN (false negatives, x+) and FP (false positives, x-), so that vk = Σ_{x+ in FN} x+  -  Σ_{x- in FP} x-. For an instance xi, compute ŷi = sign(vk . xi) = sign( Σ_{x+ in FN} x+ . xi  -  Σ_{x- in FP} x- . xi ). On a false-positive mistake (prediction too high), add xi to FP; on a false-negative mistake (prediction too low), add xi to FN. Mathematically the same as before, but allows use of the kernel trick.

  4. The kernel perceptron. Same algorithm, but replace every dot product with a kernel evaluation: ŷi = sign( Σ_{x+ in FN} K(xi, x+)  -  Σ_{x- in FP} K(xi, x-) ). On a false-positive mistake (too high), add xi to FP; on a false-negative mistake (too low), add xi to FN. Mathematically the same as before, but allows use of the kernel trick. Other kernel methods (SVMs, Gaussian processes) aren't constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
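A sketch of the kernelized learner: instead of storing v explicitly, keep the mistake examples in FN and FP and score new instances only through K. The function names, epoch loop, and tie-breaking at zero are assumptions for illustration:

```python
def kernel_perceptron_predict(x, FN, FP, K):
    """Score x using only kernel evaluations against stored mistakes."""
    s = sum(K(x, xp) for xp in FN) - sum(K(x, xn) for xn in FP)
    return 1 if s >= 0 else -1

def kernel_perceptron_train(X, y, K, epochs=10):
    FN, FP = [], []                       # false-negative / false-positive examples
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = kernel_perceptron_predict(x_i, FN, FP, K)
            if y_hat != y_i:
                if y_hat > y_i:           # false positive: prediction too high
                    FP.append(x_i)
                else:                     # false negative: prediction too low
                    FN.append(x_i)
    return FN, FP
```

With K = lambda a, b: a @ b this reduces to the ordinary perceptron above; swapping in a polynomial or Gaussian K gives the kernelized learner without ever forming φ(x).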

  5. Some common kernels. Linear kernel: K(x, x') = x . x'. Polynomial kernel: K(x, x') = (x . x' + 1)^d. Gaussian kernel: K(x, x') = exp( -||x - x'||^2 / 2σ^2 ). More later…
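The same three kernels written out as plain NumPy functions; the degree d and bandwidth σ are free parameters, shown here with arbitrary defaults:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                                    # K(x, x') = x . x'

def polynomial_kernel(x, xp, d=3):
    return (x @ xp + 1.0) ** d                       # K(x, x') = (x . x' + 1)^d

def gaussian_kernel(x, xp, sigma=1.0):
    diff = x - xp
    return np.exp(-(diff @ diff) / (2 * sigma**2))   # K(x, x') = exp(-||x - x'||^2 / 2sigma^2)
```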

  6. Kernels 101: duality and computational properties; Reproducing Kernel Hilbert Space (RKHS); Gram matrix; positive semi-definiteness; closure properties.

  7. Kernels 101, Duality: two ways to look at this. (1) Explicitly map from x to φ(x), i.e. to the point corresponding to x in the Hilbert space, and learn a weight vector there: ŷ = sign( w . φ(x) ) with w = Σ_{x+ in FN} φ(x+) - Σ_{x- in FP} φ(x-). (2) Implicitly map from x to φ(x) by changing the kernel function K: ŷ = sign( Σ_{x+ in FN} K(x, x+) - Σ_{x- in FP} K(x, x-) ). Two different computational ways of getting the same behavior.

  8. Kernels 101, Duality. Gram matrix K: kij = K(xi, xj). Since K(x, x') = K(x', x), the Gram matrix is symmetric.

  9. Kernels 101, Duality. Gram matrix K: kij = K(xi, xj). K(x, x') = K(x', x), so the Gram matrix is symmetric. K is positive semi-definite: zT K z >= 0 for all z. Fun fact: the Gram matrix is positive semi-definite if and only if K(xi, xj) = <φ(xi), φ(xj)> for some φ. Proof sketch: φ(x) uses the eigenvectors of K to represent x.
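A quick numeric illustration of the two properties on this slide: build a Gram matrix and check symmetry and positive semi-definiteness through its eigenvalues. The random data and the choice of a Gaussian kernel are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                     # 20 random instances x_i

def gaussian_kernel(x, xp, sigma=1.0):
    d = x - xp
    return np.exp(-(d @ d) / (2 * sigma**2))

# Gram matrix: k_ij = K(x_i, x_j)
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(G, G.T))                       # symmetric: K(x, x') = K(x', x)
print(np.linalg.eigvalsh(G).min() >= -1e-9)      # PSD: all eigenvalues >= 0 (up to round-off)
```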

  10. HASH KERNELS AND "THE HASH TRICK"

  11. Question: most of the weights in a classifier are small and not important, so we can use the "hash trick".

  12.

  13. The hash trick as a kernel. Usually we implicitly map from x to φ(x): all computations of the learner are in terms of K(x, x') = <φ(x), φ(x')>, because φ(x) is large. In this case we explicitly map from x to φ(x), because φ(x) is small.

  14. Some details. Use a slightly different hash to avoid systematic bias. Plain hashing: V[h] = Σ_{j: hash(j) % R == h} xi(j). With a sign hash: φ(xi)[h] = Σ_{j: hash(j) % m == h} ξ(j) xi(j), where ξ(j) in {-1, +1}. Here m is the number of buckets you hash into (R in my discussion).
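A sketch of this hashed feature map: one hash picks the bucket, a second hash picks the sign ξ(j) in {-1, +1} to remove systematic bias. Using MD5 as the hash, the salt strings, and the example feature names are all assumptions; a real implementation would use a fast fixed hash such as MurmurHash:

```python
import hashlib
import numpy as np

def _h(s, salt):
    """Deterministic hash of a feature name; a stand-in for e.g. MurmurHash."""
    return int(hashlib.md5((salt + s).encode()).hexdigest(), 16)

def hashed_features(x, m):
    """Map a dict {feature_name: value} to an m-bucket vector:
       phi[h] = sum over j with hash(j) % m == h of xi(j) * x(j), xi(j) in {-1, +1}."""
    phi = np.zeros(m)
    for j, xj in x.items():
        h = _h(j, "bucket") % m                           # which bucket feature j lands in
        xi = 1.0 if _h(j, "sign") % 2 == 0 else -1.0      # sign hash removes systematic bias
        phi[h] += xi * xj
    return phi

phi = hashed_features({"word:kernel": 2.0, "word:perceptron": 1.0}, m=1 << 18)
```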

  15. Some details. Slightly different hash to avoid systematic bias: φ(xi)[h] = Σ_{j: hash(j) % m == h} ξ(j) xi(j), where ξ(j) in {-1, +1}. m is the number of buckets you hash into (R in my discussion).

  16. Some details. I.e., a hashed vector is probably close to the original vector.

  17. Some details. I.e., the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work.
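A small empirical check of this claim, reusing the same hashed map idea from above: compare x . x' with φ(x) . φ(x') for two random sparse vectors that share some features. The vocabulary size, sparsity, bucket count m, and hash choice are all made up for illustration:

```python
import hashlib
import numpy as np

def _h(s, salt):
    return int(hashlib.md5((salt + s).encode()).hexdigest(), 16)

def hashed(x, m):
    phi = np.zeros(m)
    for j, xj in x.items():
        sign = 1.0 if _h(j, "sign") % 2 == 0 else -1.0
        phi[_h(j, "bucket") % m] += sign * xj
    return phi

rng = np.random.default_rng(1)
m = 1 << 16
names = [f"f{j}" for j in rng.choice(10**6, 200, replace=False)]
x  = {k: rng.normal() for k in names[:150]}      # 150 active features
xp = {k: rng.normal() for k in names[50:]}       # shares 100 features with x

exact  = sum(x[k] * xp[k] for k in x.keys() & xp.keys())   # x . x'
approx = hashed(x, m) @ hashed(xp, m)                      # phi(x) . phi(x')
print(round(exact, 3), round(approx, 3))   # usually close when m is large enough
```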

  18. Some details 18

  19. The hash kernel: implementation. One problem: debugging is harder, because features are no longer meaningful, and there's a new way to ruin a classifier: change the hash function. You can separately compute the set of all words that hash to h and guess what features mean, i.e. build an inverted index h → w1, w2, …
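One way to build that inverted index, under the same illustrative hashing scheme sketched earlier: map each bucket h back to the known words that land in it, so a heavily weighted bucket can at least be traced to candidate features. The vocabulary, hash, and function names are assumptions:

```python
import hashlib
from collections import defaultdict

def bucket(word, m):
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % m

def build_inverted_index(vocabulary, m):
    """h -> {w1, w2, ...}: all known words that hash to bucket h."""
    index = defaultdict(set)
    for w in vocabulary:
        index[bucket(w, m)].add(w)
    return index

index = build_inverted_index(["kernel", "perceptron", "hash", "gradient"], m=1 << 18)
# To guess what a heavily weighted bucket h "means", inspect index[h].
```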

  20. ADAPTIVE GRADIENTS

  21. Motivation. What's the best learning rate? If a feature is rare but relevant, its rate should be high, else learning will be slow. Regularization makes this better/worse? But then you could overshoot the local minima when you train.

  22. Motivation. What's the best learning rate? It depends on the typical gradient for a feature: small gradient → fast rate, large gradient → slow rate. Sadly we can't afford to ignore rare features; we could have a lot of them.

  23. Motivation. What's the best learning rate? Let's pretend our observed gradients are from a zero-mean Gaussian and find variances, then scale dimension j by sd(j)^-1.

  24. Motivation. What's the best learning rate? Let's pretend our observed gradients are from a zero-mean Gaussian and find variances, then scale dimension j by sd(j)^-1. Ignore covariances for efficiency.

  25. Motivation. What's the best learning rate? Let's pretend our observed gradients are from a zero-mean Gaussian and find variances, then scale dimension j by sd(j)^-1. Ignore covariances for efficiency.

  26. Adagrad. Using the gradient g_{τ,j} at each time τ and ignoring covariances, the per-coordinate update is wj := wj - η g_{t,j} / sqrt( Σ_{τ<=t} g_{τ,j}^2 ). η = 1 is usually ok.
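A per-coordinate Adagrad step as described on the slide: each weight is scaled by the inverse root of its accumulated squared gradients, so rarely active coordinates keep a large effective rate. The epsilon term, the stand-in random gradients, and the function name are assumptions:

```python
import numpy as np

def adagrad_update(w, g_accum, grad, eta=1.0, eps=1e-8):
    """w_j := w_j - eta * g_{t,j} / sqrt(sum over tau<=t of g_{tau,j}^2)."""
    g_accum += grad ** 2                        # running sum of squared gradients per coordinate
    w -= eta * grad / (np.sqrt(g_accum) + eps)  # rare features get a larger effective rate
    return w, g_accum

w = np.zeros(5)
g_accum = np.zeros(5)
for grad in np.random.default_rng(0).normal(size=(100, 5)):  # placeholder gradients
    w, g_accum = adagrad_update(w, g_accum, grad)
```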

  27. ALL-REDUCE

  28. Introduction. Common pattern: do some learning in parallel (MAP), aggregate local changes from each processor to shared parameters, distribute the new shared parameters back to each processor (ALLREDUCE), and repeat… AllReduce is implemented in MPI, and recently in VW code (John Langford) in a Hadoop-compatible scheme.
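A toy, single-process simulation of this pattern: each "worker" computes a local update on its shard, the updates are aggregated (the all-reduce), and every worker continues from the same combined parameters. Real implementations (MPI's allreduce, VW's Hadoop AllReduce) do the aggregation over the network; the shards, gradient, and step size below are illustrative stand-ins:

```python
import numpy as np

def local_gradient(w, shard):
    """Placeholder for whatever learning each worker does on its data shard."""
    X, y = shard
    return X.T @ (X @ w - y) / len(y)           # e.g. a least-squares gradient

def all_reduce_mean(values):
    """Aggregate local results and give every worker the same combined copy."""
    mean = sum(values) / len(values)
    return [mean.copy() for _ in values]

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(50, 10)), rng.normal(size=50)) for _ in range(4)]
w = np.zeros(10)
for _ in range(20):                                  # and repeat...
    local = [local_gradient(w, s) for s in shards]   # learn in parallel
    g, *_ = all_reduce_mean(local)                   # aggregate; every copy is identical
    w -= 0.1 * g                                     # each worker applies the same update
```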

  29. Gory details of VW Hadoop-AllReduce. Spanning-tree server: a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server. Worker nodes ("fake" mappers): input for each worker is locally cached; workers all connect to the spanning-tree server; workers all execute the same code, which might contain AllReduce calls; workers synchronize whenever they reach an all-reduce.

  30. Hadoop AllReduce: don't wait for duplicate jobs.

  31. Second-order method, like Newton's method.

  32. 2^24 features, ~100 non-zeros/example, 2.3B examples. An example is a user/page/ad triple and conjunctions of these, positive if there was a click-thru on the ad.

  33. 50M examples; explicitly constructed kernel → 11.7M features, 3,300 non-zeros/example. Old method: SVM, 3 days (reporting the time to get to a fixed test error).
