Kernels and Perceptrons: A Comprehensive Overview

 
Kernels and perceptrons are fundamental tools in machine learning. This overview covers the perceptron algorithm, the kernel perceptron, and common kernels, along with the duality and computational properties of kernels (RKHS, Gram matrices), hash kernels and the "hash trick", adaptive gradient methods (Adagrad), and parallel learning with AllReduce.

Presentation Transcript


  1. KERNELS AND PERCEPTRONS

  2. The perceptron. An instance xi arrives; compute ŷi = sign(vk . xi). If mistake: vk+1 = vk + yi xi. Here x is a vector and y is -1 or +1. Mistake bound: k <= R^2/γ^2, which depends on how easy the learning problem is (the margin γ and radius R), not on the dimension of the vectors x. Fairly intuitive: the "similarity" of v to u looks like (v.u)/|v.v|; (v.u) grows by >= γ after each mistake, while (v.v) grows by <= R^2.
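A minimal sketch of this mistake-driven update in Python/NumPy. The data arrays, epoch loop, and function name are illustrative assumptions, not part of the deck:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Mistake-driven perceptron: v_{k+1} = v_k + y_i * x_i on each mistake."""
    v = np.zeros(X.shape[1])          # current hypothesis v_k
    mistakes = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):    # y_i is -1 or +1, x_i is a vector
            y_hat = np.sign(v @ x_i)  # compute: yhat_i = sign(v_k . x_i)
            if y_hat != y_i:          # mistake: nudge v toward the correct label
                v = v + y_i * x_i
                mistakes += 1
    return v, mistakes
```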

  3. The kernel perceptron. Keep two sets of stored mistakes, FN (false negatives, x+) and FP (false positives, x-), so that vk = Σ_{x+ in FN} x+  -  Σ_{x- in FP} x-. For an instance xi, compute ŷi = sign(vk . xi) = sign( Σ_{x+ in FN} x+ . xi  -  Σ_{x- in FP} x- . xi ). On a false-positive mistake (prediction too high), add xi to FP; on a false-negative mistake (prediction too low), add xi to FN. Mathematically the same as before, but allows use of the kernel trick.

  4. The kernel perceptron. Same algorithm, but replace every dot product with a kernel evaluation: ŷi = sign( Σ_{x+ in FN} K(xi, x+)  -  Σ_{x- in FP} K(xi, x-) ). On a false-positive mistake (too high), add xi to FP; on a false-negative mistake (too low), add xi to FN. Mathematically the same as before, but allows use of the kernel trick. Other kernel methods (SVMs, Gaussian processes) aren't constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
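A sketch of the kernelized learner: instead of storing v explicitly, keep the mistake examples in FN and FP and score new instances only through K. The function names, epoch loop, and tie-breaking at zero are assumptions for illustration:

```python
def kernel_perceptron_predict(x, FN, FP, K):
    """Score x using only kernel evaluations against stored mistakes."""
    s = sum(K(x, xp) for xp in FN) - sum(K(x, xn) for xn in FP)
    return 1 if s >= 0 else -1

def kernel_perceptron_train(X, y, K, epochs=10):
    FN, FP = [], []                       # false-negative / false-positive examples
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = kernel_perceptron_predict(x_i, FN, FP, K)
            if y_hat != y_i:
                if y_hat > y_i:           # false positive: prediction too high
                    FP.append(x_i)
                else:                     # false negative: prediction too low
                    FN.append(x_i)
    return FN, FP
```

With K = lambda a, b: a @ b this reduces to the ordinary perceptron above; swapping in a polynomial or Gaussian K gives the kernelized learner without ever forming φ(x).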

  5. Some common kernels. Linear kernel: K(x, x') = x . x'. Polynomial kernel: K(x, x') = (x . x' + 1)^d. Gaussian kernel: K(x, x') = exp( -||x - x'||^2 / 2σ^2 ). More later…
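The same three kernels written out as plain NumPy functions; the degree d and bandwidth σ are free parameters, shown here with arbitrary defaults:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                                    # K(x, x') = x . x'

def polynomial_kernel(x, xp, d=3):
    return (x @ xp + 1.0) ** d                       # K(x, x') = (x . x' + 1)^d

def gaussian_kernel(x, xp, sigma=1.0):
    diff = x - xp
    return np.exp(-(diff @ diff) / (2 * sigma**2))   # K(x, x') = exp(-||x - x'||^2 / 2sigma^2)
```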

  6. Kernels 101: duality and computational properties; Reproducing Kernel Hilbert Space (RKHS); Gram matrix; positive semi-definiteness; closure properties.

  7. Kernels 101, Duality: two ways to look at this. (1) Explicitly map from x to φ(x), i.e. to the point corresponding to x in the Hilbert space, and learn a weight vector there: ŷ = sign( w . φ(x) ) with w = Σ_{x+ in FN} φ(x+) - Σ_{x- in FP} φ(x-). (2) Implicitly map from x to φ(x) by changing the kernel function K: ŷ = sign( Σ_{x+ in FN} K(x, x+) - Σ_{x- in FP} K(x, x-) ). Two different computational ways of getting the same behavior.

  8. Kernels 101, Duality. Gram matrix K: kij = K(xi, xj). Since K(x, x') = K(x', x), the Gram matrix is symmetric.

  9. Kernels 101, Duality. Gram matrix K: kij = K(xi, xj). K(x, x') = K(x', x), so the Gram matrix is symmetric. K is positive semi-definite: zT K z >= 0 for all z. Fun fact: the Gram matrix is positive semi-definite if and only if K(xi, xj) = <φ(xi), φ(xj)> for some φ. Proof sketch: φ(x) uses the eigenvectors of K to represent x.
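A quick numeric illustration of the two properties on this slide: build a Gram matrix and check symmetry and positive semi-definiteness through its eigenvalues. The random data and the choice of a Gaussian kernel are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                     # 20 random instances x_i

def gaussian_kernel(x, xp, sigma=1.0):
    d = x - xp
    return np.exp(-(d @ d) / (2 * sigma**2))

# Gram matrix: k_ij = K(x_i, x_j)
G = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(G, G.T))                       # symmetric: K(x, x') = K(x', x)
print(np.linalg.eigvalsh(G).min() >= -1e-9)      # PSD: all eigenvalues >= 0 (up to round-off)
```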

  10. HASH KERNELS AND "THE HASH TRICK"

  11. Question: most of the weights in a classifier are small and not important, so we can use the "hash trick".

  12.

  13. The hash trick as a kernel. Usually we implicitly map from x to φ(x): all computations of the learner are in terms of K(x, x') = <φ(x), φ(x')>, because φ(x) is large. In this case we explicitly map from x to φ(x), because φ(x) is small.

  14. Some details. Use a slightly different hash to avoid systematic bias. Plain hashing: V[h] = Σ_{j: hash(j) % R == h} xi(j). With a sign hash: φ(xi)[h] = Σ_{j: hash(j) % m == h} ξ(j) xi(j), where ξ(j) in {-1, +1}. Here m is the number of buckets you hash into (R in my discussion).
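A sketch of this hashed feature map: one hash picks the bucket, a second hash picks the sign ξ(j) in {-1, +1} to remove systematic bias. Using MD5 as the hash, the salt strings, and the example feature names are all assumptions; a real implementation would use a fast fixed hash such as MurmurHash:

```python
import hashlib
import numpy as np

def _h(s, salt):
    """Deterministic hash of a feature name; a stand-in for e.g. MurmurHash."""
    return int(hashlib.md5((salt + s).encode()).hexdigest(), 16)

def hashed_features(x, m):
    """Map a dict {feature_name: value} to an m-bucket vector:
       phi[h] = sum over j with hash(j) % m == h of xi(j) * x(j), xi(j) in {-1, +1}."""
    phi = np.zeros(m)
    for j, xj in x.items():
        h = _h(j, "bucket") % m                           # which bucket feature j lands in
        xi = 1.0 if _h(j, "sign") % 2 == 0 else -1.0      # sign hash removes systematic bias
        phi[h] += xi * xj
    return phi

phi = hashed_features({"word:kernel": 2.0, "word:perceptron": 1.0}, m=1 << 18)
```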

  15. Some details. Slightly different hash to avoid systematic bias: φ(xi)[h] = Σ_{j: hash(j) % m == h} ξ(j) xi(j), where ξ(j) in {-1, +1}. m is the number of buckets you hash into (R in my discussion).

  16. Some details. I.e., a hashed vector is probably close to the original vector.

  17. Some details. I.e., the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work.
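A small empirical check of this claim, reusing the same hashed map idea from above: compare x . x' with φ(x) . φ(x') for two random sparse vectors that share some features. The vocabulary size, sparsity, bucket count m, and hash choice are all made up for illustration:

```python
import hashlib
import numpy as np

def _h(s, salt):
    return int(hashlib.md5((salt + s).encode()).hexdigest(), 16)

def hashed(x, m):
    phi = np.zeros(m)
    for j, xj in x.items():
        sign = 1.0 if _h(j, "sign") % 2 == 0 else -1.0
        phi[_h(j, "bucket") % m] += sign * xj
    return phi

rng = np.random.default_rng(1)
m = 1 << 16
names = [f"f{j}" for j in rng.choice(10**6, 200, replace=False)]
x  = {k: rng.normal() for k in names[:150]}      # 150 active features
xp = {k: rng.normal() for k in names[50:]}       # shares 100 features with x

exact  = sum(x[k] * xp[k] for k in x.keys() & xp.keys())   # x . x'
approx = hashed(x, m) @ hashed(xp, m)                      # phi(x) . phi(x')
print(round(exact, 3), round(approx, 3))   # usually close when m is large enough
```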

  18. Some details 18

  19. The hash kernel: implementation. One problem: debugging is harder, because features are no longer meaningful, and there's a new way to ruin a classifier: change the hash function. You can separately compute the set of all words that hash to h and guess what features mean, i.e. build an inverted index h → w1, w2, …
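One way to build that inverted index, under the same illustrative hashing scheme sketched earlier: map each bucket h back to the known words that land in it, so a heavily weighted bucket can at least be traced to candidate features. The vocabulary, hash, and function names are assumptions:

```python
import hashlib
from collections import defaultdict

def bucket(word, m):
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % m

def build_inverted_index(vocabulary, m):
    """h -> {w1, w2, ...}: all known words that hash to bucket h."""
    index = defaultdict(set)
    for w in vocabulary:
        index[bucket(w, m)].add(w)
    return index

index = build_inverted_index(["kernel", "perceptron", "hash", "gradient"], m=1 << 18)
# To guess what a heavily weighted bucket h "means", inspect index[h].
```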

  20. ADAPTIVE GRADIENTS

  21. Motivation. What's the best learning rate? If a feature is rare but relevant, its rate should be high, else learning will be slow. Regularization makes this better/worse? But then you could overshoot the local minima when you train.

  22. Motivation. What's the best learning rate? It depends on the typical gradient for a feature: small gradient → fast rate, large gradient → slow rate. Sadly we can't afford to ignore rare features; we could have a lot of them.

  23. Motivation. What's the best learning rate? Let's pretend our observed gradients are from a zero-mean Gaussian and find variances, then scale dimension j by sd(j)^-1.

  24. Motivation. What's the best learning rate? Let's pretend our observed gradients are from a zero-mean Gaussian and find variances, then scale dimension j by sd(j)^-1. Ignore covariances for efficiency.

  25. Motivation. What's the best learning rate? Let's pretend our observed gradients are from a zero-mean Gaussian and find variances, then scale dimension j by sd(j)^-1. Ignore covariances for efficiency.

  26. Adagrad. Using the gradient g_{τ,j} at each time τ and ignoring covariances, the per-coordinate update is wj := wj - η g_{t,j} / sqrt( Σ_{τ<=t} g_{τ,j}^2 ). η = 1 is usually ok.
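A per-coordinate Adagrad step as described on the slide: each weight is scaled by the inverse root of its accumulated squared gradients, so rarely active coordinates keep a large effective rate. The epsilon term, the stand-in random gradients, and the function name are assumptions:

```python
import numpy as np

def adagrad_update(w, g_accum, grad, eta=1.0, eps=1e-8):
    """w_j := w_j - eta * g_{t,j} / sqrt(sum over tau<=t of g_{tau,j}^2)."""
    g_accum += grad ** 2                        # running sum of squared gradients per coordinate
    w -= eta * grad / (np.sqrt(g_accum) + eps)  # rare features get a larger effective rate
    return w, g_accum

w = np.zeros(5)
g_accum = np.zeros(5)
for grad in np.random.default_rng(0).normal(size=(100, 5)):  # placeholder gradients
    w, g_accum = adagrad_update(w, g_accum, grad)
```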

  27. ALL-REDUCE

  28. Introduction. Common pattern: do some learning in parallel (MAP), aggregate local changes from each processor to shared parameters, distribute the new shared parameters back to each processor (ALLREDUCE), and repeat… AllReduce is implemented in MPI, and recently in VW code (John Langford) in a Hadoop-compatible scheme.
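A toy, single-process simulation of this pattern: each "worker" computes a local update on its shard, the updates are aggregated (the all-reduce), and every worker continues from the same combined parameters. Real implementations (MPI's allreduce, VW's Hadoop AllReduce) do the aggregation over the network; the shards, gradient, and step size below are illustrative stand-ins:

```python
import numpy as np

def local_gradient(w, shard):
    """Placeholder for whatever learning each worker does on its data shard."""
    X, y = shard
    return X.T @ (X @ w - y) / len(y)           # e.g. a least-squares gradient

def all_reduce_mean(values):
    """Aggregate local results and give every worker the same combined copy."""
    mean = sum(values) / len(values)
    return [mean.copy() for _ in values]

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(50, 10)), rng.normal(size=50)) for _ in range(4)]
w = np.zeros(10)
for _ in range(20):                                  # and repeat...
    local = [local_gradient(w, s) for s in shards]   # learn in parallel
    g, *_ = all_reduce_mean(local)                   # aggregate; every copy is identical
    w -= 0.1 * g                                     # each worker applies the same update
```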

  29. Gory details of VW Hadoop-AllReduce. Spanning-tree server: a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server. Worker nodes ("fake" mappers): input for each worker is locally cached; workers all connect to the spanning-tree server; workers all execute the same code, which might contain AllReduce calls; workers synchronize whenever they reach an all-reduce.

  30. Hadoop AllReduce: don't wait for duplicate jobs.

  31. Second-order method, like Newton's method.

  32. 2^24 features, ~100 non-zeros/example, 2.3B examples. An example is a user/page/ad triple and conjunctions of these, positive if there was a click-thru on the ad.

  33. 50M examples; explicitly constructed kernel → 11.7M features, 3,300 non-zeros/example. Old method: SVM, 3 days (reporting the time to get to a fixed test error).
