Linear SVMs for Binary Classification

 
Support Vector Machines
Kernels
 
based on notes by Andrew Ng and others
Problem definition:

- We are given a set of n points (vectors) x1, x2, ..., xn, such that each xi is a vector of length m, and each belongs to one of two classes, which we label "+1" and "-1".

- So our training set is: (x1, y1), (x2, y2), ..., (xn, yn), with xi ∈ Rm and yi ∈ {+1, -1}.

- We want to find a separating hyperplane wTx + b = 0 that separates these points into the two classes: the positives (class "+1") and the negatives (class "-1"), assuming that they are linearly separable.
So the decision function will be

    f(x) = sign(wTx + b)
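As a quick aside (not part of the original slides), a minimal NumPy sketch of this decision rule, with made-up weights w and bias b chosen only for illustration:

```python
import numpy as np

# Hypothetical parameters of a separating hyperplane w.x + b = 0 (illustrative only).
w = np.array([2.0, -1.0])
b = -0.5

def f(x):
    """Linear SVM decision rule: f(x) = sign(w.x + b)."""
    return np.sign(w @ x + b)

print(f(np.array([1.0, 0.0])))   #  1.0 -> predicted class +1
print(f(np.array([0.0, 1.0])))   # -1.0 -> predicted class -1
```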
Linear Separators
Which of the linear separators is optimal?
Classification Margin

Distance from example xi to the separator is

    r = |wTxi + b| / ||w||

Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the distance between support vectors.
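As an illustrative check with made-up numbers (not from the original slides): for w = (2, -1), b = -0.5 and xi = (1, 0), we get wTxi + b = 2·1 - 1·0 - 0.5 = 1.5 and ||w|| = √5 ≈ 2.24, so the distance is r = |1.5| / 2.24 ≈ 0.67.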
Maximum Margin Classification
Maximizing the margin is good according to intuition and
PAC theory.
Implies that only support vectors matter; other training
examples are ignorable.
 
Linear SVM Mathematically

Let the training set {(xi, yi)}, i = 1..n, xi ∈ Rd, yi ∈ {-1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):

    wTxi + b ≤ -ρ/2   if yi = -1
    wTxi + b ≥  ρ/2   if yi = +1

or, equivalently:  yi(wTxi + b) ≥ ρ/2

For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is

    r = ys(wTxs + b) / ||w|| = 1 / ||w||

Then the margin can be expressed through the (rescaled) w and b as:

    ρ = 2r = 2 / ||w||
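The following scikit-learn sketch (not from the original slides) fits an approximately hard-margin linear SVM on a tiny, assumed-separable toy dataset and reads the margin off the learned w as ρ = 2/||w||; the very large C is an assumption used to approximate the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]
print("w =", w, " b =", clf.intercept_[0])
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))   # close to 2.0 for this data
print("support vectors:\n", clf.support_vectors_)
```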
 
 
Linear SVMs Mathematically (cont.)

Then we can formulate the quadratic optimization problem:

    Find w and b such that  ρ = 2/||w||  is maximized
    and for all (xi, yi), i = 1..n:   yi(wTxi + b) ≥ 1

Which can be reformulated as:

    Find w and b such that  Φ(w) = ||w||² = wTw  is minimized
    and for all (xi, yi), i = 1..n:   yi(wTxi + b) ≥ 1
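For concreteness, here is a hedged sketch (not from the original slides) that feeds this primal formulation to a generic constrained solver, scipy's SLSQP, on the same toy data; minimizing ½wTw is equivalent to minimizing wTw:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(v):
    w = v[:2]              # v = (w1, w2, b)
    return 0.5 * (w @ w)   # minimize (1/2)||w||^2

# One constraint per training point: y_i (w.x_i + b) - 1 >= 0
constraints = [
    {'type': 'ineq', 'fun': lambda v, i=i: y[i] * (v[:2] @ X[i] + v[2]) - 1.0}
    for i in range(len(X))
]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))
```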
 
 
 
 
 
SVM: Linear separable case. The optimization problem:

- Our optimization problem so far:

    minimize  ||w||²/2
    s.t.  yi(wTxi + b) ≥ 1,  i = 1..n

- We will solve this problem by introducing Lagrange multipliers αi ≥ 0 associated with the constraints:

    minimize  Lp(w, b, α) = ½||w||² - Σi αi (yi(wTxi + b) - 1),   αi ≥ 0

("I do remember the Lagrange Multipliers from Calculus!")
 
constrained optimization

Given convex functions f, g1, g2, ..., gk, h1, h2, ..., hm on a convex set X, the problem

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i
                hj(w) = 0,  for all j

has as its solution a convex set, and the solution is unique (if it exists) when f is strictly convex.

We will assume all the good things one can imagine about the functions f, g1, g2, ..., gk, h1, h2, ..., hm, like convexity, differentiability etc. That will still not be enough though...
 
equality constraints only

    minimize    f(w)
    subject to  hj(w) = 0,  for all j

Define the Lagrangian function

    L(w, β) = f(w) + Σj βj hj(w)

Lagrange theorem: necessary and sufficient conditions for a point w* to be an optimum (i.e. a solution for the problem above) are the existence of β* such that

    ∇w L(w*, β*) = 0 ;   ∂L(w*, β*)/∂βj = 0  for all j
Saddle Point
 
 
 
inequality constraints

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i
                hj(w) = 0,  for all j

We can rewrite every equality constraint hj(w) = 0 as two inequalities, hj(w) ≤ 0 and hj(w) ≥ 0, so the problem becomes

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i
 
Karush-Kuhn-Tucker theorem

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i

where the gi are qualified constraints. Define the Lagrangian function

    L(w, α) = f(w) + Σi αi gi(w)

KKT theorem: necessary and sufficient conditions for a point w* to be a solution of the optimization problem are the existence of α* such that

    ∇w L(w*, α*) = 0 ;   αi* gi(w*) = 0 ;   gi(w*) ≤ 0 ;   αi* ≥ 0
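A small illustrative check of these conditions (not from the original notes): minimize f(w) = (w - 2)² subject to g(w) = w - 1 ≤ 0. The Lagrangian is L(w, α) = (w - 2)² + α(w - 1); at the candidate w* = 1, stationarity 2(w* - 2) + α* = 0 gives α* = 2 ≥ 0, complementarity holds since α* g(w*) = 2·0 = 0, and g(w*) = 0 ≤ 0, so all KKT conditions are satisfied and w* = 1 is the solution.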
 
KKT - sufficiency

Assume (w*, α*) satisfies the KKT conditions

    ∇w L(w*, α*) = ∇w f(w*) + Σi=1..k αi* ∇w gi(w*) = 0
    αi* gi(w*) = 0 ;   gi(w*) ≤ 0 ;   αi* ≥ 0

Then, for any feasible w,

    f(w) ≥ f(w*) + (∇w f(w*))T (w - w*)
         = f(w*) - Σi=1..k αi* (∇w gi(w*))T (w - w*)
         ≥ f(w*) - Σi=1..k αi* (gi(w) - gi(w*))
         = f(w*) - Σi=1..k αi* gi(w)  ≥  f(w*)

so w* is a solution.
 
saddle point

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i

and the Lagrangian function L(w, α) = f(w) + Σi αi gi(w).

(w*, α*) with αi* ≥ 0 is a saddle point if, for all (w, α) with αi ≥ 0,

    L(w*, α) ≤ L(w*, α*) ≤ L(w, α*)
[Figure: 3-D surface plot of the Lagrangian L(w, α) for a one-dimensional example, axes w and alpha, illustrating the saddle point.]
 
saddle point - sufficiency

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i

Lagrangian function L(w, α) = f(w) + Σi αi gi(w).

If (w*, α*) is a saddle point, i.e. for all (w, α) with αi ≥ 0

    L(w*, α) ≤ L(w*, α*) ≤ L(w, α*)

then
1. w* is a solution to the optimization problem
2. αi* gi(w*) = 0 for all i
 
saddle point - necessity

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i

where the gi are qualified constraints; Lagrangian function L(w, α) = f(w) + Σi αi gi(w).

If w* is a solution to the optimization problem, then there exists α* ≥ 0 such that (w*, α*) is a saddle point:

    for all (w, α) with αi ≥ 0:   L(w*, α) ≤ L(w*, α*) ≤ L(w, α*)
 
constraint qualifications

    minimize    f(w)
    subject to  gi(w) ≤ 0,  for all i

Let Υ be the feasible region, Υ = {w | gi(w) ≤ 0 for all i}. Then the following additional conditions on the functions gi are connected ((i) ⇒ (ii), and (iii) ⇒ (i)):

(i) there exists w ∈ Υ such that gi(w) < 0 for all i
(ii) for all nonzero α ∈ [0, 1)^k there exists w ∈ Υ such that Σi αi gi(w) ≠ 0
(iii) the feasible region Υ contains at least two distinct elements, and there exists w ∈ Υ such that all gi are strictly convex at w w.r.t. Υ
 
Theoretical Justification for Maximum Margins

Vapnik has proved the following:

The class of optimal linear separators has VC dimension h bounded from above as

    h ≤ min( ⌈D²/ρ²⌉, m0 ) + 1

where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.

Intuitively, this implies that regardless of dimensionality m0 we can minimize the VC dimension by maximizing the margin ρ.

Thus, complexity of the classifier is kept small regardless of dimensionality.
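As an illustrative numeric check with made-up values (not from the original slides): for D = 10, ρ = 2 and m0 = 10⁵, the bound gives h ≤ min(⌈100/4⌉, 10⁵) + 1 = 26, independent of the 10⁵-dimensional input space.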
 
Soft Margin Classification

What if the training set is not linearly separable?

Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
 
Soft Margin Classification Mathematically

The old formulation:

    Find w and b such that  Φ(w) = wTw  is minimized
    and for all (xi, yi), i = 1..n:   yi(wTxi + b) ≥ 1

Modified formulation incorporates slack variables:

    Find w and b such that  Φ(w) = wTw + C Σ ξi  is minimized
    and for all (xi, yi), i = 1..n:   yi(wTxi + b) ≥ 1 - ξi,   ξi ≥ 0

Parameter C can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
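A hedged scikit-learn sketch (not from the original slides) of this trade-off on assumed, synthetically generated data; smaller C tolerates more slack and gives a wider margin, larger C fits the training data more tightly:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, slightly overlapping two-class data (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1.5, rng.randn(20, 2) + 1.5])
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin={margin:.2f}, #support vectors={len(clf.support_)}")
```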
 
Soft Margin Classification – Solution

Dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for slack variables C Σ ξi² were used in the primal objective; we would then need additional Lagrange multipliers for the slack variables):

    Find α1 … αN such that
    Q(α) = Σ αi - ½ ΣΣ αiαj yiyj xiTxj  is maximized and
    (1)  Σ αiyi = 0
    (2)  0 ≤ αi ≤ C for all αi

Again, xi with non-zero αi will be support vectors.

Solution to the dual problem is:

    w = Σ αiyixi
    b = yk(1 - ξk) - Σ αiyixiTxk    for any k s.t. αk > 0

Again, we don't need to compute w explicitly for classification:

    f(x) = Σ αiyixiTx + b
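The point that w need not be computed explicitly can be checked with a short scikit-learn sketch (not from the original slides); dual_coef_ stores αi·yi for the support vectors, so f(x) = Σ αiyi xiTx + b can be evaluated directly:

```python
import numpy as np
from sklearn.svm import SVC

# Small illustrative dataset (assumed for this sketch).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [2.0, 1.0], [1.5, 0.2]])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel='linear', C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, support vectors only
sv = clf.support_vectors_
b = clf.intercept_[0]

x_new = np.array([1.0, 0.5])
f_manual = np.sum(alpha_y * (sv @ x_new)) + b          # f(x) = sum_i alpha_i y_i x_i.x + b
f_sklearn = clf.decision_function(x_new.reshape(1, -1))[0]
print(f_manual, f_sklearn)                             # the two values agree
```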
 
Non-linear SVMs

Datasets that are linearly separable with some noise work out great.

But what are we going to do if the dataset is just too hard?

How about… mapping data to a higher-dimensional space?

[Figures: 1-D data on the x axis: separable with noise; too hard to separate; separable again after mapping to the (x, x²) plane.]
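A minimal sketch of this idea (not from the original slides), using the assumed mapping x → (x, x²) on 1-D data that is not separable on the line:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: negatives in the middle, positives on both sides (illustrative only).
x = np.array([-3.0, -2.5, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 2.5, 3.0])
y = np.where(np.abs(x) > 1.5, 1, -1)

# Map to a higher-dimensional feature space: x -> (x, x^2).
phi = np.column_stack([x, x ** 2])

clf = SVC(kernel='linear', C=1e6).fit(phi, y)
print("training accuracy:", clf.score(phi, y))   # 1.0: separable after the mapping
```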
 
Non-linear SVMs: Feature spaces

General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

    Φ:  x → φ(x)
 
 
 
What Functions are Kernels?

For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)T φ(xj) can be cumbersome.

Mercer's theorem: Every semi-positive definite symmetric function is a kernel.

Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix:

    K = | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xN) |
        | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xN) |
        |    …         …         …      …     …     |
        | K(xN,x1)  K(xN,x2)  K(xN,x3)  …  K(xN,xN) |
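A hedged NumPy sketch (not from the original slides) that builds a Gram matrix for an assumed RBF kernel and checks the Mercer conditions numerically, i.e. symmetry and non-negative eigenvalues:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Gram matrix K for a few illustrative points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
K = np.array([[rbf(a, b) for b in X] for a in X])

print("symmetric:", np.allclose(K, K.T))
print("eigenvalues:", np.linalg.eigvalsh(K))   # all >= 0 up to round-off
```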
 
 
Positive Definite Matrices

A square matrix A is positive definite if xTAx > 0 for all nonzero column vectors x.
It is negative definite if xTAx < 0 for all nonzero x.
It is positive semi-definite if xTAx ≥ 0 for all x.
And negative semi-definite if xTAx ≤ 0 for all x.
 
 
 
 
 
 
 
SVM For Multi-class classification (more than two classes)

There are two basic approaches to solve q-class problems (q > 2) with SVMs ([10],[11]):

1- One vs. Others:
Works by constructing a "regular" SVM gi(x) = wiTx + bi for each class i that separates that class from all the other classes (class "i" positive and "not i" negative). Then we check the output of each of the q SVM classifiers for our input and choose the class i whose corresponding SVM has the maximum output.

2- Pairwise (one vs one):
We construct a "regular" SVM for each pair of classes (so we construct q(q-1)/2 SVMs). Then we use a "max-wins" voting strategy: we test each SVM on the input, and each time an SVM chooses a certain class we add a vote to that class. Then we choose the class with the highest number of votes.
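Both strategies are available off the shelf in scikit-learn; a hedged sketch (not from the original slides, using the iris data purely as an example) is:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # q = 3 classes

# 1) One vs. Others: q binary SVMs, pick the class with the largest output.
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)

# 2) Pairwise (one vs one): q(q-1)/2 binary SVMs with max-wins voting.
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)

print(len(ovr.estimators_), "one-vs-rest SVMs")   # 3
print(len(ovo.estimators_), "one-vs-one SVMs")    # 3*(3-1)/2 = 3
print(ovr.predict(X[:2]), ovo.predict(X[:2]))
```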
 
Interactive media and animations
 
SVM Applets
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
http://www.smartlab.dibe.unige.it/Files/sw/Applet%20SVM/svmapplet.html
http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html
http://www.dsl-lab.org/svm_tutorial/demo.html (requires Java 3D)
 
Animations
Support Vector Machines:
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2moons.avi
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2Gauss.avi
http://www.youtube.com/watch?v=3liCbRZPrZA
Support Vector Regression:
http://www.cs.ust.hk/irproj/Regularization%20Path/movie/ga0.5lam1.avi
SVM Implementations
 
LibSVM
SVMLight
Weka
Scikit-learn – Python
Others