Machine learning optimization
Usman Roshan
Machine learning
Mathematical programming optimization for convex functions
Comes down to gradient descent
Stochastic gradient descent leads to even faster search
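A standard comparison of the two updates (the notation η, the per-example loss ℓ, and the sum decomposition are assumptions, not stated on the slide): for an objective that is a sum of per-example losses over n training points,

\[
w_{t+1} = w_t - \eta \sum_{i=1}^{n} \nabla_w \ell(w_t; x_i, y_i) \qquad \text{(gradient descent: one full pass over the data per step)}
\]
\[
w_{t+1} = w_t - \eta \, \nabla_w \ell(w_t; x_{i_t}, y_{i_t}), \quad i_t \text{ chosen at random} \qquad \text{(stochastic gradient descent: one example per step)}
\]

Each stochastic step is cheaper by a factor of n, which is where the faster search comes from.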
Gradient descent
Given a convex function f(x,y), how can we find the x and y that minimize it?
Solution is gradient descent
The gradient vector points in the direction of greatest increase in the function.
Therefore the solution is to move in small increments in the negative direction of the gradient until we reach the optimum.
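In symbols, the update implied here (with step size δ, as used on the next slide) is

\[
(x_{t+1}, y_{t+1}) = (x_t, y_t) - \delta \, \nabla f(x_t, y_t).
\]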
Gradient descent example
Minimize f(x) = 1 + (x-5)²
Derivative w.r.t. x: df/dx = F(x) = 2(x-5)
Algorithm:
δ = .1
start with a random value of x (e.g. x = 10)
prev = f(x)
while (!stop) {
  x = x - δF(x)
  if (f(x) > prev) { stop = true; }
  prev = f(x)
}
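A runnable version of this pseudocode (a minimal Python sketch; the function and variable names are illustrative):

def gradient_descent(f, F, x, delta=0.1, max_iter=1000):
    """Move against the gradient F of f until f stops decreasing."""
    prev = f(x)
    for _ in range(max_iter):
        x = x - delta * F(x)      # step in the negative gradient direction
        if f(x) > prev:           # objective went up, so stop (as in the pseudocode)
            break
        prev = f(x)
    return x

# The slide's example: f(x) = 1 + (x - 5)^2 with F(x) = 2(x - 5)
f = lambda x: 1 + (x - 5) ** 2
F = lambda x: 2 * (x - 5)
print(gradient_descent(f, F, x=10.0))  # approaches the minimizer x = 5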
Gradient descent example
Initialize x = 10, f(x) = 26, prev = 26
Step 1: F(x) = 10, x = 10 - .1(10) = 9, f(x) = 17, prev = 17
Step 2: F(x) = 8, x = 9 - .1(8) = 8.2, f(x) = 11.24, prev = 11.24
Step 3: F(x) = 6.4, x = 8.2 - .1(6.4) = 7.56, f(x) = 7.5536, prev = 7.5536
Step 4: F(x) = 5.12, x = 7.56 - .1(5.12) = 7.048, f(x) = 5.19, prev = 5.19
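These iterates can be reproduced with a short loop (a sketch reusing the example f and F defined in the snippet above; the printing format is illustrative):

# Reproduce the four steps above for f(x) = 1 + (x - 5)^2, F(x) = 2(x - 5)
x, delta = 10.0, 0.1
print("Initialize: x =", x, "f(x) =", f(x))
for step in range(1, 5):
    g = F(x)               # gradient at the current point
    x = x - delta * g      # gradient step
    print(f"Step {step}: F(x) = {g:g}, x = {x:g}, f(x) = {f(x):g}")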
Constrained optimization
Most machine learning problems require optimizing a convex function with convex constraints
In this case we apply Lagrange multipliers
The KKT theorem is their generalization (see handout)
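As a quick reminder of the machinery (standard form, not taken from the handout): to minimize f(w) subject to g(w) = 0, form the Lagrangian

\[
L(w, \alpha) = f(w) + \alpha\, g(w), \qquad \nabla_w L(w, \alpha) = 0, \quad g(w) = 0,
\]

and solve the resulting system for w and the multiplier α. For inequality constraints g_i(w) ≤ 0, the KKT conditions additionally require α_i ≥ 0 and complementary slackness α_i g_i(w) = 0.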
Direct optimization of NP-hard problems
K-means is the classic example for unsupervised learning
For supervised learning consider finding the plane that gives the lowest number of misclassified points.
Why would we want to find such planes?
Reduces the effect of outliers
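One way to write that supervised objective (the notation is assumed, not given on the slide): with labels y_i ∈ {-1, +1}, find the plane (w, w_0) that minimizes the 0-1 loss

\[
\min_{w, w_0} \; \sum_{i=1}^{n} \mathbf{1}\!\left[\, y_i (w^\top x_i + w_0) \le 0 \,\right],
\]

i.e. the count of misclassified points. Unlike a hinge or squared loss, each point contributes at most 1 no matter how far it lies from the plane, which is why outliers have a reduced effect; the price is a non-convex, NP-hard objective.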
Direct optimization motivation
How do we directly optimize?
Today we look at local search
Local search
Coordinate descent:
Fix all coordinates w_i for i = 1 to d
Randomly pick one w_i and optimize
Make incremental changes in w_i until there is no change
Pick a different random w_i and optimize again.
Repeat until no change in objective after circling through all w_i.
A minimal sketch of this procedure is given below.
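For concreteness, here is one way coordinate descent can look in code (a Python sketch; the objective, step size, and stopping rule are illustrative assumptions, not the authors' implementation):

import random

def coordinate_descent(objective, w, step=0.01, max_sweeps=100):
    """Optimize one randomly chosen coordinate at a time by incremental changes."""
    d = len(w)
    best = objective(w)
    for _ in range(max_sweeps):
        improved = False
        for i in random.sample(range(d), d):   # visit coordinates in random order
            while True:
                # try nudging w[i] up or down by one increment
                trials = []
                for delta in (step, -step):
                    trial = list(w)
                    trial[i] += delta
                    trials.append((objective(trial), trial))
                val, trial = min(trials)
                if val < best:                 # keep the change if the objective improves
                    best, w = val, trial
                    improved = True
                else:
                    break                      # this coordinate cannot be improved further
        if not improved:                       # a full sweep with no change: stop
            break
    return w

# Illustrative use on a smooth objective (the slides target the non-convex 0-1 loss).
print(coordinate_descent(lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2, [0.0, 0.0]))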
Local minima
A major problem for optimizing NP-hard problems
How to overcome?
Random restarts (a sketch follows below)
Simulated annealing
MCMC
Iterated local search
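For instance, random restarts can wrap the coordinate-descent sketch above: run the local search from several random starting points and keep the best local optimum found (a hedged illustration, reusing the earlier hypothetical functions):

def random_restarts(objective, d, n_restarts=10):
    """Run local search from several random initial points and keep the best result."""
    best_w, best_val = None, float("inf")
    for _ in range(n_restarts):
        w0 = [random.uniform(-1, 1) for _ in range(d)]   # random starting point
        w = coordinate_descent(objective, w0)
        val = objective(w)
        if val < best_val:                               # keep the best local optimum
            best_w, best_val = w, val
    return best_w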
Iterated local search
Results
Data: 33 randomly selected datasets from the UCI repository
Some were multi-class and converted into binary (largest class separated from the others)
We created ten random 90:10 splits for each dataset. All model parameters were learnt on the training dataset.
Results
Average error of three objective functions:
w_increment = .01: margin = 14.6%, error only = 23.4%, margin + error = 13.6%
w_increment = 100: margin = 14.3%, error only = 15.5%, margin + error = 13.2%
liblinear = 12.4%
Comparable runtime: both BSP and liblinear finish within seconds, but liblinear is faster