Understanding Optimization Techniques in Neural Networks

 
Optimizing neural networks
 
Usman Roshan
 
How do we find the minimum value of a function?
 
Given f(x), find x that minimizes f(x). This is a fundamental problem with broad applications across many areas.
Let us start with an f(x) that is non-differentiable. For example, the objective of the traveling salesman problem is non-differentiable (see the sketch below).
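To make the non-differentiability concrete, here is a small hypothetical sketch (not from the slides) of the traveling salesman objective: the tour length is a function of a discrete permutation of cities, so there is no gradient to follow.

    import math

    def tour_length(cities, tour):
        # cities: list of (x, y) coordinates; tour: a permutation of city indices.
        # The objective is defined over discrete orderings, so it has no gradient.
        total = 0.0
        for i in range(len(tour)):
            x1, y1 = cities[tour[i]]
            x2, y2 = cities[tour[(i + 1) % len(tour)]]
            total += math.hypot(x2 - x1, y2 - y1)
        return total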
 
Local search
 
Local search is a fundamental search method in machine learning and AI.
Given a non-differentiable objective, we perform local search to find its minimum (see the sketch below).
If the objective is differentiable, the gradient gives us the optimal search direction.
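A minimal local-search (hill-climbing) sketch, assuming the hypothetical tour_length objective from the previous snippet: repeatedly propose a small change to the current solution and keep it only if it lowers the objective.

    import random

    def local_search(cities, iters=10000, seed=0):
        # Hill climbing: start from a random tour, accept only improving swaps.
        rng = random.Random(seed)
        tour = list(range(len(cities)))
        rng.shuffle(tour)
        best = tour_length(cities, tour)   # objective from the previous sketch
        for _ in range(iters):
            i, j = rng.sample(range(len(tour)), 2)
            tour[i], tour[j] = tour[j], tour[i]      # propose a neighboring tour
            cand = tour_length(cities, tour)
            if cand < best:
                best = cand                          # keep the improvement
            else:
                tour[i], tour[j] = tour[j], tour[i]  # undo the swap
        return tour, best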
 
 
Neural network objective
 
Non-linear objective with multiple local minima.
As a result, optimization is much harder than for a convex objective.
Standard approach: gradient descent:
Calculate the first derivatives with respect to each hidden variable.
For the inner layers we use the chain rule, i.e. backpropagation (see google sheet for derivations; a sketch follows below).
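A minimal sketch of one gradient-descent step for a network with a single hidden layer, using the chain rule to get derivatives for the inner layer. The sigmoid activation and squared-error loss are assumptions for illustration, not taken from the slides or the google sheet.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_step(X, y, W1, w2, lr=0.1):
        # Forward pass: one hidden layer, linear output, mean squared error.
        h = sigmoid(X @ W1)        # hidden activations, shape (n, k)
        yhat = h @ w2              # predictions, shape (n,)
        err = yhat - y             # residuals
        # Backward pass: chain rule from the output back through the hidden layer.
        g_w2 = h.T @ err / len(y)
        g_h = np.outer(err, w2) * h * (1.0 - h)   # chain rule through w2 and the sigmoid
        g_W1 = X.T @ g_h / len(y)
        return W1 - lr * g_W1, w2 - lr * g_w2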
 
Gradient descent
 
So we run gradient descent until convergence; then what is the problem?
It may converge to a local minimum and require random restarts (see the sketch below).
Overfitting: a big problem for many years.
How can we prevent overfitting?
How can we explore the search space better without getting stuck in local minima?
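A minimal sketch of gradient descent with random restarts. It assumes generic grad(w, X, y) and loss(w, X, y) functions supplied by the caller; the restart count, learning rate, and iteration budget are illustrative choices, not values from the slides.

    import numpy as np

    def gd_with_restarts(X, y, grad, loss, dim, restarts=5, lr=0.1, iters=1000, seed=0):
        # Run full-batch gradient descent from several random initializations
        # and keep the solution with the lowest loss.
        rng = np.random.default_rng(seed)
        best_w, best_loss = None, np.inf
        for _ in range(restarts):
            w = rng.normal(scale=0.1, size=dim)   # random restart
            for _ in range(iters):
                w = w - lr * grad(w, X, y)        # full-batch gradient step
            cur = loss(w, X, y)
            if cur < best_loss:
                best_w, best_loss = w, cur
        return best_w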
 
Stochastic gradient descent
 
A simple but beautifully powerful idea, popularized in machine learning by Léon Bottou around 2000.
Original SGD:
While not converged:
  Select a single datapoint, in order, from the data
  Compute the gradient with just that one point
  Update the parameters
Pros: broader search of the parameter space.
Cons: the final solution may be poor, and it may be hard to converge (see the sketch below).
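A minimal sketch of the single-datapoint version described on this slide, assuming a per-example grad(w, x_i, y_i) function; the learning rate and epoch count are illustrative.

    def sgd_single_point(X, y, grad, w, lr=0.01, epochs=10):
        # Original SGD: take the datapoints in order, one at a time,
        # and update the parameters after each one.
        for _ in range(epochs):
            for i in range(len(y)):
                w = w - lr * grad(w, X[i], y[i])   # gradient from a single point
        return w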
 
Stochastic gradient descent
 
Mini-batch SGD:
While not converged:
  Select a random batch of datapoints
  Compute the gradient on the batch
  Update the parameters
Mini-batch pros: generally a better solution and better convergence than single-datapoint SGD (see the sketch below).
Batch sizes are usually small.
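A minimal mini-batch variant, assuming a grad(w, X_batch, y_batch) function that averages over the batch; the batch size of 32 is an illustrative choice, not a recommendation from the slides.

    import numpy as np

    def sgd_minibatch(X, y, grad, w, lr=0.01, batch_size=32, epochs=10, seed=0):
        # Mini-batch SGD: update the parameters using the gradient of a
        # small random batch of datapoints at each step.
        rng = np.random.default_rng(seed)
        n = len(y)
        for _ in range(epochs):
            for _ in range(n // batch_size):
                idx = rng.choice(n, size=batch_size, replace=False)  # random batch
                w = w - lr * grad(w, X[idx], y[idx])
        return w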
 
Learning rate
 
Key to the search is the step size (learning rate).
Ideally we start with a somewhat large value (0.1 or 0.01) and reduce it by a power of 10 after a few epochs (see the sketch below).
Adaptive step sizes work best but may slow the search.
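A minimal step-decay schedule matching the recipe on this slide (start around 0.1 or 0.01 and reduce by a power of 10 every few epochs); the drop interval of 5 epochs is an assumption.

    def step_decay_lr(epoch, base_lr=0.1, drop_every=5):
        # Reduce the learning rate by a power of 10 every drop_every epochs.
        return base_lr * (0.1 ** (epoch // drop_every))

    # Example: epochs 0-4 use 0.1, epochs 5-9 use 0.01, epochs 10-14 use 0.001.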
 
Dropout
 
A simple method, introduced in 2014, to prevent overfitting.
Procedure:
During training we decide, with probability p, whether or not to update a node's weights (see the sketch below).
p is typically set to 0.5.
Highly effective in deep learning:
Decreases overfitting
Increases training time
Can be loosely interpreted as an ensemble of networks
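A minimal sketch of a dropout mask applied to a layer's activations during training, with p = 0.5 as on the slide. The inverted-dropout scaling (dividing the kept units by p) is a common convention assumed here, not something stated on the slide.

    import numpy as np

    def dropout(h, p=0.5, rng=None, training=True):
        # Keep each unit with probability p during training and zero it otherwise,
        # scaling the kept units by 1/p so expected activations stay the same.
        if not training:
            return h          # at test time, use the full network
        rng = rng or np.random.default_rng()
        mask = rng.random(h.shape) < p
        return h * mask / p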

