Understanding Optimization Techniques in Neural Networks

Optimization is essential in neural networks to find the minimum value of the objective function. Techniques such as local search, gradient descent, and stochastic gradient descent are used to minimize non-linear objectives with multiple local minima. Challenges such as overfitting and getting stuck in local minima are addressed with methods like random restarts, mini-batch SGD, and dropout.


Presentation Transcript


  1. Optimizing neural networks Usman Roshan

  2. How do we find the minimum value of a function? Given f(x), find the x that minimizes f(x). This is a fundamental problem with broad applications across many areas. Let us start with an f(x) that is non-differentiable; for example, the objective of the traveling salesman problem is non-differentiable.

  3. Local search Local search is a fundamental search method in machine learning and AI. Given a non-differentiable objective, we perform local search to find its minimum. If the objective is differentiable, the gradient gives us the optimal search direction.
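
To make this concrete, below is a minimal local-search sketch in Python/NumPy. The function names, step size, iteration count, and toy objective are illustrative assumptions rather than anything specified on the slides.

import numpy as np

def local_search(f, x0, step=0.1, iters=1000, seed=0):
    # Hill-climbing local search: propose a random nearby point and keep it
    # only if it lowers the (possibly non-differentiable) objective f.
    rng = np.random.default_rng(seed)
    x, fx = x0, f(x0)
    for _ in range(iters):
        cand = x + step * rng.standard_normal(x.shape)
        f_cand = f(cand)
        if f_cand < fx:          # accept only improving moves
            x, fx = cand, f_cand
    return x, fx

# toy non-differentiable objective (sums of absolute values have kinks)
f = lambda x: np.abs(x - 3).sum() + np.abs(x + 1).sum()
x_best, f_best = local_search(f, np.zeros(2))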

  4. Neural network objective The objective is non-linear with multiple local minima, so optimization is much harder than for a convex objective. The standard approach is gradient descent: calculate the first derivative with respect to each hidden variable; for the inner layers we use the chain rule (see the Google sheet for derivations).
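
As a rough illustration of the chain-rule computation, here is a sketch of the gradients for a network with one sigmoid hidden layer and a squared-error loss. The architecture, loss, and variable names are assumptions chosen for illustration, not the derivation in the referenced Google sheet.

import numpy as np

def gradients(X, y, W1, w2):
    # Forward pass: one sigmoid hidden layer, linear output, mean squared error.
    z = X @ W1                        # pre-activations, shape (n, m)
    h = 1.0 / (1.0 + np.exp(-z))      # hidden activations
    yhat = h @ w2                     # predictions, shape (n,)
    d_out = (yhat - y) / len(y)       # dL/dyhat for the mean squared error
    # Backward pass: apply the chain rule layer by layer.
    g_w2 = h.T @ d_out                # dL/dw2
    g_h = np.outer(d_out, w2)         # dL/dh
    g_z = g_h * h * (1.0 - h)         # chain through the sigmoid
    g_W1 = X.T @ g_z                  # dL/dW1
    return g_W1, g_w2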

  5. Gradient descent So we run gradient descent until convergence; then what is the problem? It may converge to a local minimum and require random restarts. Overfitting has been a big problem for many years. How can we prevent overfitting? How can we explore the search space better without getting stuck in local minima?
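
A possible sketch of gradient descent run to convergence, wrapped in random restarts, is shown below. The learning rate, tolerance, and restart count are illustrative choices, not values from the slides.

import numpy as np

def gradient_descent(grad, x0, lr=0.01, tol=1e-6, max_iters=10000):
    # Plain gradient descent: step against the gradient until the step is tiny.
    x = x0.copy()
    for _ in range(max_iters):
        step = lr * grad(x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

def with_random_restarts(f, grad, dim, restarts=10, seed=0):
    # Restart from several random points and keep the best local minimum found.
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(restarts):
        x = gradient_descent(grad, rng.standard_normal(dim))
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x, best_f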

  6. Stochastic gradient descent A simple but beautifully powerful idea, popularized for large-scale learning by Léon Bottou around 2000. Original SGD: while not converged, select a single datapoint in order from the data, compute the gradient with just that one point, and update the parameters. Pros: broader search. Cons: the final solution may be poor, and it may be hard to converge.
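
One way this loop might look in code; the helper grad_point, which returns the gradient of the loss on a single example, is a hypothetical placeholder.

def sgd(grad_point, w0, X, y, lr=0.01, epochs=10):
    # Original SGD: sweep through the data and update on one point at a time.
    w = w0.copy()
    for _ in range(epochs):
        for i in range(len(y)):
            w = w - lr * grad_point(w, X[i], y[i])   # gradient from a single example
    return w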

  7. Stochastic gradient descent Mini-batch SGD: while not converged, select a random batch of datapoints, compute the gradient with the batch, and update the parameters. Mini-batch pros: generally a better solution with better convergence than using a single datapoint. Batch sizes are usually small.
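
A corresponding mini-batch sketch, with grad_batch as a hypothetical helper that returns the averaged gradient over a batch; the batch size and per-epoch shuffling are illustrative assumptions.

import numpy as np

def minibatch_sgd(grad_batch, w0, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    # Mini-batch SGD: shuffle each epoch, then update on small random batches.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            w = w - lr * grad_batch(w, X[batch], y[batch])  # averaged batch gradient
    return w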

  8. Learning rate Key to the search is the step size. Ideally we start with a somewhat large value (0.1 or 0.01) and reduce it by a power of 10 after a few epochs. An adaptive step size works best but may slow the search.
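
A sketch of such a step-decay schedule; the starting rate and drop interval are illustrative assumptions.

def step_decay(initial_lr=0.1, drop_every=5, factor=10.0):
    # Step decay: start with a fairly large rate and divide it by a power of 10
    # every few epochs.
    def lr_at(epoch):
        return initial_lr / (factor ** (epoch // drop_every))
    return lr_at

lr_at = step_decay()
# epochs 0-4: 0.1, epochs 5-9: 0.01, epochs 10-14: 0.001, ...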

  9. Dropout A simple method introduced in 2014 to prevent overfitting. Procedure: during training we decide with probability p whether to update a node's weights or not; p is typically set to 0.5. Highly effective in deep learning: it decreases overfitting, increases training time, and can be loosely interpreted as an ensemble of networks.
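
The slide describes deciding per node whether to apply an update; a common way to implement the same idea is an inverted-dropout mask on a layer's activations, sketched below as an illustrative assumption rather than the exact procedure from the slides.

import numpy as np

def dropout(h, p=0.5, training=True, seed=None):
    # Inverted dropout: keep each unit with probability p during training and
    # rescale by 1/p so expected activations match test time, when the layer
    # is left unchanged.
    if not training:
        return h
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) < p) / p    # 0 for dropped units, 1/p for kept units
    return h * mask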
