Essential Tips for Training Neural Networks from Scratch


Neural network training involves two key concerns: optimization, i.e. finding a good parameter set, and generalization, i.e. performing well on testing data. Initialization, learning-rate selection, and the choice among batch, stochastic, and mini-batch gradient descent play crucial roles in training neural networks efficiently.

  • Neural Networks
  • Training Tips
  • Optimization
  • Generalization
  • Gradient Descent

Presentation Transcript


  1. Tips for Training Neural Network: scratch the surface

  2. Two Concerns. There are two things you have to be concerned about. Optimization: can I find the best parameter set $\theta^*$ in a limited amount of time? Generalization: is the best parameter set $\theta^*$ good for the testing data as well?

  3. Initialization. For gradient descent, we need to pick an initial parameter set $\theta^0$. Do not set all the parameters in $\theta^0$ equal; set the parameters in $\theta^0$ randomly.
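
As a quick illustration of why all-equal initialization fails, here is a minimal NumPy sketch (the layer sizes and the 0.01 spread are illustrative assumptions, not values from the slides): if every weight starts at the same value, all neurons in a layer compute the same output and receive the same gradient, so they never differentiate.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 30, 10]      # hypothetical fully connected network

# Bad: all parameters equal. Every hidden neuron computes the same value and
# gets the same gradient, so the neurons stay identical after every update.
bad_weights = [np.full((n_out, n_in), 0.1)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Better: break the symmetry by picking theta^0 at random.
weights = [rng.normal(0.0, 0.01, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]
```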

  4. Learning Rate. Gradient descent update: $\theta^i = \theta^{i-1} - \eta\,\nabla C(\theta^{i-1})$. Set the learning rate $\eta$ carefully. Toy Example: a single linear neuron $y = z = wx + b$, whose target solution is $w^* = 1$, $b^* = 0$. Training Data (20 examples): x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5], y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5].
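
Below is a small Python sketch of batch gradient descent on this toy example, fitting $y \approx wx + b$ with a mean-squared-error cost; the learning rate $\eta = 0.01$ and the 3000-update budget are illustrative choices, not values from the slides.

```python
import numpy as np

# Toy training data from the slide (20 examples).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5,
              5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5])
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

w, b = 0.0, 0.0      # theta^0
eta = 0.01           # learning rate (illustrative value)

for step in range(3000):
    y_hat = w * x + b                        # model: y = z = w x + b
    grad_w = np.mean(2 * (y_hat - y) * x)    # dC/dw for a mean-squared-error cost
    grad_b = np.mean(2 * (y_hat - y))        # dC/db
    w -= eta * grad_w                        # theta^i = theta^{i-1} - eta * grad C
    b -= eta * grad_b

print(w, b)   # should approach the target w* ~ 1, b* ~ 0
```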

  5. Learning Rate. Toy Example: $\theta^i = \theta^{i-1} - \eta\,\nabla C(\theta^{i-1})$. (Figure: the error surface $C(w, b)$, with the start point and the target marked.)

  6. Learning Rate. Toy Example: different learning rates $\eta$ = 1.0, 0.01, 0.001 are compared on the error surface; depending on $\eta$, reaching the target takes on the order of 0.3k updates versus 3k updates.
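
A rough way to reproduce this comparison in code, reusing the x, y arrays and the update rule from the sketch above. The tolerance and the least-squares target computed with np.polyfit are my own additions for illustration; the exact update counts will not match the slide's figure.

```python
import numpy as np

# Reuses the toy arrays x, y defined in the gradient-descent sketch above.
w_star, b_star = np.polyfit(x, y, 1)    # least-squares target on the toy data

def updates_to_target(eta, tol=1e-3, max_updates=100_000):
    """Count gradient-descent updates until (w, b) is within tol of (w*, b*)."""
    w, b = 0.0, 0.0
    for step in range(1, max_updates + 1):
        y_hat = w * x + b
        w -= eta * np.mean(2 * (y_hat - y) * x)
        b -= eta * np.mean(2 * (y_hat - y))
        if not np.isfinite(w):           # learning rate too large: the updates diverge
            return None
        if abs(w - w_star) < tol and abs(b - b_star) < tol:
            return step
    return None

for eta in (1.0, 0.01, 0.001):
    print(eta, updates_to_target(eta))   # smaller eta -> roughly 10x more updates
```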

  7. Gradient Descent vs. Stochastic Gradient Descent. The total cost is the average error over the $R$ training examples, $C(\theta) = \frac{1}{R}\sum_{r=1}^{R} C^r(\theta)$, where $C^r(\theta)$ is the error on example $(x^r, y^r)$, e.g. $C^r(\theta) = \|y^r - f(x^r; \theta)\|^2$. Gradient descent: $\theta^i = \theta^{i-1} - \eta\,\nabla C(\theta^{i-1})$, with $\nabla C(\theta^{i-1}) = \frac{1}{R}\sum_{r=1}^{R}\nabla C^r(\theta^{i-1})$. Stochastic gradient descent: pick an example $x^r$ and update $\theta^i = \theta^{i-1} - \eta\,\nabla C^r(\theta^{i-1})$. If all examples $x^r$ have equal probability of being picked, then $E[\nabla C^r(\theta^{i-1})] = \frac{1}{R}\sum_{r=1}^{R}\nabla C^r(\theta^{i-1}) = \nabla C(\theta^{i-1})$, so the stochastic update matches the gradient descent update in expectation.
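
A minimal sketch of the stochastic update on the same toy problem, reusing the x, y arrays defined earlier; the learning rate 0.005 and the number of updates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b, eta = 0.0, 0.0, 0.005                  # illustrative learning rate

# Reuses the toy arrays x, y from the gradient-descent sketch above.
for step in range(20 * 50):                  # 50 epochs' worth of single-example updates
    r = rng.integers(len(x))                 # pick x^r with equal probability
    y_hat = w * x[r] + b
    w -= eta * 2 * (y_hat - y[r]) * x[r]     # gradient of C^r only, not of the total cost
    b -= eta * 2 * (y_hat - y[r])

print(w, b)   # noisy, but close to the least-squares solution
```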

  8. Stochastic Gradient Descent. Training data: $(x^1, y^1), (x^2, y^2), \dots, (x^r, y^r), \dots, (x^R, y^R)$. Starting at $\theta^0$: pick $x^1$ and update $\theta^1 = \theta^0 - \eta\,\nabla C^1(\theta^0)$; pick $x^2$ and update $\theta^2 = \theta^1 - \eta\,\nabla C^2(\theta^1)$; ...; pick $x^r$ and update $\theta^r = \theta^{r-1} - \eta\,\nabla C^r(\theta^{r-1})$; ...; pick $x^R$ and update $\theta^R = \theta^{R-1} - \eta\,\nabla C^R(\theta^{R-1})$; then pick $x^1$ again and update $\theta^{R+1} = \theta^R - \eta\,\nabla C^1(\theta^R)$. Having seen all the examples once is one epoch.

  9. Gradient Descent vs. Stochastic Gradient Descent (Toy Example). Gradient descent sees all examples before each update, while stochastic gradient descent sees only one example per update, so it updates 20 times in an epoch on the 20-example toy set. (Figure: error-surface trajectories of the two methods after 1 epoch.)

  10. Mini-Batch Gradient Descent. Gradient descent: $\theta^i = \theta^{i-1} - \eta\,\nabla C(\theta^{i-1})$, where $\nabla C(\theta^{i-1}) = \frac{1}{R}\sum_{r=1}^{R}\nabla C^r(\theta^{i-1})$. Stochastic gradient descent: pick an example $x^r$, $\theta^i = \theta^{i-1} - \eta\,\nabla C^r(\theta^{i-1})$. Mini-batch gradient descent: pick $B$ examples as a batch $b$ ($B$ is the batch size), average the gradients of the examples in the batch, and update $\theta^i = \theta^{i-1} - \eta\,\frac{1}{B}\sum_{x^r \in b}\nabla C^r(\theta^{i-1})$. Shuffle your data.
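
A sketch of mini-batch gradient descent with per-epoch shuffling on the same toy data (again reusing x and y; the batch size B = 4 and the learning rate are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = 0.0, 0.0
eta, B = 0.01, 4                             # learning rate and batch size (illustrative)

# Reuses the toy arrays x, y from the gradient-descent sketch above.
for epoch in range(100):
    order = rng.permutation(len(x))          # shuffle your data every epoch
    for start in range(0, len(x), B):
        batch = order[start:start + B]       # pick B examples as a batch b
        y_hat = w * x[batch] + b
        w -= eta * np.mean(2 * (y_hat - y[batch]) * x[batch])  # average gradient over b
        b -= eta * np.mean(2 * (y_hat - y[batch]))

print(w, b)
```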

  11. Real Example: Handwriting Digit Classification. (Figure: gradient descent compared with stochastic gradient descent, i.e. batch size = 1, on the digit classification task.)

  12. Two Concerns. There are two things you have to be concerned about. Optimization: can I find the best parameter set $\theta^*$ in a limited amount of time? Generalization: is the best parameter set $\theta^*$ good for the testing data as well?

  13. Generalization. You pick a "best" parameter set $\theta^*$. On the Training Data $\{(x^r, y^r)\}$ you get $y^r \approx f(x^r; \theta^*)$. However, on the Testing Data $\{x^u\}$, $y^u \approx f(x^u; \theta^*)$ does not necessarily hold: training data and testing data may have different distributions.

  14. Panacea. Have more training data if possible ... or create more training data (?). Handwriting recognition: created training data can be generated from the original training data, e.g. the slide shows an original image and a created one obtained by a shift of 15.
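
One way to realize "create more training data" for handwriting images is to add shifted copies of each image. The sketch below is a generic NumPy version using 1-pixel shifts in four directions; the helper names and the shift amounts are my own illustrative choices and do not reproduce the slide's "shift 15" figure.

```python
import numpy as np

def shift_image(img, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, padding the exposed border with zeros."""
    shifted = np.zeros_like(img)
    h, w = img.shape
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return shifted

def augment(images, labels, shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Append shifted copies of each image (images: (N, 28, 28), labels: (N,))."""
    extra_imgs = [shift_image(img, dx, dy) for img in images for dx, dy in shifts]
    extra_lbls = [lbl for lbl in labels for _ in shifts]
    return (np.concatenate([images, np.array(extra_imgs)]),
            np.concatenate([labels, np.array(extra_lbls)]))
```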

  15. Reference. Chapter 3 of Neural Networks and Deep Learning: http://neuralnetworksanddeeplearning.com/chap3.html

  16. Appendix

  17. Overfitting. A function that performs well on the training data does not necessarily perform well on the testing data. Training Data $\{(x^r, y^r)\}$: $y^r \approx f(x^r)$. Testing Data $\{(x^u, y^u)\}$: $y^u$ may be far from $f(x^u)$. Overfitting in our daily life: memorizing the answers of the previous examples ...

  18. A joke about overfitting: http://xkcd.com/1122/

  19. Initialization. For gradient descent, we need to pick an initial parameter set $\theta^0$. Do not set all the parameters in $\theta^0$ equal, or your parameters will always stay equal no matter how many times you update them. Randomly pick $\theta^0$. If the last layer has more neurons, the initialization values should be smaller; e.g. if it has $N_{l-1}$ neurons, draw $w^l_{ij} \sim N(0,\ 1/N_{l-1})$ or $w^l_{ij} \sim U(-1/\sqrt{N_{l-1}},\ 1/\sqrt{N_{l-1}})$.
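
A sketch of this scaled initialization (the network architecture below is a hypothetical example): the spread of each layer's weights shrinks as the number of incoming neurons $N_{l-1}$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 100, 30, 10]             # hypothetical architecture

weights, biases = [], []
for n_prev, n_cur in zip(layer_sizes[:-1], layer_sizes[1:]):
    scale = 1.0 / np.sqrt(n_prev)            # shrinks as N_{l-1} grows
    # Gaussian version: w ~ N(0, 1/N_{l-1}), i.e. standard deviation 1/sqrt(N_{l-1}).
    w = rng.normal(0.0, scale, size=(n_cur, n_prev))
    # Uniform alternative: w = rng.uniform(-scale, scale, size=(n_cur, n_prev))
    weights.append(w)
    biases.append(np.zeros(n_cur))
```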

  20. MNIST. The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git See also http://yann.lecun.com/exdb/mnist/ and http://www.deeplearning.net/tutorial/gettingstarted.html
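
After cloning the repository, the book's loader and network classes can be used roughly as follows. This mirrors the usage example from Neural Networks and Deep Learning; note that the original repository targets Python 2, so the exact invocation may differ in Python 3 forks.

```python
# Run from the repository's src/ directory after cloning.
import mnist_loader
import network

# The loader splits the 60,000 training images into 50,000 training
# and 10,000 validation examples, plus the 10,000 test images.
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

# A 784-30-10 fully connected network trained with mini-batch SGD:
# 30 epochs, batch size 10, learning rate eta = 3.0 (the book's example settings).
net = network.Network([784, 30, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```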

  21. MNIST. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence.

  22. Early Stopping. (Figure: error as a function of training iteration; stop training at the iteration where the error on held-out data stops improving, even if the training error is still decreasing.)
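
A minimal sketch of early stopping, assuming hypothetical model.train_one_epoch and model.evaluate hooks: keep the parameters from the epoch with the best held-out error and stop once it has not improved for a few epochs.

```python
import copy

def train_with_early_stopping(model, train_data, valid_data, max_epochs=100, patience=5):
    """Stop when held-out error has not improved for `patience` epochs.
    `model.train_one_epoch` and `model.evaluate` are hypothetical hooks."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        model.train_one_epoch(train_data)
        error = model.evaluate(valid_data)
        if error < best_error:
            best_error, best_model = error, copy.deepcopy(model)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break              # stop early: held-out error stopped improving
    return best_model
```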

  23. Difficulty of Deep: lower layer cannot plan.
