Understanding Neural Network Training and Structure


This text covers training a neural network: weight space symmetries, error back-propagation, and ways to improve convergence. It reviews the layer structure and notation of a neural network, framing training as the search for an optimal set of weights and offsets under nonlinear activation functions, and examines the symmetries of weight space that arise from sign flips and permutations of nodes within a layer.


Uploaded on Oct 04, 2024



Presentation Transcript


  1. Training a Neural Network Yang Zhang

  2. Skeleton: recap sigmoidal neural network; weight space symmetries; error back propagation; generalization to feed-forward network; ways to improve convergence; homework/lab 1 feedback

  3. Neural Network: Recap & Notation
     Bottom layer: z_1 = b_1 + W_1 x_0, x_1 = f(z_1)
     Middle layers: z_{l+1} = b_{l+1} + W_{l+1} x_l, x_{l+1} = f(z_{l+1})
     Top layer: z_L = b_L + W_L x_{L-1}, x_L = f(z_L)
     Objective: E(W_{1:L}, b_{1:L}) = Σ_{t=1}^{T} loss(x_L^{(t)}, y^{(t)})
     Representation power: the whole network computes x_L = g(x_0)
     [figure: layer diagram with nodes x_0^1 ... x_0^{N_0} through x_L^1 ... x_L^{N_L}]

  4. Neural Network: Recap & Notation
     Bottom layer: z_1 = b_1 + W_1 x_0, x_1 = f(z_1)
     Middle layers: z_{l+1} = b_{l+1} + W_{l+1} x_l, x_{l+1} = f(z_{l+1})
     Top layer: z_L = b_L + W_L x_{L-1}, x_L = f(z_L)
     Question: given the nonlinear activation functions, how do we find the optimal set of weights and offsets W_{1:L}, b_{1:L}?
     (W*, b*) = argmin_{W_{1:L}, b_{1:L}} E(W_{1:L}, b_{1:L}), where E(W_{1:L}, b_{1:L}) = Σ_{t=1}^{T} loss(x_L^{(t)}, y^{(t)})

  5. Skeleton: recap sigmoidal neural network; weight space symmetries; error back propagation; generalization to feed-forward network; ways to improve convergence; homework/lab 1 feedback

  6. Weight Space Symmetries
     Flip the sign of the weights pointing to & from a single node → output unchanged.
     [figure: two networks over layers x_{l-1}, x_l, x_{l+1}, identical except for the sign-flipped node]

  7. Weight Space Symmetries
     Flip the sign of the weights pointing to & from a single node → output unchanged.
     Swap the weights pointing to & from a pair of nodes → output unchanged.
     At least 2^{N_l} combinations of sign flips, and N_l! permutations of nodes, for a SINGLE layer of N_l nodes!
     [figure: two networks over layers x_{l-1}, x_l, x_{l+1}, identical except for the swapped pair of nodes]

  8. Weight Space Symmetries
     Flip the sign of the weights pointing to & from a single node → output unchanged.
     Swap the weights pointing to & from a pair of nodes → output unchanged.
     At least 2^{N_l} combinations of sign flips, and N_l! permutations of nodes, for a SINGLE layer of N_l nodes!
     Duda, Hart, Stork: Pattern Classification
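The two symmetries above are easy to verify numerically. A minimal sketch of my own (not from the slides), using tanh as the activation, since the sign-flip symmetry is exact for odd activations where f(-z) = -f(z):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: x1 = tanh(b1 + W1 @ x0), output y = b2 + W2 @ x1.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def forward(W1, b1, W2, b2, x0):
    return b2 + W2 @ np.tanh(b1 + W1 @ x0)

x0 = rng.normal(size=3)
y = forward(W1, b1, W2, b2, x0)

# Symmetry 1: flip the sign of every weight (and the bias) pointing to and
# from hidden node j; tanh(-z) = -tanh(z), so the two flips cancel.
j = 2
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[j, :] *= -1.0
b1f[j] *= -1.0
W2f[:, j] *= -1.0
assert np.allclose(forward(W1f, b1f, W2f, b2, x0), y)

# Symmetry 2: permute (here: swap) hidden nodes 0 and 1 together with all
# weights pointing to and from them.
p = [1, 0, 2, 3]
assert np.allclose(forward(W1[p], b1[p], W2[:, p], b2, x0), y)
```

With N_l = 4 hidden nodes, composing these moves already yields 2^4 · 4! = 384 weight vectors that compute the same function.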

  9. Skeleton: recap sigmoidal neural network; weight space symmetries; error back propagation; generalization to feed-forward network; ways to improve convergence; homework/lab 1 feedback

  10. Error back propagation algorithm
      Matrix representation:
      W_l[j, :] : all weights pointing to x_l^j
      W_l[:, i] : all weights pointing from x_{l-1}^i
      W_l[j, i] : the weight connecting x_{l-1}^i to x_l^j
      z_l = b_l + W_l x_{l-1}
      [figure: the matrix W_{l+1} laid out between layers x_l and x_{l+1}]

  11. Error back propagation algorithm
      Basic idea: gradient descent: W_l[j, i] ← W_l[j, i] + η ΔW_l[j, i], with ΔW_l[j, i] = -∂E/∂W_l[j, i].
      Define δ_l^j = ∂E/∂z_l^j, i.e. δ_l = ∂E/∂z_l.
      Since z_l = b_l + W_l x_{l-1}:
      ∂E/∂W_l[j, i] = δ_l^j x_{l-1}^i, i.e. ∂E/∂W_l = δ_l (x_{l-1})^T
      ∂E/∂b_l = δ_l
      How to compute δ_l?

  12. Error back propagation algorithm
      At the top layer: x_L = f(z_L), so
      δ_L = ∂E/∂z_L = ∂E/∂x_L ⊙ f'(z_L)

  13. Error back propagation algorithm
      In the middle layers: given δ_{l+1}, and z_{l+1} = b_{l+1} + W_{l+1} x_l:
      ∂E/∂x_l^i = Σ_j W_{l+1}[j, i] δ_{l+1}^j, i.e. ∂E/∂x_l = (W_{l+1})^T δ_{l+1}
      Then δ_l = ∂E/∂x_l ⊙ f'(z_l) = ((W_{l+1})^T δ_{l+1}) ⊙ f'(z_l)
      (⊙ : element-wise multiplication)

  14. Error back propagation algorithm
      Step 1: forward propagation: z_{l+1} = b_{l+1} + W_{l+1} x_l, x_{l+1} = f(z_{l+1})
      Step 2: error back propagation: δ_L = ∂E/∂x_L ⊙ f'(z_L); δ_l = ((W_{l+1})^T δ_{l+1}) ⊙ f'(z_l)
      Step 3: gradient calculation: ∂E/∂W_l = δ_l (x_{l-1})^T; ∂E/∂b_l = δ_l
      Step 4: gradient descent: W_l ← W_l - η ∂E/∂W_l; b_l ← b_l - η ∂E/∂b_l
      Complexity? O(#weights × #tokens)
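The four steps above can be sketched end to end, with a finite-difference check that the analytic gradient is correct. This is a minimal sketch; the layer sizes, the logistic activation, and the squared-error loss E = 0.5 ||x_L - y||^2 are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

sizes = [3, 5, 2]                                   # x0 -> hidden -> output
W = [rng.normal(scale=0.5, size=(n, m)) for m, n in zip(sizes, sizes[1:])]
b = [rng.normal(scale=0.5, size=n) for n in sizes[1:]]
x0, y = rng.normal(size=3), rng.normal(size=2)

def forward(W, b):
    xs, zs, x = [x0], [], x0
    for Wl, bl in zip(W, b):                        # Step 1: forward propagation
        z = bl + Wl @ x
        x = sigmoid(z)
        zs.append(z)
        xs.append(x)
    return xs, zs

xs, zs = forward(W, b)
delta = (xs[-1] - y) * xs[-1] * (1 - xs[-1])        # Step 2: top layer, f'(z) = x(1-x)
dW, db = [], []
for l in reversed(range(len(W))):
    dW.insert(0, np.outer(delta, xs[l]))            # Step 3: dE/dW_l = delta_l x_{l-1}^T
    db.insert(0, delta)
    if l > 0:
        delta = (W[l].T @ delta) * xs[l] * (1 - xs[l])  # Step 2: middle layers

# Finite-difference check on one weight.
eps, (l, j, i) = 1e-6, (0, 2, 1)
W[l][j, i] += eps
E_plus = 0.5 * np.sum((forward(W, b)[0][-1] - y) ** 2)
W[l][j, i] -= 2 * eps
E_minus = 0.5 * np.sum((forward(W, b)[0][-1] - y) ** 2)
W[l][j, i] += eps
assert abs((E_plus - E_minus) / (2 * eps) - dW[l][j, i]) < 1e-6
```

Step 4 would then be `W[l] -= eta * dW[l]` and `b[l] -= eta * db[l]` for each layer, repeated over the training set.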

  15. Skeleton: recap sigmoidal neural network; weight space symmetries; error back propagation; generalization to feed-forward network; ways to improve convergence; homework/lab 1 feedback

  16. What is a feed-forward network
      Given a set of ORDERED nodes x_1, x_2, ..., x_N:
      x_n = f( b_n + Σ_{m ∈ O(n)} w_{n,m} x_m )
      O(n): origin of n, the set of nodes that have weights pointing TO x_n
      D(n): destination of n, the set of nodes that have weights pointing FROM x_n
      A feed-forward network requires: m ∈ O(n) ⇒ m < n
      Is a neural network a kind of feed-forward network?
      [figure: a five-node directed graph x_1 ... x_5]

  17. Back Propagation
      Similarly, define δ_n = ∂E/∂z_n, where z_n = b_n + Σ_{m ∈ O(n)} w_{n,m} x_m.
      δ_n = f'(z_n) Σ_{m ∈ D(n)} w_{m,n} δ_m
      Then ∂E/∂w_{n,m} = δ_n x_m
      Complexity? O(#weights × #tokens)
      [figure: the same five-node directed graph x_1 ... x_5]
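The same recursion runs on an arbitrary feed-forward graph: visit the ordered nodes forward, then sweep the deltas backward. A sketch under my own assumptions (a hypothetical five-node graph with made-up edge weights, tanh activation, squared error at the last node):

```python
import numpy as np

# Nodes ordered 0..4; node 0 is the input. w[(n, m)] is the weight on the
# edge m -> n, so m is in O(n) and the ordering guarantees m < n.
w = {(1, 0): 0.5, (2, 0): -0.3, (2, 1): 0.8, (3, 1): 0.2,
     (3, 2): -0.7, (4, 2): 0.4, (4, 3): 1.1}
O = {n: [m for (nn, m) in w if nn == n] for n in range(1, 5)}
D = {m: [n for (n, mm) in w if mm == m] for m in range(5)}

f, fprime = np.tanh, lambda z: 1 - np.tanh(z) ** 2

def forward(x_in):
    x, z = {0: x_in}, {}
    for n in range(1, 5):                # ordered: every m in O(n) is done
        z[n] = sum(w[(n, m)] * x[m] for m in O[n])
        x[n] = f(z[n])
    return x, z

x, z = forward(0.7)

# Backward sweep: delta_n = f'(z_n) * sum_{m in D(n)} w_{m,n} delta_m
delta = {4: (x[4] - 1.0) * fprime(z[4])}    # E = 0.5 * (x_4 - 1)^2 at the output
for n in range(3, 0, -1):
    delta[n] = fprime(z[n]) * sum(w[(m, n)] * delta[m] for m in D[n])
grad = {(n, m): delta[n] * x[m] for (n, m) in w}   # dE/dw_{n,m} = delta_n x_m

# Finite-difference check on one edge.
eps = 1e-6
w[(2, 1)] += eps
E_plus = 0.5 * (forward(0.7)[0][4] - 1.0) ** 2
w[(2, 1)] -= 2 * eps
E_minus = 0.5 * (forward(0.7)[0][4] - 1.0) ** 2
w[(2, 1)] += eps
assert abs((E_plus - E_minus) / (2 * eps) - grad[(2, 1)]) < 1e-6
```

A layered network is the special case where O(n) is exactly the previous layer, which is why the matrix recursion on slide 13 drops out of this node-wise rule.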

  18. Skeleton: recap sigmoidal neural network; weight space symmetries; error back propagation; generalization to feed-forward network; ways to improve convergence; homework/lab 1 feedback

  19. What would be the possible problems?
      Converging to a local optimum.
      Stagnation problem: δ_l = ((W_{l+1})^T δ_{l+1}) ⊙ f'(z_l); if f'(z_l) ≈ 0, the gradients ∂E/∂W_l = δ_l (x_{l-1})^T and ∂E/∂b_l = δ_l vanish.
      Similar to the nonlinear MSE classifier?

  20. Ways to improve convergence: caveats of initializing weights; momentum

  21. Initializing weights
      Initial weights cannot be too large. Why? To avoid the stagnation problem.
      z_l = b_l + W_l x_{l-1}
      f(z) = 1 / (1 + e^{-z}), f'(z) = e^{-z} / (1 + e^{-z})^2
      Large weights push |z_l| into the flat tails of the sigmoid, so f'(z_l) ≈ 0 and δ_l = ((W_{l+1})^T δ_{l+1}) ⊙ f'(z_l) ≈ 0.
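The saturation effect is visible directly in the sigmoid's derivative. A two-line check (mine, not from the slides):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
fprime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # f'(z) = e^{-z} / (1 + e^{-z})^2

# Every back-propagated delta carries a factor f'(z); once |z| is large the
# factor is effectively zero and learning stagnates.
assert fprime(0.0) == 0.25        # maximum slope, at z = 0
assert fprime(10.0) < 1e-4        # saturated: gradient effectively dead
```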

  22. Initializing weights
      Can I initialize all the weights to 0?
      Weights cannot have uniform values; otherwise hidden nodes would become indistinguishable.
      Do not make any pair of hidden nodes indistinguishable.
      [figure: two networks whose identically initialized hidden nodes compute identical values]

  23. Initializing weights
      Generate random small weights. What does "small" mean?
      z_l = b_l + W_l x_{l-1}
      Normalize the input feature x within a small range.
      [figure: the sigmoid f(z) and its derivative f'(z) plotted over z ∈ [-5, 5]]
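One way to make "small" concrete: with normalized inputs, scaling each weight by 1/sqrt(fan_in) keeps the pre-activations z near the sigmoid's linear region. This scaling rule is a common heuristic of my own choosing, not stated in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out, rng):
    # Small random weights: scale 1/sqrt(fan_in) gives z = b + W x unit
    # variance when x is normalized to zero mean and unit variance.
    W = rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)   # biases may start at 0: the random W breaks symmetry
    return W, b

W, b = init_layer(100, 50, rng)
x = rng.normal(size=100)             # normalized input feature
z = b + W @ x
assert np.abs(z).mean() < 3.0        # pre-activations stay out of the saturated tails
```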

  24. Ways to improve convergence: caveats of initializing weights; momentum

  25. Momentum
      Likely to encounter plateaus along the convergence track.
      ΔW(t) = -η ∂E/∂W + α ΔW(t-1); W ← W + ΔW(t)
      Why do we need α? To damp the oscillation.
      [figure: a convergence track zigzagging across a narrow valley]
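The update rule above can be sketched on a small problem. The badly conditioned quadratic and the values of η and α below are my own illustrative choices:

```python
import numpy as np

# E(w) = 0.5 * (10 * w0^2 + w1^2): steep along w0, shallow along w1.
grad = lambda w: np.array([10.0 * w[0], w[1]])

def descend(eta, alpha, steps=200):
    w, dw = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw    # Delta w(t) = -eta grad + alpha Delta w(t-1)
        w = w + dw
    return w

w_plain = descend(eta=0.05, alpha=0.0)      # plain gradient descent
w_mom = descend(eta=0.05, alpha=0.9)        # with momentum

# The accumulated velocity damps the step-to-step zigzag across the steep
# direction while carrying progress along the shallow one; both runs converge.
assert np.linalg.norm(w_plain) < 1e-3
assert np.linalg.norm(w_mom) < 1e-3
```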

  26. Skeleton: recap sigmoidal neural network; weight space symmetries; error back propagation; generalization to feed-forward network; ways to improve convergence; homework/lab 1 feedback

  27. Feature extraction
      What does a length-22 FFT look like?
      [figure: the 22nd roots of unity on the unit circle in the complex plane, real part vs. imaginary part]
      X(1), X(2), ..., X(10), X(11), X(12), X(11)*, X(10)*, ..., X(2)*

  28. Feature extraction
      X(1), X(2), ..., X(10), X(11), X(12), X(11)*, X(10)*, ..., X(2)*
      X(1): the 0-frequency (DC) component; we would like to discard it. For an audio signal the DC component is ≈ 0, so log|X(1)| is very negative and highly volatile due to noise: log 0.01 = -4.6, log 1e-10 = -23. It contains no information about speech, only volatile garbage.
      X(12): the Nyquist frequency; we would like to discard it. Speech is a band-limited signal, so X(12) contains almost no speech energy, only noise.

  29. Dimension-wise Classification?
      Scheme 1: calculate the joint likelihood, perform ML classification.
      Scheme 2: calculate the likelihood of each dimension, perform ML classification on each dimension, and pick the majority vote.
      [figure: a two-dimensional classification example, two scatter plots over roughly (-6, 6) × (-6, 6)]
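The two schemes can be contrasted in a few lines. The two-class, two-dimensional Gaussian setup below is my own illustration, not the slides' example:

```python
import numpy as np

# Class 0 ~ N([-1, -1], I), class 1 ~ N([+1, +1], I), equal priors.
means = np.array([[-1.0, -1.0], [1.0, 1.0]])

def log_lik(x):
    # Shape (2 classes, 2 dims): per-dimension Gaussian log-likelihood,
    # unit variance, constants dropped.
    return -0.5 * (x - means) ** 2

x = np.array([0.5, 0.8])

# Scheme 1: joint likelihood is the product over (independent) dimensions,
# i.e. the sum of per-dimension log-likelihoods; then ML classification.
joint = log_lik(x).sum(axis=1).argmax()

# Scheme 2: ML classification of each dimension separately, majority vote.
votes = log_lik(x).argmax(axis=0)
majority = np.bincount(votes).argmax()
```

For this point both schemes pick class 1; they disagree when one dimension votes strongly the other way, since the vote discards the likelihoods' magnitudes.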

  30. MMI training
      J = Σ_t log p(y_t | x_t) = Σ_t [ log P_{y_t} p(x_t | y_t) - log( P_0 p(x_t | 0) + P_1 p(x_t | 1) ) ]
      Take the derivative with respect to θ_0, the parameters of the class-0 model:
      ∂J/∂θ_0 = Σ_t ∂/∂θ_0 log p(x_t | y_t) - Σ_t ∂/∂θ_0 log( P_0 p(x_t | 0) + P_1 p(x_t | 1) )
              = Σ_{t: y_t=0} ∂/∂θ_0 log p(x_t | 0) - Σ_t ∂/∂θ_0 log( P_0 p(x_t | 0) + P_1 p(x_t | 1) )

  31. MMI training
      An important result: ∂p/∂θ = p ∂log p/∂θ (from ∂log p/∂θ = (1/p) ∂p/∂θ).
      ∂/∂θ_0 log( P_0 p(x|0) + P_1 p(x|1) )
        = P_0 (∂p(x|0)/∂θ_0) / ( P_0 p(x|0) + P_1 p(x|1) )
        = [ P_0 p(x|0) / ( P_0 p(x|0) + P_1 p(x|1) ) ] ∂log p(x|0)/∂θ_0
        = p(0|x) ∂log p(x|0)/∂θ_0
      Final result:
      ∂J/∂θ_0 = Σ_{t: y_t=0} ∂log p(x_t|0)/∂θ_0 - Σ_t p(0|x_t) ∂log p(x_t|0)/∂θ_0
              = Σ_{t: y_t=0} (1 - p(0|x_t)) ∂log p(x_t|0)/∂θ_0 - Σ_{t: y_t=1} p(0|x_t) ∂log p(x_t|0)/∂θ_0
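The final result can be checked against a numeric derivative. This is a sketch under my own assumptions: unit-variance Gaussian class-conditional models whose means θ_0, θ_1 are the trainable parameters (the slides leave the model family abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
yl = (rng.random(8) < 0.5).astype(int)         # class labels y_t in {0, 1}
P = np.array([0.5, 0.5])                       # class priors P_0, P_1
theta = np.array([-1.0, 1.0])                  # class-conditional means

lik = lambda x, th: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2 * np.pi)

def J(theta):
    # J = sum_t log p(y_t | x_t)
    num = P[yl] * lik(x, theta[yl])
    den = P[0] * lik(x, theta[0]) + P[1] * lik(x, theta[1])
    return np.sum(np.log(num / den))

# Final result: dJ/dtheta0 = sum_t (1[y_t = 0] - p(0|x_t)) d log p(x_t|0)/dtheta0,
# which is the slides' two sums combined into one.
post0 = P[0] * lik(x, theta[0]) / (P[0] * lik(x, theta[0]) + P[1] * lik(x, theta[1]))
dlog0 = x - theta[0]                           # d/dtheta0 of log N(x; theta0, 1)
grad0 = np.sum(((yl == 0) - post0) * dlog0)

eps = 1e-6
num_grad = (J(theta + [eps, 0.0]) - J(theta - [eps, 0.0])) / (2 * eps)
assert abs(grad0 - num_grad) < 1e-5
```

The (1[y_t = 0] - p(0|x_t)) factor is what makes MMI discriminative: correctly and confidently classified tokens contribute almost nothing, while confusable tokens drive the update.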
