Understanding Neural Network Training and Structure
This text covers training a neural network: layer structure and notation, weight space symmetries, error back-propagation, and ways to improve convergence. It poses the central question of finding the optimal set of weights and offsets under nonlinear activation functions, and examines the sign-flip and node-permutation symmetries of weight space.
Presentation Transcript
Training a Neural Network (Yang Zhang)
Skeleton
- Recap: sigmoidal neural network
- Weight space symmetries
- Error back-propagation
- Generalization to feed-forward networks
- Ways to improve convergence
- Homework/lab 1 feedback
Neural Network: Recap & Notation

Bottom layer:  $a_1 = b_1 + W_1 z_0$,  $z_1 = \sigma(a_1)$
Middle layer:  $a_{l+1} = b_{l+1} + W_{l+1} z_l$,  $z_{l+1} = \sigma(a_{l+1})$
Top layer:     $a_L = b_L + W_L z_{L-1}$,  $z_L = \sigma(a_L)$

[Figure: a layered network with input nodes $z_0^1, \dots, z_0^{M_0}$, hidden nodes $z_l^1, \dots, z_l^{M_l}$ in layer $l$, and output nodes at layer $L$; the input is $z_0 = x$.]

Training error over the data set $(x_{1:N}, y_{1:N})$:
$E(W_{1:L}, b_{1:L}) = \sum_{n=1}^{N} \mathrm{loss}\big(z_L(x_n), y_n\big)$

Representation power (recap): with enough hidden units, a sigmoidal network of this form can approximate any continuous function of the input $z_0 = x$.
Neural Network: Recap & Notation

Question: Given the nonlinear activation functions, how do we find the optimal set of weights and offsets $\{W_l, b_l\}$?

Bottom layer:  $a_1 = b_1 + W_1 z_0$,  $z_1 = \sigma(a_1)$
Middle layer:  $a_{l+1} = b_{l+1} + W_{l+1} z_l$,  $z_{l+1} = \sigma(a_{l+1})$
Top layer:     $a_L = b_L + W_L z_{L-1}$,  $z_L = \sigma(a_L)$

$\{W_l^*, b_l^*\} = \operatorname{argmin}_{\{W_l, b_l\}} E(W_{1:L}, b_{1:L})$, where $E(W_{1:L}, b_{1:L}) = \sum_{n=1}^{N} \mathrm{loss}\big(z_L(x_n), y_n\big)$ over the data $(x_{1:N}, y_{1:N})$.
Weight Space Symmetries

- Flip the signs of the weights pointing to and from a single node: output unchanged (for an odd activation such as $\tanh$, negating a node's inputs negates its output, and negating its outgoing weights cancels the change).
- Swap the weights pointing to and from a pair of nodes: output unchanged (the two nodes simply trade roles).

At least $2^{M_l}$ combinations of sign flips, and $M_l!$ permutations of nodes, for a SINGLE layer! So the error surface has many equivalent minima.

[Figure: pairs of networks, identical except for a sign-flipped hidden node or a permuted pair of hidden nodes in layer $l$.]

Reference: Duda, Hart, Stork: Pattern Classification
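The sign-flip symmetry is easy to verify numerically. The sketch below is my own illustration, not code from the slides: it uses a tiny 2-2-1 network with $\tanh$ hidden units, for which the symmetry is exact because $\tanh$ is odd (for the logistic sigmoid the same idea holds up to a compensating bias shift in the next layer, since $\sigma(-a) = 1 - \sigma(a)$).

```python
import math

def forward(w1, b1, w2, b2, x):
    # hidden layer: h[j] = tanh(b1[j] + sum_i w1[j][i] * x[i])
    h = [math.tanh(b1[j] + sum(w1[j][i] * x[i] for i in range(len(x))))
         for j in range(len(b1))]
    # linear output: y = b2 + sum_j w2[j] * h[j]
    return b2 + sum(w2[j] * h[j] for j in range(len(h)))

# a small 2-input, 2-hidden-unit, 1-output network (arbitrary example weights)
w1 = [[0.5, -1.2], [0.8, 0.3]]
b1 = [0.1, -0.4]
w2 = [1.5, -0.7]
b2 = 0.2
x = [0.9, -0.6]

y = forward(w1, b1, w2, b2, x)

# flip the sign of every weight (and the bias) into hidden unit 0,
# and of the weight out of it
w1_f = [[-w for w in w1[0]], w1[1]]
b1_f = [-b1[0], b1[1]]
w2_f = [-w2[0], w2[1]]

y_flipped = forward(w1_f, b1_f, w2_f, b2, x)
```

Because $\tanh(-a) = -\tanh(a)$, hidden unit 0 outputs $-h_0$, and the negated outgoing weight restores the product, so `y_flipped` equals `y`.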
Error back-propagation algorithm

Matrix representation of $W_l$:
- $W_l[i,:]$: all weights pointing to node $z_l^i$
- $W_l[:,j]$: all weights pointing from node $z_{l-1}^j$
- $W_l[i,j]$: the weight connecting $z_{l-1}^j$ to $z_l^i$

so that $a_l = b_l + W_l z_{l-1}$ computes every pre-activation in layer $l$ at once.
Error back-propagation algorithm

Basic idea: gradient descent,
$W_l[i,j] \leftarrow W_l[i,j] - \eta \, \partial E / \partial W_l[i,j]$

Define the error signal $\delta_l^i = \partial E / \partial a_l^i$, where $a_l = b_l + W_l z_{l-1}$.

Then, by the chain rule,
$\partial E / \partial W_l[i,j] = (\partial E / \partial a_l^i)(\partial a_l^i / \partial W_l[i,j]) = \delta_l^i \, z_{l-1}^j$
and likewise
$\partial E / \partial b_l^i = \delta_l^i$.

How to compute $\delta_l$?
Error back-propagation algorithm

At the top layer: $z_L = \sigma(a_L)$, so
$\delta_L^i = \frac{\partial E}{\partial a_L^i} = \frac{\partial E}{\partial z_L^i} \, \sigma'(a_L^i)$
Error back-propagation algorithm

In the middle layers: given $\delta_{l+1}$ and $a_{l+1} = b_{l+1} + W_{l+1} z_l$,
$\frac{\partial E}{\partial z_l^i} = W_{l+1}[:,i]^\top \delta_{l+1}$

Then
$\delta_l = \sigma'(a_l) \odot \big(W_{l+1}^\top \delta_{l+1}\big)$

($\odot$: element-wise multiplication)
Error back-propagation algorithm

Step 1: Forward propagation
$a_{l+1} = b_{l+1} + W_{l+1} z_l$,  $z_{l+1} = \sigma(a_{l+1})$

Step 2: Error back-propagation
$\delta_L = \frac{\partial E}{\partial z_L} \odot \sigma'(a_L)$,  $\delta_l = \sigma'(a_l) \odot \big(W_{l+1}^\top \delta_{l+1}\big)$

Step 3: Gradient calculation
$\frac{\partial E}{\partial W_l} = \delta_l \, z_{l-1}^\top$,  $\frac{\partial E}{\partial b_l} = \delta_l$

Complexity? $O(\#\text{weights} \times \#\text{tokens})$

Step 4: Gradient descent
$W_l \leftarrow W_l - \eta \frac{\partial E}{\partial W_l}$,  $b_l \leftarrow b_l - \eta \frac{\partial E}{\partial b_l}$
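The four steps can be sketched end to end in plain Python. This is a minimal illustration under my own assumptions (a 2-3-1 sigmoid network, squared error $E = \frac{1}{2}(y-t)^2$, online gradient descent on the OR function), not code from the lecture:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_step(W1, b1, W2, b2, x, t, eta):
    """One forward + backward pass; returns the squared error before the update."""
    # Step 1: forward propagation  a = b + W z,  z = sigmoid(a)
    a1 = [b1[j] + sum(W1[j][i] * x[i] for i in range(len(x))) for j in range(len(b1))]
    z1 = [sigmoid(a) for a in a1]
    a2 = b2[0] + sum(W2[0][j] * z1[j] for j in range(len(z1)))
    y = sigmoid(a2)
    # Step 2: error back-propagation (deltas = dE/da)
    d2 = (y - t) * y * (1.0 - y)                    # top layer
    d1 = [z1[j] * (1.0 - z1[j]) * W2[0][j] * d2     # middle layer
          for j in range(len(z1))]
    # Steps 3+4: gradients dE/dW[i,j] = delta_i * z_{l-1}[j], then descend
    for j in range(len(z1)):
        W2[0][j] -= eta * d2 * z1[j]
    b2[0] -= eta * d2
    for j in range(len(b1)):
        for i in range(len(x)):
            W1[j][i] -= eta * d1[j] * x[i]
        b1[j] -= eta * d1[j]
    return 0.5 * (y - t) ** 2

random.seed(0)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
b2 = [0.0]
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # the OR function
for epoch in range(2000):
    for x, t in data:
        train_step(W1, b1, W2, b2, x, t, eta=0.5)
err = sum(train_step(W1, b1, W2, b2, x, t, eta=0.0) for x, t in data)
```

With $\eta = 0$ the final call only measures the error. Each pass touches every weight once forward and once backward, matching the $O(\#\text{weights} \times \#\text{tokens})$ complexity.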
What is a feed-forward network?

Given a set of ORDERED nodes $z^{(1)}, z^{(2)}, \dots, z^{(V)}$:
$z^{(i)} = \sigma\big(b^{(i)} + \sum_{j \in O(i)} w^{(j,i)} z^{(j)}\big)$

- $O(i)$: origin of $i$, the set of nodes that have weights pointing TO $z^{(i)}$
- $D(i)$: destination of $i$, the set of nodes that have weights pointing FROM $z^{(i)}$

A feed-forward network requires $j < i$ for every $j \in O(i)$: weights only point forward in the ordering.

Is a layered neural network a kind of feed-forward network?

[Figure: five ordered nodes $z^{(1)}, \dots, z^{(5)}$ connected by forward-pointing edges.]
Back Propagation

Similarly, define $\delta^{(i)} = \partial E / \partial a^{(i)}$, with $a^{(i)} = b^{(i)} + \sum_{j \in O(i)} w^{(j,i)} z^{(j)}$.

Then, processing nodes in reverse order,
$\delta^{(i)} = \sigma'(a^{(i)}) \sum_{j \in D(i)} w^{(i,j)} \delta^{(j)}$
and
$\frac{\partial E}{\partial w^{(i,j)}} = \delta^{(j)} z^{(i)}$

Complexity? $O(\#\text{weights} \times \#\text{tokens})$
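The generalized recursion can be sketched directly on a DAG. The 5-node graph, $\tanh$ units, and the choice $E = z^{(\text{output})}$ (so the deltas are plain derivatives of the output) are my own assumptions for illustration; the analytic gradient is checked against a finite difference:

```python
import math

def tanh_prime(a):
    t = math.tanh(a)
    return 1.0 - t * t

# hypothetical tiny DAG: nodes 0,1 are inputs, node 4 is the output;
# edge (i, j) with i < j carries weight w[(i, j)]
V = 5
b = {2: 0.1, 3: -0.2, 4: 0.05}
w = {(0, 2): 0.4, (1, 2): -0.3, (0, 3): 0.7,
     (2, 3): 0.5, (2, 4): -0.6, (3, 4): 0.9}

def forward(x, w):
    z, a = {0: x[0], 1: x[1]}, {}
    for i in range(2, V):                      # nodes in order
        a[i] = b[i] + sum(w[(j, i)] * z[j] for j in range(i) if (j, i) in w)
        z[i] = math.tanh(a[i])
    return z, a

def backward(z, a, w):
    """delta_i = sigma'(a_i) * sum_{j in D(i)} w[i,j] * delta_j."""
    delta = {V - 1: tanh_prime(a[V - 1])}      # dE/da at the output (E = z_out)
    for i in range(V - 2, 1, -1):              # nodes in REVERSE order
        delta[i] = tanh_prime(a[i]) * sum(w[(i, j)] * delta[j]
                                          for j in range(i + 1, V) if (i, j) in w)
    # dE/dw[i,j] = delta_j * z_i
    return {(i, j): delta[j] * z[i] for (i, j) in w}

x = [1.0, -2.0]
z, a = forward(x, w)
grads = backward(z, a, w)

# sanity-check one weight against a finite difference
eps = 1e-6
w2 = dict(w)
w2[(2, 3)] += eps
z2, _ = forward(x, w2)
fd = (z2[V - 1] - z[V - 1]) / eps
```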
What would be the possible problems?

- Converging to a local optimum.
- Stagnation: when $\sigma'(a_l) \approx 0$ in $\delta_l = \sigma'(a_l) \odot \big(W_{l+1}^\top \delta_{l+1}\big)$, the gradients $\frac{\partial E}{\partial W_l} = \delta_l z_{l-1}^\top$ and $\frac{\partial E}{\partial b_l} = \delta_l$ vanish and learning stalls.

Similar to the nonlinear MSE classifier?
Ways to improve convergence
- Caveats of initializing weights
- Momentum
Initializing weights

Initial weights cannot be too large. Why? To avoid the stagnation problem.

With $a = b + Wz$ and $\sigma(a) = \frac{1}{1 + e^{-a}}$,
$\sigma'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = \sigma(a)\big(1 - \sigma(a)\big) \le \tfrac{1}{4}$,
which is close to 0 whenever $|a|$ is large. Large initial weights produce large pre-activations, so $\delta_l = \sigma'(a_l) \odot \big(W_{l+1}^\top \delta_{l+1}\big) \approx 0$ and learning stalls.
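A quick numerical illustration (the example pre-activation values are my own): the logistic slope peaks at $0.25$ and collapses for large $|a|$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)        # = e^{-a} / (1 + e^{-a})^2

slope_peak = sigmoid_prime(0.0)    # maximum slope: exactly 0.25
slope_small = sigmoid_prime(0.5)   # healthy gradient (~0.235)
slope_large = sigmoid_prime(10.0)  # ~4.5e-5: deltas vanish, learning stalls
```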
Initializing weights

Can I initialize all the weights to 0? No: weights cannot have uniform values. Otherwise every hidden node in a layer computes the same function and receives the same gradient update, so the nodes remain indistinguishable forever. Do not make any pair of hidden nodes indistinguishable.
Initializing weights

Generate random small weights. What does "small" mean? With $a = b + Wz$: small enough that the pre-activations stay in the near-linear region of the sigmoid, where $\sigma'(a)$ is largest.

Also normalize the input features $x$ within a small range.

[Figure: the sigmoid $\sigma(a)$ and its derivative $\sigma'(a)$ over $a \in [-5, 5]$; the derivative peaks at $0.25$ near $a = 0$.]
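A minimal sketch of both recommendations; the scale of $0.1$ and the zero-mean/unit-variance normalization are my own illustrative choices, not prescriptions from the slides:

```python
import random

def init_layer(n_out, n_in, scale=0.1, rng=random):
    """Small random weights keep pre-activations a = b + W z near the
    sigmoid's linear region, where the derivative is largest."""
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def normalize(xs):
    """Shift and scale each input feature to zero mean and unit variance."""
    n, d = len(xs), len(xs[0])
    mean = [sum(x[i] for x in xs) / n for i in range(d)]
    var = [sum((x[i] - mean[i]) ** 2 for x in xs) / n for i in range(d)]
    return [[(x[i] - mean[i]) / (var[i] ** 0.5 or 1.0) for i in range(d)]
            for x in xs]

random.seed(1)
W, b = init_layer(3, 2)
xs = normalize([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
```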
Momentum

The convergence track is likely to encounter plateaus. Momentum update:
$W_l \leftarrow W_l + \Delta W_l(t)$, where
$\Delta W_l(t) = \mu \, \Delta W_l(t-1) - \eta \, \frac{\partial E}{\partial W_l}$

Why do we need $\mu$? It carries the update through plateaus and damps the oscillation.
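The update can be sketched as follows; the stretched quadratic bowl, step size, and $\mu = 0.9$ are my own illustrative choices. On an elongated valley, $\Delta W(t)$ averages out the oscillating component of the gradient while accumulating the consistent one:

```python
def momentum_step(w, grad, velocity, eta=0.05, mu=0.9):
    """Delta(t) = mu * Delta(t-1) - eta * dE/dw ;  w <- w + Delta(t)"""
    v = [mu * vi - eta * gi for vi, gi in zip(velocity, grad)]
    return [wi + vi for wi, vi in zip(w, v)], v

# gradient of a stretched quadratic bowl E(w) = 5*w0^2 + 0.5*w1^2,
# whose narrow direction makes plain gradient descent oscillate
def grad_E(w):
    return [10.0 * w[0], 1.0 * w[1]]

w, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w, v = momentum_step(w, grad_E(w), v)
```

After 200 steps both coordinates have converged close to the minimum at the origin.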
Feature extraction

What does a length-22 FFT look like?

[Figure: the 22 FFT bins plotted in the complex plane (real vs. imaginary part) on the unit circle, illustrating the conjugate symmetry of the spectrum of a real signal.]

$X(1), X(2), \dots, X(10), X(11), X(12), \bar{X}(11), \bar{X}(10), \dots, \bar{X}(2)$
Feature extraction

$X(1), X(2), \dots, X(10), X(11), X(12), \bar{X}(11), \bar{X}(10), \dots, \bar{X}(2)$

$X(1)$: the 0-frequency (DC) component. We would like to discard it. For an audio signal the DC component is essentially 0, so due to noise $\log|X(1)|$ is very negative and highly volatile (e.g. $\log 0.01 = -4.6$, $\log 10^{-10} = -23$). It contains no information about the speech; only volatile garbage.

$X(12)$: the Nyquist frequency. We would like to discard it too. Speech is a band-limited signal, so $X(12)$ contains almost no speech energy, only noise.
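The conjugate symmetry and the special roles of the DC and Nyquist bins can be checked directly. The test frame below is a hypothetical signal of my own choosing, and the code uses zero-based indexing, so `X[0]` is the slides' $X(1)$ and `X[11]` their $X(12)$:

```python
import cmath
import math

def dft(x):
    """Plain DFT: X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# a length-22 real frame: a DC offset plus two sinusoids (arbitrary example)
N = 22
frame = [0.3 + math.sin(2 * math.pi * 3 * n / N)
         + 0.5 * math.cos(2 * math.pi * 5 * n / N) for n in range(N)]
X = dft(frame)

# conjugate symmetry of a real signal: X[N-k] = conj(X[k]), so bins 1..10
# carry all the information; X[0] (DC) and X[11] (Nyquist) are discarded,
# leaving 10 usable magnitude features
features = [abs(X[k]) for k in range(1, N // 2)]
```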
Dimension-wise Classification?

Scheme 1: calculate the joint likelihood over all dimensions, perform ML classification.
Scheme 2: calculate the likelihood of each dimension separately, perform ML classification of each dimension, then pick the majority vote.

[Figure: a two-dimensional classification example comparing the two schemes on scatter data.]
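The two schemes can be sketched with hypothetical per-class, per-dimension Gaussian models; the class means and unit variance below are my own example values, not the data behind the slide's figure:

```python
import math

def gauss_loglik(x, mean, var=1.0):
    """Log-likelihood of a scalar under a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

means = {0: [-1.0, -1.0], 1: [1.0, 1.0]}   # per-class means, one per dimension

def classify_joint(x):
    """Scheme 1: ML classification on the joint likelihood
    (dimensions assumed independent, so log-likelihoods add)."""
    scores = {c: sum(gauss_loglik(xi, mi) for xi, mi in zip(x, means[c]))
              for c in means}
    return max(scores, key=scores.get)

def classify_majority(x):
    """Scheme 2: ML classification per dimension, then majority vote."""
    votes = [max(means, key=lambda c: gauss_loglik(xi, means[c][i]))
             for i, xi in enumerate(x)]
    return max(set(votes), key=votes.count)   # ties resolved arbitrarily
```

Scheme 1 lets a confident dimension outvote an uncertain one; Scheme 2 gives every dimension exactly one equal vote.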
MMI training

Objective: the conditional log-likelihood
$J = \sum_n \log p(y_n \mid x_n) = \sum_n \Big[ \log P(y_n)\, p(x_n \mid y_n) - \log\big( P(0)\, p(x_n \mid 0) + P(1)\, p(x_n \mid 1) \big) \Big]$

Gradient with respect to $\theta_0$, the parameters of the class-0 model $p(x \mid 0)$:
$\frac{\partial J}{\partial \theta_0} = \sum_{n:\, y_n = 0} \frac{\partial}{\partial \theta_0} \log p(x_n \mid 0) \;-\; \sum_n \frac{\partial}{\partial \theta_0} \log\big( P(0)\, p(x_n \mid 0) + P(1)\, p(x_n \mid 1) \big)$

(The first sum keeps only the tokens with $y_n = 0$, because $\log P(y_n)\, p(x_n \mid y_n)$ depends on $\theta_0$ only for those tokens.)
MMI training

For the second term,
$\frac{\partial}{\partial \theta_0} \log\big( P(0)\, p(x_n \mid 0) + P(1)\, p(x_n \mid 1) \big) = \frac{P(0)\, \partial p(x_n \mid 0) / \partial \theta_0}{P(0)\, p(x_n \mid 0) + P(1)\, p(x_n \mid 1)}$

An important result:
$\frac{\partial p}{\partial \theta} = p \, \frac{\partial \log p}{\partial \theta}$

so the term becomes
$\frac{P(0)\, p(x_n \mid 0)}{P(0)\, p(x_n \mid 0) + P(1)\, p(x_n \mid 1)} \, \frac{\partial \log p(x_n \mid 0)}{\partial \theta_0} = p(y = 0 \mid x_n)\, \frac{\partial \log p(x_n \mid 0)}{\partial \theta_0}$

Final result:
$\frac{\partial J}{\partial \theta_0} = \sum_{n:\, y_n = 0} \big( 1 - p(0 \mid x_n) \big) \frac{\partial \log p(x_n \mid 0)}{\partial \theta_0} \;-\; \sum_{n:\, y_n = 1} p(0 \mid x_n)\, \frac{\partial \log p(x_n \mid 0)}{\partial \theta_0}$
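The "important result" $\partial p / \partial \theta = p \, \partial \log p / \partial \theta$ is just the chain rule applied to $\log$, and is easy to confirm numerically. The Gaussian likelihood and the example point below are my own illustration:

```python
import math

def gauss_pdf(x, mu, var=1.0):
    """1-D Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x, mu = 0.7, 0.2
dlogp_dmu = (x - mu)                      # analytic d log N(x; mu, 1) / d mu
analytic = gauss_pdf(x, mu) * dlogp_dmu   # the identity: dp/dmu = p * dlogp/dmu

# central finite difference of p itself, for comparison
eps = 1e-6
fd = (gauss_pdf(x, mu + eps) - gauss_pdf(x, mu - eps)) / (2 * eps)
```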