Understanding Artificial Neural Networks From Scratch
Learn how to build artificial neural networks from scratch, focusing on multi-level feedforward networks like multi-level perceptrons. Discover how neural networks function, including training large networks in parallel and distributed systems, and grasp concepts such as learning non-linear functions and performing logic operations like XOR. Dive into the structure of neural networks, the training process, gradient descent approaches, stochastic gradient descent, and the algorithm for learning neural networks.
Presentation Transcript
Artificial Neural Networks from Scratch
Learn to build a neural network from scratch.
o Focus on multi-level feedforward neural networks (multi-level perceptrons).
Training large neural networks is one of the most important workloads in large-scale parallel and distributed systems.
o Programming assignments throughout the semester will use this.
What do (deep) neural networks do? Learning (highly) non-linear functions, such as the logic XOR operation:
x1  x2 | x1 XOR x2
0   0  | 0
0   1  | 1
1   0  | 1
1   1  | 0
Artificial neural network example
A neural network consists of layers of artificial neurons and connections between them. Each connection is associated with a weight. Training a neural network means finding the right weights (and biases) such that the error across the training data is minimized.
(Figure: input layer, hidden layers 1-3, output layer.)
Training a neural network
A neural network is trained with m training samples (x(1), y(1)), (x(2), y(2)), ..., (x(m), y(m)), where x(i) is an input vector and y(i) is an output vector.
Training objective: minimize the prediction error (loss) E = sum_{i=1}^{m} (y(i) - NN(x(i)))^2, where NN(x(i)) is the predicted output vector for the input vector x(i).
Approach: gradient descent (stochastic gradient descent, batch gradient descent, mini-batch gradient descent).
o Use the error to adjust the weight values so as to reduce the loss. The adjustment amount is proportional to the contribution of each weight to the loss: given an error, adjust each weight a little to reduce the error.
Stochastic gradient descent
Given one training sample (x(i), y(i)):
o Compute the output of the neural network NN(x(i)).
o Training objective: minimize the prediction error (loss). There are different ways to define the error; the following is one example: E = 1/2 (y(i) - NN(x(i)))^2.
o Estimate how much each weight w_k in W contributes to the error: ∂E/∂w_k.
o Update the weight: w_k = w_k - α ∂E/∂w_k, where α is the learning rate.
Algorithm for learning an artificial neural network
Initialize the weights W = [w_0, w_1, ..., w_n].
Training:
o For each training sample (x(i), y(i)), use forward propagation to compute the neural network output vector NN(x(i)).
o Compute the error E (various definitions are possible).
o Use backward propagation to compute ∂E/∂w_k for each weight w_k.
o Update w_k = w_k - α ∂E/∂w_k.
o Repeat until E is sufficiently small.
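A minimal sketch of this loop in C++, assuming a hypothetical Model interface whose forward/backward/update names are illustrative (not taken from the course code) and a single scalar output for simplicity:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical interface: any model that can do forward/backward/update.
struct Model {
    virtual double forward(const std::vector<double>& x) = 0;            // NN(x)
    virtual void   backward(const std::vector<double>& x, double y) = 0; // dE/dw for all weights
    virtual void   update(double rate) = 0;                              // w = w - rate * dE/dw
    virtual ~Model() = default;
};

// One pass of stochastic gradient descent over the training set.
// The caller repeats this until the returned error is sufficiently small.
double train_epoch(Model& m,
                   const std::vector<std::vector<double>>& X,
                   const std::vector<double>& Y,
                   double rate) {
    double total_error = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        double o = m.forward(X[i]);                    // forward propagation
        total_error += 0.5 * (Y[i] - o) * (Y[i] - o);  // E = 1/2 (Y - O)^2
        m.backward(X[i], Y[i]);                        // backward propagation
        m.update(rate);                                // gradient descent step
    }
    return total_error;
}
```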
A single neuron
An artificial neuron has two components: (1) a weighted sum and (2) an activation function. Given inputs x_1, ..., x_m with weights w_1, ..., w_m and bias b, the neuron computes the weighted sum s = w_1 x_1 + ... + w_m x_m + b and outputs f(s), where f is the activation function.
o Many activation functions exist: sigmoid, ReLU, etc.
Sigmoid function
sigmoid(s) = 1 / (1 + e^{-s})
The derivative of the sigmoid function: d sigmoid(s)/ds = sigmoid(s) (1 - sigmoid(s)). If O = sigmoid(s), then dO/ds = O (1 - O).
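In C++, the sigmoid and its derivative (written in terms of the output O, as above) can be implemented as, for example:

```cpp
#include <cmath>

// sigmoid(s) = 1 / (1 + e^{-s})
double sigmoid(double s) {
    return 1.0 / (1.0 + std::exp(-s));
}

// Derivative expressed in terms of the output O = sigmoid(s):
// d sigmoid(s)/ds = O * (1 - O)
double sigmoid_derivative_from_output(double O) {
    return O * (1.0 - O);
}
```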
Training for the logic AND with a single neuron
In general, one neuron can be trained to realize a linear function. The logic AND function is a linear function:
x1  x2 | x1 AND x2
0   0  | 0
0   1  | 0
1   0  | 0
1   1  | 1
Training for the logic AND with a single neuron
Initial weights and bias: w1 = 0, w2 = 0, b = 0. Weighted sum: s = w1 x1 + w2 x2 + b = 0. Output: O = sigmoid(0) = 0.5.
Consider the training sample with input (x1 = 0, x2 = 1) and output Y = 0. The network output is 0.5.
Error: E = 1/2 (Y - O)^2 = 0.125.
To update w1, w2, and b, gradient descent needs to compute ∂E/∂w1, ∂E/∂w2, and ∂E/∂b.
Chain rule for calculating ∂E/∂w1, ∂E/∂w2, and ∂E/∂b
(Same neuron as above: s = w1 x1 + w2 x2 + b = 0, O = sigmoid(0) = 0.5.)
If a variable z depends on the variable y, which itself depends on the variable x, then z depends on x as well, via the intermediate variable y. The chain rule expresses the derivative as dz/dx = (dz/dy) (dy/dx).
Applied here: ∂E/∂w1 = ∂E/∂O * ∂O/∂s * ∂s/∂w1.
Training for the logic AND with a single neuron
∂E/∂w1 = ∂E/∂O * ∂O/∂s * ∂s/∂w1
∂E/∂O = ∂(1/2 (O - Y)^2)/∂O = O - Y = 0.5 - 0 = 0.5
∂O/∂s = ∂sigmoid(s)/∂s = sigmoid(s) (1 - sigmoid(s)) = 0.5 * (1 - 0.5) = 0.25
∂s/∂w1 = ∂(w1 x1 + w2 x2 + b)/∂w1 = x1 = 0
To update w1: w1 = w1 - α ∂E/∂w1 = 0 - 0.1 * 0.5 * 0.25 * 0 = 0 (assume learning rate α = 0.1).
Training for the logic AND with a single neuron
∂E/∂w2 = ∂E/∂O * ∂O/∂s * ∂s/∂w2
∂E/∂O = ∂(1/2 (O - Y)^2)/∂O = O - Y = 0.5 - 0 = 0.5
∂O/∂s = ∂sigmoid(s)/∂s = sigmoid(s) (1 - sigmoid(s)) = 0.5 * (1 - 0.5) = 0.25
∂s/∂w2 = ∂(w1 x1 + w2 x2 + b)/∂w2 = x2 = 1
To update w2: w2 = w2 - α ∂E/∂w2 = 0 - 0.1 * 0.5 * 0.25 * 1 = -0.0125.
Training for the logic AND with a single neuron
∂E/∂b = ∂E/∂O * ∂O/∂s * ∂s/∂b
∂E/∂O = O - Y = 0.5 - 0 = 0.5
∂O/∂s = sigmoid(s) (1 - sigmoid(s)) = 0.5 * (1 - 0.5) = 0.25
∂s/∂b = 1
To update b: b = b - α ∂E/∂b = 0 - 0.1 * 0.5 * 0.25 * 1 = -0.0125.
Training for the logic AND with a single neuron
After this update step: w1 = 0, w2 = -0.0125, b = -0.0125.
This process is repeated until the error is sufficiently small.
The initial weights should be randomized. Gradient descent can get stuck in a local optimum.
See lect7/one.cpp for training the logic AND operation with a single neuron.
Note: the logic XOR operation is non-linear and cannot be trained with one neuron.
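A compact, self-contained sketch of this training process in C++, in the spirit of lect7/one.cpp but not the course file itself (the learning rate, epoch count, and stopping threshold are arbitrary choices here):

```cpp
#include <cmath>
#include <cstdio>

// One sigmoid neuron trained on the logic AND truth table with
// stochastic gradient descent.
int main() {
    const double X[4][2] = {{0,0},{0,1},{1,0},{1,1}};
    const double Y[4]    = {0, 0, 0, 1};            // AND outputs
    double w1 = 0.0, w2 = 0.0, b = 0.0;             // better: small random initial values
    const double rate = 0.5;                        // learning rate

    for (int epoch = 0; epoch < 100000; ++epoch) {
        double E = 0.0;
        for (int i = 0; i < 4; ++i) {
            double s = w1 * X[i][0] + w2 * X[i][1] + b;  // weighted sum
            double O = 1.0 / (1.0 + std::exp(-s));       // sigmoid activation
            E += 0.5 * (Y[i] - O) * (Y[i] - O);          // E = 1/2 (Y - O)^2
            double dE_ds = (O - Y[i]) * O * (1.0 - O);   // chain rule: dE/dO * dO/ds
            w1 -= rate * dE_ds * X[i][0];                // ds/dw1 = x1
            w2 -= rate * dE_ds * X[i][1];                // ds/dw2 = x2
            b  -= rate * dE_ds;                          // ds/db  = 1
        }
        if (E < 0.01) break;                             // stop when the total error is small
    }
    for (int i = 0; i < 4; ++i) {
        double O = 1.0 / (1.0 + std::exp(-(w1 * X[i][0] + w2 * X[i][1] + b)));
        std::printf("%g AND %g -> %.3f\n", X[i][0], X[i][1], O);
    }
    return 0;
}
```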
Multi-level feedforward neural networks
A multi-level feedforward neural network consists of multiple levels of neurons. Each level can have many neurons, and the connections between neurons in different levels do not form loops.
o Information moves in one direction (forward) from input nodes, through hidden nodes, to output nodes.
One artificial neuron can only realize a linear function. Many levels of neurons combine such functions, so the network can learn arbitrarily complex functions.
o One hidden layer (with an infinite number of neurons) can approximate any continuous function.
Multi-level feedforward neural network examples
A layer of neurons that does not directly connect to the outputs is called a hidden layer.
(Figure: input layer, hidden layers 1-3, output layer.)
Build a 3-level neural network from scratch
3 levels: input level, hidden level, output level.
o Other assumptions: fully connected between layers; all neurons use sigmoid() as the activation function.
Notations:
o N0: size of the input level. Input: IN[N0] = [IN_1, IN_2, ..., IN_N0]
o N1: size of the hidden layer
o N2: size of the output layer. Output: OO[N2] = [OO_1, OO_2, ..., OO_N2]
Build a 3-level neural network from scratch
Notations:
o N0, N1, N2: sizes of the input layer, hidden layer, and output layer, respectively.
o W0[N0][N1]: weights from the input layer to the hidden layer. W0[i][j] is the weight from input unit i to hidden unit j. B1[N1]: hidden-layer biases, B1 = [B1_1, B1_2, ..., B1_N1].
o W1[N1][N2]: weights from the hidden layer to the output layer. W1[i][j] is the weight from hidden unit i to output unit j. B2[N2]: output-layer biases, B2 = [B2_1, B2_2, ..., B2_N2].
o In matrix form:
W0[N0][N1] = [ W0[1][1] ... W0[1][N1] ; ... ; W0[N0][1] ... W0[N0][N1] ]
W1[N1][N2] = [ W1[1][1] ... W1[1][N2] ; ... ; W1[N1][1] ... W1[N1][N2] ]
3-level feedforward neural network
Output layer (units 1..N2): output OO[N2], weighted sum OS[N2], biases B2[N2].
Weights from hidden layer to output layer: W1[N1][N2].
Hidden layer (units 1..N1): output HO[N1], weighted sum HS[N1], biases B1[N1].
Weights from input layer to hidden layer: W0[N0][N1].
Input layer (units 1..N0): input IN[N0].
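One possible way to hold these arrays in C++, assuming std::vector storage (a sketch; 3level.cpp may organize the data differently):

```cpp
#include <vector>

// Storage for the 3-level network, following the notation in the slides.
struct Network3 {
    int N0, N1, N2;                        // layer sizes
    std::vector<double> IN;                // IN[N0]: input vector
    std::vector<std::vector<double>> W0;   // W0[N0][N1]: input -> hidden weights
    std::vector<double> B1, HS, HO;        // B1[N1] biases, HS[N1] weighted sums, HO[N1] outputs
    std::vector<std::vector<double>> W1;   // W1[N1][N2]: hidden -> output weights
    std::vector<double> B2, OS, OO;        // B2[N2] biases, OS[N2] weighted sums, OO[N2] outputs

    Network3(int n0, int n1, int n2)
        : N0(n0), N1(n1), N2(n2),
          IN(n0), W0(n0, std::vector<double>(n1)),
          B1(n1), HS(n1), HO(n1),
          W1(n1, std::vector<double>(n2)),
          B2(n2), OS(n2), OO(n2) {}
};
```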
Forward propagation (compute OO and E)
Compute the hidden-layer weighted sums HS[N1] = [HS_1, HS_2, ..., HS_N1]:
o HS_j = IN_1 * W0[1][j] + IN_2 * W0[2][j] + ... + IN_N0 * W0[N0][j] + B1_j
o In matrix form: HS = IN · W0 + B1
Compute the hidden-layer outputs HO[N1] = [HO_1, HO_2, ..., HO_N1]:
o HO_j = sigmoid(HS_j)
o In matrix form: HO = sigmoid(HS)
Forward propagation
From the input (IN[N0]), compute the output (OO[N2]) and the error E.
Compute the output-layer weighted sums OS[N2] = [OS_1, OS_2, ..., OS_N2]:
o OS_j = HO_1 * W1[1][j] + HO_2 * W1[2][j] + ... + HO_N1 * W1[N1][j] + B2_j
o In matrix form: OS = HO · W1 + B2
Compute the final outputs OO[N2] = [OO_1, OO_2, ..., OO_N2]:
o OO_j = sigmoid(OS_j)
o In matrix form: OO = sigmoid(OS)
Let us use the mean square error: E = (1/N2) * sum_{i=1}^{N2} (OO_i - Y_i)^2
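A possible implementation of this forward pass in C++, following the array names above (a sketch assuming std::vector storage, not necessarily how 3level.cpp writes it):

```cpp
#include <cmath>
#include <vector>

double sigmoid(double s) { return 1.0 / (1.0 + std::exp(-s)); }

// Forward propagation for the 3-level network: fill HS, HO, OS, OO
// from the input IN, and return the mean square error against target Y.
double forward(const std::vector<double>& IN,
               const std::vector<std::vector<double>>& W0, const std::vector<double>& B1,
               const std::vector<std::vector<double>>& W1, const std::vector<double>& B2,
               const std::vector<double>& Y,
               std::vector<double>& HS, std::vector<double>& HO,
               std::vector<double>& OS, std::vector<double>& OO) {
    const int N0 = (int)IN.size(), N1 = (int)B1.size(), N2 = (int)B2.size();
    for (int j = 0; j < N1; ++j) {                       // hidden layer: HS = IN * W0 + B1
        HS[j] = B1[j];
        for (int i = 0; i < N0; ++i) HS[j] += IN[i] * W0[i][j];
        HO[j] = sigmoid(HS[j]);                          // HO = sigmoid(HS)
    }
    for (int j = 0; j < N2; ++j) {                       // output layer: OS = HO * W1 + B2
        OS[j] = B2[j];
        for (int i = 0; i < N1; ++i) OS[j] += HO[i] * W1[i][j];
        OO[j] = sigmoid(OS[j]);                          // OO = sigmoid(OS)
    }
    double E = 0.0;                                      // E = 1/N2 * sum (OO_i - Y_i)^2
    for (int i = 0; i < N2; ++i) E += (OO[i] - Y[i]) * (OO[i] - Y[i]);
    return E / N2;
}
```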
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i.
Start with ∂E/∂OO = [∂E/∂OO_1, ∂E/∂OO_2, ..., ∂E/∂OO_N2] = [2/N2 (OO_1 - Y_1), 2/N2 (OO_2 - Y_2), ..., 2/N2 (OO_N2 - Y_N2)].
In matrix form: ∂E/∂OO = (2/N2) (OO - Y).
This can be stored in an array dE_OO[N2].
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i. ∂E/∂OO is done.
∂E/∂OS = [∂E/∂OO_1 * ∂OO_1/∂OS_1, ∂E/∂OO_2 * ∂OO_2/∂OS_2, ..., ∂E/∂OO_N2 * ∂OO_N2/∂OS_N2]
       = [∂E/∂OO_1 * sigmoid(OS_1)(1 - sigmoid(OS_1)), ..., ∂E/∂OO_N2 * sigmoid(OS_N2)(1 - sigmoid(OS_N2))]
In matrix form: ∂E/∂OS = (2/N2) (OO - Y) ⊙ OO ⊙ (1 - OO)   (element-wise products).
This can be stored in an array dE_OS[N2].
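In code, these two arrays can be filled in one loop (a sketch following the dE_OO/dE_OS names above):

```cpp
#include <vector>

// Gradients at the output layer:
// dE_OO[i] = dE/dOO_i = 2/N2 * (OO_i - Y_i)
// dE_OS[i] = dE/dOS_i = dE_OO[i] * OO_i * (1 - OO_i)
void output_gradients(const std::vector<double>& OO, const std::vector<double>& Y,
                      std::vector<double>& dE_OO, std::vector<double>& dE_OS) {
    const int N2 = (int)OO.size();
    for (int i = 0; i < N2; ++i) {
        dE_OO[i] = 2.0 / N2 * (OO[i] - Y[i]);
        dE_OS[i] = dE_OO[i] * OO[i] * (1.0 - OO[i]);   // sigmoid'(OS_i) = OO_i (1 - OO_i)
    }
}
```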
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i. ∂E/∂OO and ∂E/∂OS are done.
∂E/∂B2 = [∂E/∂OS_1 * ∂OS_1/∂B2_1, ∂E/∂OS_2 * ∂OS_2/∂B2_2, ..., ∂E/∂OS_N2 * ∂OS_N2/∂B2_N2]
Since OS_j = HO_1 * W1[1][j] + HO_2 * W1[2][j] + ... + HO_N1 * W1[N1][j] + B2_j, we have ∂OS_j/∂B2_j = 1.
Hence, ∂E/∂B2 = [∂E/∂OS_1, ∂E/∂OS_2, ..., ∂E/∂OS_N2] = ∂E/∂OS.
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i. ∂E/∂OO, ∂E/∂OS, and ∂E/∂B2 are done.
∂E/∂W1 is an N1 x N2 matrix whose (i, j) entry is ∂E/∂W1[i][j] = ∂E/∂OS_j * ∂OS_j/∂W1[i][j].
Since OS_j = HO_1 * W1[1][j] + HO_2 * W1[2][j] + ... + HO_N1 * W1[N1][j] + B2_j, we have ∂OS_j/∂W1[i][j] = HO_i.
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i. ∂E/∂OO, ∂E/∂OS, and ∂E/∂B2 are done.
∂E/∂W1 = [ ∂E/∂OS_1 * HO_1 ... ∂E/∂OS_N2 * HO_1 ; ... ; ∂E/∂OS_1 * HO_N1 ... ∂E/∂OS_N2 * HO_N1 ]
In matrix form: ∂E/∂W1 = HO^T · ∂E/∂OS   (an outer product: entry (i, j) is HO_i * ∂E/∂OS_j).
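A sketch of the corresponding weight and bias gradients in code, using hypothetical arrays dE_W1 and dE_B2 for ∂E/∂W1 and ∂E/∂B2 (the dE_OO/dE_OS names come from the slides; these two are chosen here by analogy):

```cpp
#include <vector>

// Gradients for the hidden->output weights and output biases:
// dE/dW1[i][j] = HO[i] * dE_OS[j]   (outer product HO^T * dE_OS)
// dE/dB2[j]    = dE_OS[j]
void output_weight_gradients(const std::vector<double>& HO, const std::vector<double>& dE_OS,
                             std::vector<std::vector<double>>& dE_W1,
                             std::vector<double>& dE_B2) {
    const int N1 = (int)HO.size(), N2 = (int)dE_OS.size();
    for (int i = 0; i < N1; ++i)
        for (int j = 0; j < N2; ++j)
            dE_W1[i][j] = HO[i] * dE_OS[j];
    for (int j = 0; j < N2; ++j)
        dE_B2[j] = dE_OS[j];
}
```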
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i. ∂E/∂OO, ∂E/∂OS, ∂E/∂B2, and ∂E/∂W1 are done.
∂E/∂HO = [∂E/∂HO_1, ∂E/∂HO_2, ..., ∂E/∂HO_N1]
∂E/∂HO_i = ∂E/∂OS_1 * ∂OS_1/∂HO_i + ∂E/∂OS_2 * ∂OS_2/∂HO_i + ... + ∂E/∂OS_N2 * ∂OS_N2/∂HO_i
         = ∂E/∂OS_1 * W1[i][1] + ∂E/∂OS_2 * W1[i][2] + ... + ∂E/∂OS_N2 * W1[i][N2]
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i. ∂E/∂OO, ∂E/∂OS, ∂E/∂B2, and ∂E/∂W1 are done.
In matrix form: ∂E/∂HO = [∂E/∂HO_1, ∂E/∂HO_2, ..., ∂E/∂HO_N1] = ∂E/∂OS · W1^T.
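A sketch of this step, storing ∂E/∂HO in a hypothetical array dE_HO (named here by analogy with dE_OO and dE_OS):

```cpp
#include <vector>

// Propagate the error one layer back:
// dE/dHO[i] = sum_j dE_OS[j] * W1[i][j]   (matrix form: dE_OS * W1^T)
void hidden_output_gradient(const std::vector<std::vector<double>>& W1,
                            const std::vector<double>& dE_OS,
                            std::vector<double>& dE_HO) {
    const int N1 = (int)dE_HO.size(), N2 = (int)dE_OS.size();
    for (int i = 0; i < N1; ++i) {
        dE_HO[i] = 0.0;
        for (int j = 0; j < N2; ++j) dE_HO[i] += dE_OS[j] * W1[i][j];
    }
}
```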
Backward propagation
The goal is to compute ∂E/∂W0[i][j], ∂E/∂W1[i][j], ∂E/∂B1_i, and ∂E/∂B2_i. ∂E/∂OO, ∂E/∂OS, ∂E/∂B2, ∂E/∂W1, and ∂E/∂HO are done.
Once ∂E/∂HO is computed, we can repeat the same process for the hidden layer by replacing OO with HO, OS with HS, B2 with B1, and W1 with W0 in the differential equations. For this layer the input is IN[N0] and the output is HO[N1].
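A sketch of that repeated step for the hidden layer, folding the gradient descent update in directly (the function name and the in-place update are choices made here, not taken from 3level.cpp):

```cpp
#include <vector>

// Same pattern applied to the hidden layer (OO -> HO, OS -> HS, W1 -> W0,
// B2 -> B1, input IN), followed by the gradient descent update.
void hidden_gradients_and_update(const std::vector<double>& IN,
                                 const std::vector<double>& HO,
                                 const std::vector<double>& dE_HO,
                                 std::vector<std::vector<double>>& W0, std::vector<double>& B1,
                                 double rate) {
    const int N0 = (int)IN.size(), N1 = (int)HO.size();
    std::vector<double> dE_HS(N1);
    for (int j = 0; j < N1; ++j)
        dE_HS[j] = dE_HO[j] * HO[j] * (1.0 - HO[j]);   // sigmoid'(HS_j) = HO_j (1 - HO_j)
    for (int i = 0; i < N0; ++i)
        for (int j = 0; j < N1; ++j)
            W0[i][j] -= rate * dE_HS[j] * IN[i];       // dE/dW0[i][j] = dE_HS[j] * IN[i]
    for (int j = 0; j < N1; ++j)
        B1[j] -= rate * dE_HS[j];                      // dE/dB1[j] = dE_HS[j]
}
```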
Summary
Layer 1: X -> H1; Layer 2: H1 -> H2; Layer 3: H2 -> O. The output of a layer is the input of the next layer.
Backward propagation uses results from forward propagation. For one layer with input X, weighted sum S = X · W + B, and output O = sigmoid(S):
o ∂E/∂S = ∂E/∂O ⊙ O ⊙ (1 - O)
o ∂E/∂W = X^T · ∂E/∂S
o ∂E/∂B = ∂E/∂S
o ∂E/∂X = ∂E/∂S · W^T
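The whole per-layer pattern can be captured in one reusable function; the following is a sketch of that idea under the assumptions above, not the structure of 3level.cpp:

```cpp
#include <vector>

// Generic backward step for one sigmoid layer with input X, output O,
// weights W[X.size()][O.size()] and biases B. Given dE/dO, it updates W and B
// and returns dE/dX for the previous layer.
std::vector<double> layer_backward(const std::vector<double>& X,
                                   const std::vector<double>& O,
                                   const std::vector<double>& dE_O,
                                   std::vector<std::vector<double>>& W,
                                   std::vector<double>& B,
                                   double rate) {
    const int NI = (int)X.size(), NO = (int)O.size();
    std::vector<double> dE_S(NO), dE_X(NI, 0.0);
    for (int j = 0; j < NO; ++j)
        dE_S[j] = dE_O[j] * O[j] * (1.0 - O[j]);       // dE/dS = dE/dO * O * (1 - O)
    for (int i = 0; i < NI; ++i)                       // dE/dX = dE/dS * W^T (uses old W)
        for (int j = 0; j < NO; ++j)
            dE_X[i] += dE_S[j] * W[i][j];
    for (int i = 0; i < NI; ++i)                       // dE/dW = X^T * dE/dS, then update
        for (int j = 0; j < NO; ++j)
            W[i][j] -= rate * dE_S[j] * X[i];
    for (int j = 0; j < NO; ++j)                       // dE/dB = dE/dS, then update
        B[j] -= rate * dE_S[j];
    return dE_X;
}
```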
Training for the logic XOR and AND with a 6-unit 2-level neural network
The logic XOR function is not a linear function (it cannot be trained with lect8/one.cpp). See 3level.cpp.
x1  x2 | x1 AND x2 | x1 XOR x2
0   0  | 0         | 0
0   1  | 0         | 1
1   0  | 0         | 1
1   1  | 1         | 0
Summary
o Briefly discussed multi-level feedforward neural networks.
o The training of neural networks.
o Following 3level.cpp, one should be able to write a program for any multi-level feedforward neural network.