A Deep Dive into Neural Network Units and Language Models

 
Simple Neural Networks and Neural Language Models
Units in Neural Networks
 
 
This is in your brain
 
 
By BruceBlaus - Own work, CC BY 3.0,
https://commons.wikimedia.org/w/index.php?curid=28761830
Neural Network Unit
This is not in your brain
 
 
[Diagram: a neural unit taking 3 inputs x1, x2, x3 (and a bias b, represented as the weight for an input clamped at +1); weights w1, w2, w3 feed a weighted sum z, a non-linear transform, and an output value y]

Neural unit

Take a weighted sum of the inputs, plus a bias:

z = b + Σ_i w_i x_i

In vector notation, with a weight vector w, a scalar bias b, and an input vector x:

z = w · x + b

Instead of just using z, we'll apply a nonlinear activation function f:

y = a = f(z)
 
Non-Linear Activation Functions

Sigmoid

We've already seen the sigmoid for logistic regression:

y = σ(z) = 1 / (1 + e^(−z))

Final function the unit is computing:

y = σ(w · x + b) = 1 / (1 + exp(−(w · x + b)))

Final unit again
 
 
[Diagram: the same unit, now labeling the non-linear activation function: input layer, weights, bias (+1 node), weighted sum, non-linear activation function, output value]
An example

Suppose a unit has:
w = [0.2, 0.3, 0.9]
b = 0.5

What happens with the following input x?
x = [0.5, 0.6, 0.1]

The resulting output y would be:

y = σ(w · x + b) = 1 / (1 + e^(−(w · x + b)))
  = 1 / (1 + e^(−(0.5·0.2 + 0.6·0.3 + 0.1·0.9 + 0.5)))
  = 1 / (1 + e^(−0.87)) ≈ 0.70
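A minimal numpy sketch of this computation, using the weights, bias, and input above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])
b = 0.5
x = np.array([0.5, 0.6, 0.1])

z = w @ x + b      # weighted sum plus bias: 0.87
y = sigmoid(z)     # non-linear activation: ~0.70
print(z, y)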
 
 
Non-Linear Activation Functions besides sigmoid

tanh, a variant of the sigmoid that ranges from −1 to +1:

y = (e^z − e^(−z)) / (e^z + e^(−z))

ReLU (Rectified Linear Unit), the most common: just the same as z when z is positive, and 0 otherwise:

y = max(z, 0)

(In the sigmoid and tanh, very high values of z are saturated: extremely close to 1, with derivatives very close to 0, which causes the vanishing gradient problem during training. ReLU doesn't have this problem, since its derivative for high values of z is 1.)
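A minimal sketch of the three activation functions in numpy:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # maps to (0, 1)

def tanh(z):
    return np.tanh(z)             # maps to (-1, +1)

def relu(z):
    return np.maximum(z, 0)       # z if positive, else 0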
 
Simple Neural Networks and Neural Language Models
The XOR problem
 
 
The XOR problem

Can neural units compute simple functions of input, like AND, OR, and XOR?

Minsky and Papert (1969)

x1 x2 | AND  OR  XOR
 0  0 |  0    0    0
 0  1 |  0    1    1
 1  0 |  0    1    1
 1  1 |  1    1    0

Perceptrons

A very simple neural unit:
Binary output (0 or 1)
No non-linear activation function
 
Easy to build AND or OR with perceptrons

[Diagram: perceptrons computing AND and OR from inputs x1, x2 and a +1 bias node]
Not possible to capture XOR with perceptrons
 
Pause the lecture and try for yourself!
 
Why? Perceptrons are linear classifiers

The perceptron equation, given x1 and x2, is the equation of a line:

w1 x1 + w2 x2 + b = 0

(in standard linear format:  x2 = (−w1/w2) x1 + (−b/w2))

This line acts as a decision boundary:
0 if the input is on one side of the line
1 if it is on the other side of the line
 
Decision boundaries

[Figure: decision boundaries in the x1-x2 plane for a) x1 AND x2, b) x1 OR x2, c) x1 XOR x2; a single line separates the 0s from the 1s for AND and OR, but no single line can for XOR]

XOR is not a linearly separable function!
 
Solution to the XOR problem

XOR can't be calculated by a single perceptron.
XOR can be calculated by a layered network of units.

[Diagram: a 2-layer XOR network with ReLU hidden units h1 and h2: h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 − 1), output y = h1 − 2 h2]

The hidden representation h

(With learning: hidden layers will learn to form useful representations.)
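A minimal numpy sketch of this XOR network, with the weights from the diagram above:

import numpy as np

def relu(z):
    return np.maximum(z, 0)

W1 = np.array([[1, 1],    # h1 = ReLU(x1 + x2)
               [1, 1]])   # h2 = ReLU(x1 + x2 - 1)
b1 = np.array([0, -1])
W2 = np.array([1, -2])    # y = h1 - 2*h2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W1 @ np.array(x) + b1)   # hidden representation h
    y = W2 @ h                        # 0, 1, 1, 0: XOR!
    print(x, "->", y)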
 
Simple Neural Networks and Neural Language Models
Feedforward Neural Networks
 
 
Feedforward Neural Networks

Can also be called multi-layer perceptrons (or MLPs) for historical reasons
 
Binary Logistic Regression as a 1-layer Network

[Diagram: input layer, the vector x = x1 ... xn plus a +1 node; weight vector w = w1 ... wn and scalar bias b; output layer, a single σ node computing the scalar y = σ(w · x + b)]

(we don't count the input layer in counting layers!)
 
Multinomial Logistic Regression as a 1-layer Network

[Diagram: fully connected single-layer network. Input layer: scalars x1 ... xn plus a +1 node; W is a matrix and b is a vector; output layer: softmax nodes producing the vector y = y1 ... yn]

Reminder: softmax, a generalization of sigmoid

For a vector z of dimensionality k, the softmax is:

softmax(z_i) = e^(z_i) / Σ_{j=1}^{k} e^(z_j)    for 1 ≤ i ≤ k

Example: see the sketch below.
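A minimal softmax sketch (the input values here are just illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])  # illustrative values
print(softmax(z))        # each entry in (0, 1); the largest z gets the largest share
print(softmax(z).sum())  # 1.0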
 
Two-Layer Network with scalar output

[Diagram: input layer, the vector x = x1 ... xn plus a +1 node; weight matrix W and bias b into a layer of hidden units (σ nodes here, but they could be ReLU or tanh); weight vector U into an output layer with a single σ node; y is a scalar]

h = σ(W x + b)
y = σ(U h)

(W_ji is the weight from input unit i to hidden unit j)

Two-Layer Network with softmax output

[Diagram: the same architecture, but the output layer is a set of softmax nodes, so y is a vector]

h = σ(W x + b)
y = softmax(U h)
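A minimal forward-pass sketch of the two-layer network with softmax output (the layer sizes here are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # 3 inputs -> 4 hidden units
b = rng.normal(size=4)
U = rng.normal(size=(2, 4))   # 4 hidden units -> 2 output classes

x = np.array([0.5, 0.6, 0.1])
h = sigmoid(W @ x + b)   # hidden layer (could be ReLU or tanh)
y = softmax(U @ h)       # output distribution over the 2 classes
print(y, y.sum())        # probabilities summing to 1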
 
Multi-layer Notation

[Diagram: the same two-layer network with superscript layer indices: weights W[1] and bias b[1] into a ReLU hidden layer, then weights W[2] and bias b[2] into a sigmoid or softmax output layer]

z[1] = W[1] a[0] + b[1],   a[1] = ReLU(z[1])      (where a[0] = x)
z[2] = W[2] a[1] + b[2],   a[2] = sigmoid or softmax(z[2])
y = a[2]
 
Replacing the bias unit

Let's switch to a notation without the bias unit. Just a notational change:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So for the input layer a[0]_0 = 1, and a[1]_0 = 1, a[2]_0 = 1, ...

Instead of:
x = x1, x2, ..., x_n0

We'll do this:
x = x0, x1, x2, ..., x_n0

so that the column of weights multiplying the dummy x0 = 1 plays the role of the bias b.
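A minimal sketch of the trick (the shapes here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = np.array([0.5, 0.6, 0.1])

z_with_bias = W @ x + b                 # explicit bias term

x_aug = np.concatenate(([1.0], x))      # dummy node x0 = 1
W_aug = np.hstack([b[:, None], W])      # bias becomes the weights for x0
z_folded = W_aug @ x_aug

print(np.allclose(z_with_bias, z_folded))  # True: just a notational change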
 
Simple Neural Networks and Neural Language Models
Applying feedforward networks to NLP tasks
 
Use cases for feedforward networks

Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling

State-of-the-art systems use more powerful neural architectures, but simple models are useful to consider!
 
Classification: Sentiment Analysis

We could do exactly what we did with logistic regression:
The input layer holds binary features, as before
The output layer is a single σ node producing 0 or 1

Sentiment Features

[Figure: example hand-built sentiment features for the input layer]
Feedforward nets for simple classification

Just adding a hidden layer to logistic regression:
allows the network to use non-linear interactions between features,
which may (or may not) improve performance.

[Diagram: logistic regression (features feeding a σ output directly) next to a 2-layer feedforward network (features feeding a hidden layer, then a σ output)]
Even better: representation learning

The real power of deep learning comes from the ability to learn features from the data:
Instead of using hand-built human-engineered features for classification,
use learned representations like embeddings!

[Diagram: word embeddings e1, e2, ..., en feeding a hidden layer and a σ output]
Neural Net Classification with embeddings as input features!

[Diagram: the embeddings of 3 input words concatenated to form the input layer]

Issue: texts come in different sizes

This assumes a fixed input length (3 words)! Kind of unrealistic.
Some simple solutions (more sophisticated solutions later):
1. Make the input the length of the longest review
   If shorter, pad with zero embeddings
   Truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words:
   take the mean of all the word embeddings, or
   take the element-wise max of all the word embeddings (for each dimension, pick the max value from all words)
(see the pooling sketch below)
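A minimal pooling sketch (illustrative sizes: 4 words, 5-dimensional embeddings):

import numpy as np

rng = np.random.default_rng(0)
word_embs = rng.normal(size=(4, 5))     # one row per word

sent_mean = word_embs.mean(axis=0)      # mean of all word embeddings
sent_max = word_embs.max(axis=0)        # element-wise max
print(sent_mean.shape, sent_max.shape)  # both (5,): same size as one word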
 
Reminder: Multiclass Outputs

What if you have more than two output classes?
Add more output units (one for each class)
And use a "softmax layer"

[Diagram: inputs x1 ... xn, weights W into a hidden layer, weights U into a softmax output layer]
Neural Language Models (LMs)

Language Modeling: calculating the probability of the next word in a sequence given some history.
We've seen N-gram based LMs
But neural network LMs far outperform N-gram language models
State-of-the-art neural LMs are based on more powerful neural network technology like Transformers
But simple feedforward LMs can do almost as well!
Simple feedforward Neural Language Models

Task: predict the next word w_t given prior words w_(t-1), w_(t-2), w_(t-3), ...
Problem: now we're dealing with sequences of arbitrary length.
Solution: sliding windows (of fixed length)

Neural Language Model

[Diagram: a feedforward neural LM over a sliding window: the embeddings of the previous words are concatenated into the input layer, fed through a hidden layer, and a softmax over the vocabulary gives the probability of each possible next word]
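A minimal forward-pass sketch of such a feedforward LM (the vocabulary size, embedding size, window size, hidden size, and ReLU activation here are all illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, N, dh = 10, 8, 3, 16           # vocab, embedding dim, window, hidden
E = rng.normal(size=(V, d))          # embedding matrix, one row per word
W = rng.normal(size=(dh, N * d))
b = rng.normal(size=dh)
U = rng.normal(size=(V, dh))

context = [4, 7, 2]                          # ids of the 3 previous words
e = np.concatenate([E[w] for w in context])  # concatenated embeddings
h = np.maximum(W @ e + b, 0)                 # ReLU hidden layer
p = softmax(U @ h)                           # P(next word) over the vocab
print(p.argmax(), p.sum())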
Why Neural LMs work better than N-gram LMs

Training data:
We've seen:  "I have to make sure that the cat gets fed."
Never seen:  "dog gets fed"

Test data:
"I forgot to make sure that the dog gets ___"

An N-gram LM can't predict "fed"!
A neural LM can use the similarity of the "cat" and "dog" embeddings to generalize and predict "fed" after "dog".
 
 
 
 
 
 
Simple Neural Networks and Neural Language Models
Training Neural Nets: Overview
 
Intuition: training a 2-layer network

[Diagram: a 2-layer network (weights W and U) processing a training instance x1 ... xn]
Reminder: Loss Function for binary logistic regression

A measure of how far off the current answer is from the right answer.
Cross-entropy loss for logistic regression:

L_CE(ŷ, y) = −[ y log ŷ + (1 − y) log(1 − ŷ) ]
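A minimal sketch, plugging in the unit's output from the earlier example (ŷ ≈ 0.70):

import numpy as np

def cross_entropy(y_hat, y):
    # binary cross-entropy for one example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy(0.70, 1))  # ~0.36: right answer -> small loss
print(cross_entropy(0.70, 0))  # ~1.20: wrong answer -> large loss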
 
Reminder: gradient descent for weight updates

∂L_CE(ŷ, y)/∂w_j = (ŷ − y) x_j

Where did that derivative come from?

Using the chain rule!  f(x) = u(v(x))  ⇒  df/dx = du/dv · dv/dx

Intuition (see the text for details):

∂L/∂w_j = (derivative of the loss) × (derivative of the activation) × (derivative of the weighted sum)
 
How can I find that gradient for every weight in the network?

These derivatives on the prior slide only give the updates for one weight layer: the last one!
What about deeper networks?
Lots of layers, different activation functions?
Solution in the next lecture:
Even more use of the chain rule!!
Computation graphs and backward differentiation!
 
Simple Neural Networks and Neural Language Models
Computation Graphs and Backward Differentiation
 
 
Why Computation Graphs

For training, we need the derivative of the loss with respect to each weight in every layer of the network
But the loss is computed only at the very end of the network!
Solution: error backpropagation (Rumelhart, Hinton, Williams, 1986)
Backprop is a special case of backward differentiation
Which relies on computation graphs.
 
Computation Graphs

A computation graph represents the process of computing a mathematical expression

Example: L(a, b, c) = c (a + 2b)

Computations:
d = 2 b
e = a + d
L = c · e

[Figure: the computation graph, with input nodes a, b, c feeding intermediate nodes d and e and the output node L]

Example, forward pass with a = 3, b = 1, c = −2:

Computations:
d = 2 b = 2
e = a + d = 5
L = c · e = −10
Backwards differentiation in computation graphs

The importance of the computation graph comes from the backward pass.
This is used to compute the derivatives that we'll need for the weight update.

Example

We want: ∂L/∂a, ∂L/∂b, ∂L/∂c
The chain rule

Computing the derivative of a composite function:

f(x) = u(v(x))     ⇒  df/dx = du/dv · dv/dx

f(x) = u(v(w(x)))  ⇒  df/dx = du/dv · dv/dw · dw/dx
 
Example, backward pass:

∂L/∂c = e = 5
∂L/∂e = c = −2
∂L/∂d = ∂L/∂e · ∂e/∂d = c · 1 = −2
∂L/∂a = ∂L/∂e · ∂e/∂a = c · 1 = −2
∂L/∂b = ∂L/∂d · ∂d/∂b = −2 · 2 = −4
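A minimal sketch of the forward and backward passes on this graph, with the same values:

a, b, c = 3.0, 1.0, -2.0

# Forward pass
d = 2 * b        # 2
e = a + d        # 5
L = c * e        # -10

# Backward pass: chain rule from L back to the inputs
dL_dc = e              # 5
dL_de = c              # -2
dL_dd = dL_de * 1.0    # e = a + d, so de/dd = 1
dL_da = dL_de * 1.0    # de/da = 1
dL_db = dL_dd * 2.0    # d = 2b, so dd/db = 2
print(dL_da, dL_db, dL_dc)   # -2.0 -4.0 5.0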
Backward differentiation on a two-layer network

[Diagram: inputs x1, x2 (plus +1 nodes for the biases b[1] and b[2]); weights W[1] into a ReLU hidden layer; weights W[2] into a sigmoid output node producing y]

z[1] = W[1] x + b[1],   a[1] = ReLU(z[1])
z[2] = W[2] a[1] + b[2],   a[2] = σ(z[2]),   ŷ = a[2]
 
Summary

For training, we need the derivative of the loss with respect to weights in early layers of the network
But loss is computed only at the very end of the network!
Solution: backward differentiation
Given a computation graph and the derivatives of all the functions in it, we can automatically compute the derivative of the loss with respect to these early weights.
 


  1. Simple Neural Networks and Neural Language Models Units in Neural Networks

  2. This is in your brain By BruceBlaus - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=28761830 2

  3. Neural Network Unit This is not in your brain y Output value a Non-linear transform z Weighted sum bias Weights Input layer w1 w2 w3 b x1 x2 x3 +1 3

  4. 2 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 7.1 Units Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and producesanoutput. 2 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS At itsheart, aneural unit istaking aweighted sumof itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputsx1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sumzcan be represented as: z= b+ biasterm 7.1 Units X wixi (7.1) Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and i NEURAL NETWORKS AND NEURAL LANGUAGE MODELS producesanoutput. At itsheart, aneural unit istaking aweighted sumof itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputsx1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sumzcan be represented as: z= b+ z= w x+ b 2 CHAPTER 7 Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebrathat avector is, at heart, just alist or array of numbers. Thus vector we ll talk about zintermsof aweight vector w, ascalar biasb, andaninput vector x,andwe ll replacethesumwiththeconvenient dot product: 7.1 Units biasterm X Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and producesanoutput. At itsheart, aneural unit istaking aweighted sum of itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputs x1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sum zcan be represented as: X x,andwe ll replacethesumwiththeconvenient dot product: theactivation valuefor theunit, a. Sincewearejust modeling asingle unit, the activationforthenodeisinfactthefinal outputof thenetwork,whichwe ll generally call y. Sothevalueyisdefinedas: wixi (7.1) (7.2) i Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebrathat avector is, at heart, just alist or array of numbers. Thus we ll talk about zintermsof aweight vector w, ascalar biasb, andaninput vector apply anon-linear function f to z. Wewill refer to theoutput of this function as AsdefinedinEq.7.2,zisjust areal valuednumber. Finally, instead of using z, a linear function of x, as the output, neural units Neural unit biasterm vector Take weighted sum of inputs, plus a bias z= b+ wixi (7.1) activation i z= w x+ b (7.2) Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebra that avector is, at heart, just alist or array of numbers. Thus AsdefinedinEq. 7.2,zisjust areal valuednumber. y= a= f(z) vector Instead of just using z, we'll apply a nonlinear activation function f: we ll talk about zin termsof aweight vector w, ascalar biasb, and an input vector x, andwe ll replacethesumwiththeconvenient dot product: Finally, instead of using z, a linear function of x, as the output, neural units apply anon-linear function f to z. Wewill refer to theoutput of this function as theactivation valuefor theunit, a. Since wearejust modeling asingle unit, the activationforthenodeisinfactthefinal outputof thenetwork,whichwe ll generally call y. 
Sothevalueyisdefinedas: sigmoid functionsincewesawit inChapter 5: z= w x+ b (7.2) We ll discussthreepopular non-linear functions f() below (thesigmoid, thetanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the activation Asdefined inEq. 7.2, zisjust areal valued number. Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activationforthenodeisinfact thefinal output of thenetwork,whichwe ll generally call y. Sothevalueyisdefined as: We ll discussthreepopular non-linear functions f() below (thesigmoid, thetanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the sigmoid functionsincewesaw it inChapter 5: into therange[0,1], which isuseful in squashing outliers toward 0 or 1. And it s differentiable, whichaswesawinSection??will behandy for learning. sigmoid activation 1 y= a= f(z) y= s(z) = (7.3) 1+ e z Thesigmoid(showninFig. 7.1) hasanumber of advantages; it mapstheoutput y= a= f(z) sigmoid We ll discuss three popular non-linear functions f() below (the sigmoid, the tanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the sigmoid function sincewesaw it inChapter 5: y= s(z) = 1 (7.3) sigmoid 1+ e z 1 Thesigmoid(showninFig. 7.1) hasanumber of advantages; it mapstheoutput into therange[0,1], which isuseful in squashing outliers toward 0 or 1. And it s differentiable, whichaswesawinSection??will behandy for learning. y= s(z) = (7.3) 1+ e z Thesigmoid (showninFig. 7.1) hasanumber of advantages; it mapstheoutput into the range [0,1], which is useful in squashing outliers toward 0 or 1. And it s differentiable, which aswesaw inSection ??will behandy for learning. Figure7.1 nearly linear around0but outlier valuesget squashedtoward0or 1. The sigmoid function takesa real value and maps it to the range [0,1]. It is Figure7.1 nearly linear around 0but outlier valuesget squashed toward0or 1. nearly linear around0but outlier valuesget squashedtoward0or 1. The sigmoid function takes a real value and maps it to the range [0,1]. It is The sigmoid function takesa real value and maps it to the range [0,1]. It is Figure7.1

  5. 2 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 7.1 Units Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and producesanoutput. At itsheart, aneural unit istaking aweighted sum of itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputs x1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sum zcan be represented as: z= b+ biasterm X wixi (7.1) i Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebra that avector is, at heart, just alist or array of numbers. Thus we ll talk about zin termsof aweight vector w, ascalar biasb, and aninput vector x, andwe ll replacethesumwiththeconvenient dot product: vector z= w x+ b (7.2) Asdefined inEq. 7.2, zisjust areal valued number. Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activationforthenodeisinfactthefinal output of thenetwork,whichwe ll generally call y. Sothevalueyisdefinedas: activation Non-Linear Activation Functions y= a= f(z) We're already seen the sigmoid for logistic regression: We ll discuss three popular non-linear functions f() below (the sigmoid, thetanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the sigmoid function sincewesaw it inChapter 5: sigmoid Sigmoid 1 y= s (z) = (7.3) 1+ e z Thesigmoid (showninFig. 7.1) hasanumber of advantages; it mapstheoutput into the range [0,1], which is useful in squashing outliers toward 0 or 1. And it s differentiable, whichaswesaw inSection ??will behandy for learning. 5 Figure7.1 nearly linear around0but outlier valuesget squashed toward0or 1. The sigmoid function takes a real value and maps it to the range [0,1]. It is

  6. Final function the unit is computing 7.1 UNITS 3 Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 y= s(w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 and1. Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] b = 0.5 What wouldthisunit dowith thefollowinginput vector: x = [0.5,0.6,0.1] Theresulting output ywouldbe: 1 1 1 y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: tanh y=ez e z ez+ e z (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: ReLU y= max(x,0) (7.6)

  7. Final unit again y Output value a Non-linear activation function z Weighted sum bias Weights Input layer w1 w2 w3 b x1 x2 x3 +1 7

  8. 7.1 UNITS 3 Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: 1 y= s (w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: w = [0.2,0.3,0.9] b = 0.5 An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with input x: x = [0.5,0.6,0.1] What would thisunit dowiththefollowinginput vector: x = [0.5,0.6,0.1] Theresulting output ywouldbe: 1 1 1 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: tanh y=ez e z ez+ e z (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: ReLU y= max(x,0) (7.6)

  9. 7.1 7.1 UNITS UNITS 3 3 Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 1 y= s (w x+ b) = y= s(w x+ b) = (7.4) (7.4) 1+ exp( (w x+ b)) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to Figure7.2 Let swalk through an examplejust to get an intuition. Let ssupposewehavea Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] b = 0.5 b = 0.5 An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with the following input x? x = [0.5,0.6,0.1] Theresulting output ywouldbe: What would thisunit dowiththefollowinginput vector: What wouldthisunit dowith thefollowing input vector: x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] Theresulting output ywouldbe: 1 1 1 1 1 1 y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly used asanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: y=ez e z ez+ e z tanh tanh y=ez e z ez+ e z (7.5) (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: y= max(x,0) ReLU ReLU (7.6) y= max(x,0) (7.6)

  10. 7.1 7.1 UNITS UNITS 3 3 Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 1 y= s (w x+ b) = y= s(w x+ b) = 7.1 (7.4) (7.4) 1+ exp( (w x+ b)) 1+ exp( (w x+ b)) UNITS 3 Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between 0 and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 y= s (w x+ b) = 1+ exp( (w x+ b)) Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: 1 (7.4) Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In this casetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa Figure7.2 Figure7.2 Let swalk through an examplejust to get an intuition. Let ssupposewehavea Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] b = 0.5 b = 0.5 Let swalk through an examplejust toget an intuition. 
Let ssupposewehavea unit with thefollowingweight vector andbias: An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with input x: x = [0.5,0.6,0.1] y= s (w x+ b) = w = [0.2,0.3,0.9] What would thisunit dowiththefollowinginput vector: What wouldthisunit dowith thefollowing input vector: b = 0.5 x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] What wouldthisunit dowith thefollowinginput vector: Theresulting output ywouldbe: Theresulting output ywouldbe: x = [0.5,0.6,0.1] 1 1 1 1 1 1 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Theresulting output ywouldbe: y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 1+ e 0.87= .70 1 1 1 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= Inpractice, thesigmoid isnot commonly used asanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: y=ez e z ez+ e z y=ez e z ez+ e z tanh tanh Inpractice, thesigmoid isnot commonly usedasanactivation function. A function tanh y=ez e z ez+ e z (7.5) (7.5) (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: y= max(x,0) y= max(x,0) ReLU Thesimplest activation function, and perhapsthemost commonly used, istherec- ReLU ReLU (7.6) y= max(x,0) (7.6) (7.6)

  11. 7.1 UNITS 3 7.1 7.1 UNITS UNITS 3 3 7.1 UNITS 3 Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2 into Eq. 7.3givesustheoutput of aneural unit: y= s (w x+ b) = 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoid functiontoresult inanumber between 0 and1. 1 1 (7.4) 1 y= s (w x+ b) = y= s(w x+ b) = 7.1 (7.4) (7.4) 1+ exp( (w x+ b)) 1+ exp( (w x+ b)) UNITS 3 (7.4) 1 y= s (w x+ b) = Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between 0 and1. 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 y= s (w x+ b) = 1+ exp( (w x+ b)) passestheresulting sumthroughasigmoid function toresult inanumber between 0 and 1. Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valueby aweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen 1 (7.4) A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. 
In this casetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector and bias: Figure7.2 Figure7.2 A neural unit, taking 3inputsx1, x2, andx3(and abiasbthat werepresent asa Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In this casetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. A neural unit, taking 3inputsx1, x2, andx3(and abiasb that werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient Figure7.2 Let swalk through an examplejust to get an intuition. Let ssupposewehavea Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: unit with thefollowingweight vector andbias: Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] b = 0.5 b = 0.5 Let swalk through an examplejust toget an intuition. Let ssupposewehavea unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] b = 0.5 b = 0.5 An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with input x: x = [0.5,0.6,0.1] Theresulting output ywouldbe: x = [0.5,0.6,0.1] w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] What would thisunit dowiththefollowinginput vector: What wouldthisunit dowith thefollowing input vector: b = 0.5 What would thisunit dowith thefollowing input vector: What wouldthisunit dowith thefollowing input vector: x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] What wouldthisunit dowith thefollowinginput vector: Theresulting output ywouldbe: x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] 1 1 1 1 1 1 Theresulting output ywouldbe: Theresulting output ywould be: Theresulting output ywouldbe: y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1 1+ e 0.87= .70 1+ e 0.87= .70 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 1+ e 0.87= .70 1 1 1 1 1 1 1 1 y= s(w x+ b) = y= s (w x+ b) = y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= Inpractice, thesigmoid isnot commonly used asanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: tanh isavariant of thesigmoid that rangesfrom -1to+1: tanh isavariant of thesigmoid that rangesfrom-1to+1: Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: y=ez e z ez+ e z y=ez e z ez+ e z ez+ e z ez+ e z tanh tanh Inpractice, thesigmoid isnot commonly usedasanactivation function. A function Inpractice, thesigmoid isnot commonly used asan activation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 
7.3a; that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; Inpractice, thesigmoid isnot commonly usedasanactivation function. A function tanh tanh tanh y=ez e z ez+ e z (7.5) (7.5) y=ez e z y=ez e z (7.5) (7.5) (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and 0otherwise: when xispositive, and0otherwise: Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: y= max(x,0) y= max(x,0) y= max(x,0) y= max(x,0) ReLU Thesimplest activation function, and perhapsthemost commonly used, istherec- Thesimplest activation function, and perhaps themost commonly used, istherec- Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx ReLU ReLU ReLU ReLU (7.6) y= max(x,0) (7.6) (7.6) (7.6) (7.6)

  12. 7.1 UNITS 3 Substituting Eq.7.2intoEq.7.3givesustheoutput of aneural unit: 1 y= s(w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2showsafinal schematic of abasic neural unit. Inthisexampletheunit takes3input valuesx1,x2, andx3, andcomputesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresultingsumthroughasigmoidfunctiontoresult inanumber between0 and1. 7.1 UNITS 3 Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 y= s (w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valueby aweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoid function toresult inanumber between 0 and 1. intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. Figure7.2 weight for aninput clampedat +1) andproducing anoutput y. Weincludesomeconvenient A neural unit, taking3inputsx1, x2,andx3(andabiasbthat werepresent asa Let swalk throughanexamplejust toget anintuition. Let ssupposewehavea unit withthefollowingweight vector andbias: w = [0.2,0.3,0.9] b = 0.5 A neural unit, taking 3inputsx1, x2, andx3(and abiasbthat werepresent asa What wouldthisunit dowiththefollowinginput vector: Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. Theresultingoutput ywouldbe: x = [0.5,0.6,0.1] Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector and bias: y= s(w x+ b) = 1 1 1 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 w = [0.2,0.3,0.9] b = 0.5 Inpractice, thesigmoidisnot commonly usedasanactivationfunction. A function that isvery similar but almost alwaysbetter isthetanh functionshowninFig. 7.3a; tanhisavariant of thesigmoidthat rangesfrom-1to+1: What would thisunit dowith thefollowing input vector: tanh x = [0.5,0.6,0.1] y=ez e z ez+ e z Theresulting output ywould be: (7.5) 1 1 1 y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= Thesimplest activation function, and perhapsthemost commonly used, istherec- 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly used asan activation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to +1: Non-Linear Activation Functions besides sigmoid tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasz whenzispositive,and0otherwise: Most Common: ReLU tanh y=ez e z ez+ e z y= max(z,0) (7.6) (7.5) Thesimplest activation function, and perhaps themost commonly used, istherec- tified linear unit, also called the ReLU, shown in Fig. 7.3b. It sjust the same as x when xispositive, and 0otherwise: ReLU y= max(x,0) (7.6) ReLU tanh Rectified Linear Unit 12

  13. Simple Neural Networks and Neural Language Models Units in Neural Networks

  14. Simple Neural Networks and Neural Language Models The XOR problem

  15. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. saturated vanishing gradient 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas The XOR problem the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are thetruth tablesfor thosefunctions: Minsky and Papert (1969) Can neural units compute simple functions of input? AND OR XOR x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 Thisexamplewasfirst shown for theperceptron, which isavery simpleneural unit that hasabinary output anddoesnot haveanon-linear activation function. The perceptron

  16. Perceptrons A very simple neural unit Binary output (0 or 1) No non-linear activation function

  17. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. valuesof zis1rather than very closeto 0. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high saturated saturated vanishing gradient vanishing gradient 7.2 TheXORproblem 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are Easy to build AND or OR with perceptrons works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. 
As areminder, here are thetruth tablesfor thosefunctions: thetruth tablesfor thosefunctions: AND OR XOR AND OR XOR x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 AND OR 1 1 Thisexamplewasfirst shown for theperceptron, which isavery simpleneural unit that hasabinary output anddoesnot haveanon-linear activation function. The unit that hasabinary output anddoesnot haveanon-linear activation function. The perceptron Thisexamplewasfirst shown for theperceptron, which isavery simpleneural perceptron

  18. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. valuesof zis1rather than very closeto 0. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high saturated saturated vanishing gradient vanishing gradient 7.2 TheXORproblem 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are Easy to build AND or OR with perceptrons works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. 
As areminder, here are thetruth tablesfor thosefunctions: thetruth tablesfor thosefunctions: AND OR XOR AND OR XOR x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 AND OR 1 1 Thisexamplewasfirst shown for theperceptron, which isavery simpleneural unit that hasabinary output anddoesnot haveanon-linear activation function. The unit that hasabinary output anddoesnot haveanon-linear activation function. The perceptron Thisexamplewasfirst shown for theperceptron, which isavery simpleneural perceptron

  19. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. valuesof zis1rather than very closeto 0. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high saturated saturated vanishing gradient vanishing gradient 7.2 TheXORproblem 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are Easy to build AND or OR with perceptrons works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. 

  20. Not possible to capture XOR with perceptrons
Pause the lecture and try for yourself!

  21. Why? Perceptrons are linear classifiers
The perceptron equation, given x1 and x2, is the equation of a line:
w1x1 + w2x2 + b = 0
(in standard linear format: x2 = (-w1/w2)x1 + (-b/w2))
This line acts as a decision boundary:
0 if input is on one side of the line
1 if on the other side of the line
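To make the linearity concrete, here is a minimal sketch in Python. The weight choice w1 = w2 = 1, b = -1.5 is illustrative (not from the slides); with it, the decision line x1 + x2 - 1.5 = 0 makes the perceptron compute AND:

```python
# A perceptron is just a thresholded line; these weights implement AND.
def perceptron(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    # Output 1 on the positive side of the decision line, 0 on the other side.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron(x1, x2))  # matches the AND truth table
```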

  22. Decision boundaries
[Figure: decision boundaries in the (x1, x2) plane for a) x1 AND x2, b) x1 OR x2, c) x1 XOR x2. For AND and OR a single line separates the 0s from the 1s; for XOR no such line exists (marked "?").]
XOR is not a linearly separable function!

  23. Solution to the XOR problem
XOR can't be calculated by a single perceptron
XOR can be calculated by a layered network of units.
[Figure: a two-layer network computing XOR. Inputs x1, x2 (plus a +1 bias node) feed two ReLU hidden units h1 and h2; all input weights are 1, with biases 0 (h1) and -1 (h2). The output unit y1 has weights 1 (from h1) and -2 (from h2) and bias 0.]
XOR truth table: (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0


  25. The hidden representation h
[Figure: the same XOR network beside two plots: a) the original x space, where the XOR points are not linearly separable, and b) the new (linearly separable) h space, where the inputs map to h = (0,0), (1,0), and (2,1) and a single line separates the classes.]
(With learning: hidden layers will learn to form useful representations)
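A minimal sketch that reads the weights off the network diagram above and checks the XOR truth table (NumPy is used for the matrix arithmetic):

```python
import numpy as np

W = np.array([[1.0, 1.0],    # weights into h1
              [1.0, 1.0]])   # weights into h2
b = np.array([0.0, -1.0])    # hidden biases
u = np.array([1.0, -2.0])    # output weights (output bias is 0)

def xor_net(x):
    h = np.maximum(0.0, W @ x + b)   # ReLU hidden layer: the representation h
    return u @ h                     # linear output unit y1

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_net(np.array(x, float)))   # prints 0.0, 1.0, 1.0, 0.0
```

Note how the hidden layer maps the four inputs onto only three points in h space, (0,0), (1,0), and (2,1), which is what makes the problem linearly separable there.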

  26. Simple Neural Networks and Neural Language Models The XOR problem

  27. Simple Neural Networks and Neural Language Models Feedforward Neural Networks

  28. Feedforward Neural Networks Can also be called multi-layer perceptrons (or MLPs) for historical reasons

  29. Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in counting layers!)
y = σ(w · x + b)
Output layer (σ node): y is a scalar
Weights w (vector), bias b (scalar)
Input layer: vector x = x1 ... xn, plus a +1 bias node
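A minimal sketch of this 1-layer binary network, y = σ(w · x + b); the particular weights and input here are illustrative values, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 0.3])   # weight vector
b = 0.1                          # scalar bias
x = np.array([1.0, 2.0, 0.0])    # input vector

y = sigmoid(w @ x + b)           # a single scalar in (0, 1)
print(y)
```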

  30. Multinomial Logistic Regression as a 1-layer Network
Fully connected single layer network
y = softmax(Wx + b)
Output layer (softmax nodes): y is a vector y1 ... yn
W is a matrix, b is a vector
Input layer: scalars x1 ... xn, plus a +1 bias node
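The same idea as a sketch, y = softmax(Wx + b), with 3 classes and 4 input features; all weight and input values are illustrative:

```python
import numpy as np

W = np.array([[ 0.2, -0.5,  0.1,  0.0],
              [ 0.4,  0.3, -0.2,  0.1],
              [-0.1,  0.2,  0.3, -0.4]])
b = np.array([0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 2.0, 0.5])

z = W @ x + b
y = np.exp(z - z.max())
y /= y.sum()                 # softmax: y is a probability vector
print(y, y.sum())            # the three entries sum to 1
```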

  31. Reminder: softmax: a generalization of sigmoid
For a vector z of dimensionality k, the softmax is:
softmax(z_i) = exp(z_i) / Σ_{j=1}^{k} exp(z_j)   for 1 ≤ i ≤ k
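A minimal sketch of this formula; shifting by max(z) is a standard numerical-stability trick that doesn't change the result, and the input vector is an illustrative stand-in for the slide's example:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max(z) to avoid overflow
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))               # k probabilities that sum to 1
```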

  32. Two-Layer Network with scalar output
y = σ(z), y is a scalar: Output layer (σ node)
z = u · h
Hidden units (σ node): h = g(Wx + b), where g could be ReLU or tanh
Input layer (vector x), plus a +1 bias node

  33. Two-Layer Network with scalar output
y = σ(z), y is a scalar: Output layer (σ node)
z = u · h
Hidden units h_j (σ nodes), connected to input unit i by weight W_ji, with bias vector b
Input layer (vector x), plus a +1 bias node


  35. Two-Layer Network with softmax output
y = softmax(z), y is a vector: Output layer (softmax node)
z = U h
Hidden units (σ node): h = g(Wx + b), where g could be ReLU or tanh
Input layer (vector x), plus a +1 bias node
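Putting the last few slides together, a sketch of the full two-layer forward pass h = g(Wx + b), z = Uh, y = softmax(z); the layer sizes and random weights are illustrative, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2
W = rng.normal(size=(n_hidden, n_in))
b = rng.normal(size=n_hidden)
U = rng.normal(size=(n_out, n_hidden))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x):
    h = np.maximum(0.0, W @ x + b)   # ReLU hidden layer (could be tanh)
    z = U @ h
    return softmax(z)                # y: vector of class probabilities

print(forward(rng.normal(size=n_in)))
```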

  36. Multi-layer Notation
y = a[2]
a[2] = g2(z[2])   (g2: sigmoid or softmax)
z[2] = W[2] a[1] + b[2]
a[1] = g1(z[1])   (g1: ReLU)
z[1] = W[1] a[0] + b[1]
a[0] = x   (the input x1 ... xn, plus a +1 bias node)
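This bracketed notation maps directly onto a loop over layers: z[i] = W[i] a[i-1] + b[i] and a[i] = g_i(z[i]), with a[0] = x. A minimal sketch, with an illustrative 2-layer instance (ReLU hidden layer, sigmoid output, random weights):

```python
import numpy as np

def forward(x, weights, biases, activations):
    a = x                                   # a[0] is the input
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b                       # z[i]
        a = g(z)                            # a[i]
    return a                                # y = a[last layer]

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = forward(rng.normal(size=4),
            [rng.normal(size=(3, 4)), rng.normal(size=(1, 3))],
            [rng.normal(size=3), rng.normal(size=1)],
            [relu, sigmoid])
print(y)
```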

  37. Multi-layer Notation
[Figure: a single unit with inputs x1, x2, x3 (plus a +1 bias node), weights w1, w2, w3 and bias b, computing the weighted sum z and activation a to produce output y.]

  38. Replacing the bias unit
Let's switch to a notation without the bias unit. Just a notational change:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So input layer a[0]_0 = 1, and a[1]_0 = 1, a[2]_0 = 1, ...

  39. Replacing the bias unit
Instead of:     x = x1, x2, ..., x_n0
We'll do this:  x = x0, x1, x2, ..., x_n0

  40. Replacing the bias unit
[Figure: the same network drawn two ways. Left: inputs x1 ... x_n0 plus a separate +1 node feeding hidden units h1 ... h_n1 through W and bias b, then outputs y1 ... y_n2 through U. Right: the bias is absorbed into W by adding the dummy input x0 = 1.]
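A minimal sketch of the bias-folding trick: prepend a dummy input x0 = 1 and make the bias the 0th column of W, so the augmented product equals Wx + b exactly. The weight and input values are illustrative:

```python
import numpy as np

W = np.array([[0.2, 0.3],
              [0.5, -0.1]])
b = np.array([0.4, -0.2])
x = np.array([1.0, 2.0])

W_aug = np.hstack([b[:, None], W])       # bias becomes column 0 of W
x_aug = np.concatenate([[1.0], x])       # dummy node x0 = 1

print(W @ x + b)        # [1.2  0.1]
print(W_aug @ x_aug)    # identical result, no separate bias term
```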

  41. Simple Neural Networks and Neural Language Models Feedforward Neural Networks

  42. Simple Neural Networks and Neural Language Models Applying feedforward networks to NLP tasks

  43. Use cases for feedforward networks
Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling
State of the art systems use more powerful neural architectures, but simple models are useful to consider!

  44. Classification: Sentiment Analysis
We could do exactly what we did with logistic regression:
Input layer: binary features, as before
Output layer: 0 or 1

  45. Sentiment Features

  46. Feedforward nets for simple classification
[Figure: logistic regression (features f1, f2, ..., fn feeding one output through W) beside a 2-layer feedforward network (the same features feeding a hidden layer through W, then the output through U).]
Just adding a hidden layer to logistic regression allows the network to use non-linear interactions between features, which may (or may not) improve performance.
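A sketch contrasting the two diagrams: logistic regression is one weighted sum of the features, while the 2-layer net inserts a ReLU hidden layer that can capture feature interactions. All weights here are illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=5)                        # feature vector f1..fn
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w, b = rng.normal(size=5), 0.0                # logistic regression
y_lr = sigmoid(w @ f + b)

W, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)   # hidden layer
u, b2 = rng.normal(size=3), 0.0                       # output layer
y_ff = sigmoid(u @ np.maximum(0.0, W @ f + b1) + b2)

print(y_lr, y_ff)   # both are probabilities in (0, 1)
```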

  47. Even better: representation learning
The real power of deep learning comes from the ability to learn features from the data:
Instead of using hand-built, human-engineered features for classification,
use learned representations like embeddings!
[Figure: embeddings e1, e2, ..., en as the input layer, feeding the network through W and U.]

  48. Neural Net Classification with embeddings as input features!

  49. Issue: texts come in different sizes
This assumes a fixed input length (3)! Kind of unrealistic.
Some simple solutions (more sophisticated solutions later):
1. Make the input the length of the longest review
   If shorter, pad with zero embeddings
   Truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words (see the sketch below)
   Take the mean of all the word embeddings
   Take the element-wise max of all the word embeddings: for each dimension, pick the max value from all words
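A sketch of option 2 above: collapse a variable number of word embeddings into one fixed-size "sentence embedding". The embeddings here are random stand-ins for real word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
words = rng.normal(size=(7, 50))      # 7 words, 50-dimensional embeddings

mean_pooled = words.mean(axis=0)      # average of all the word embeddings
max_pooled = words.max(axis=0)        # element-wise max per dimension

print(mean_pooled.shape, max_pooled.shape)   # both (50,) for any text length
```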

  50. Reminder: Multiclass Outputs
What if you have more than two output classes?
Add more output units (one for each class)
And use a softmax layer
[Figure: a network with inputs x1 ... xn, weights W into a hidden layer, then U into multiple softmax output units.]
