A Deep Dive into Neural Network Units and Language Models

 
Simple Neural Networks and Neural Language Models
Units in Neural Networks
 
 
This is in your brain
 
 
By BruceBlaus - Own work, CC BY 3.0,
https://commons.wikimedia.org/w/index.php?curid=28761830
Neural Network Unit
This is not in your brain
 
 
[Diagram: a neural unit taking 3 inputs x1, x2, x3 (and a bias b, represented as the weight for an input clamped at +1); weights w1, w2, w3 feed a weighted sum z, a non-linear transform, and an output value y]

Neural unit

Take a weighted sum of the inputs, plus a bias:

z = b + Σ_i w_i x_i

In vector notation, with a weight vector w, a scalar bias b, and an input vector x:

z = w · x + b

Instead of just using z, we'll apply a nonlinear activation function f:

y = a = f(z)
 
Non-Linear Activation Functions

Sigmoid

We've already seen the sigmoid for logistic regression:

y = σ(z) = 1 / (1 + e^(−z))

Final function the unit is computing:

y = σ(w · x + b) = 1 / (1 + exp(−(w · x + b)))

Final unit again
 
 
[Diagram: the same unit, now labeling the non-linear activation function: input layer, weights, bias (+1 node), weighted sum, non-linear activation function, output value]
An example

Suppose a unit has:
w = [0.2, 0.3, 0.9]
b = 0.5

What happens with the following input x?
x = [0.5, 0.6, 0.1]

The resulting output y would be:

y = σ(w · x + b) = 1 / (1 + e^(−(w · x + b)))
  = 1 / (1 + e^(−(0.5·0.2 + 0.6·0.3 + 0.1·0.9 + 0.5)))
  = 1 / (1 + e^(−0.87)) ≈ 0.70
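A minimal numpy sketch of this computation, using the weights, bias, and input above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])
b = 0.5
x = np.array([0.5, 0.6, 0.1])

z = w @ x + b      # weighted sum plus bias: 0.87
y = sigmoid(z)     # non-linear activation: ~0.70
print(z, y)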
 
 
Non-Linear Activation Functions besides sigmoid

tanh, a variant of the sigmoid that ranges from −1 to +1:

y = (e^z − e^(−z)) / (e^z + e^(−z))

ReLU (Rectified Linear Unit), the most common: just the same as z when z is positive, and 0 otherwise:

y = max(z, 0)

(In the sigmoid and tanh, very high values of z are saturated: extremely close to 1, with derivatives very close to 0, which causes the vanishing gradient problem during training. ReLU doesn't have this problem, since its derivative for high values of z is 1.)
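A minimal sketch of the three activation functions in numpy:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # maps to (0, 1)

def tanh(z):
    return np.tanh(z)             # maps to (-1, +1)

def relu(z):
    return np.maximum(z, 0)       # z if positive, else 0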
 
Simple Neural Networks and Neural Language Models
The XOR problem
 
 
The XOR problem

Can neural units compute simple functions of input, like AND, OR, and XOR?

Minsky and Papert (1969)

x1 x2 | AND  OR  XOR
 0  0 |  0    0    0
 0  1 |  0    1    1
 1  0 |  0    1    1
 1  1 |  1    1    0

Perceptrons

A very simple neural unit:
Binary output (0 or 1)
No non-linear activation function
 
Easy to build AND or OR with perceptrons

[Diagram: perceptrons computing AND and OR from inputs x1, x2 and a +1 bias node]
Not possible to capture XOR with perceptrons
 
Pause the lecture and try for yourself!
 
Why? Perceptrons are linear classifiers

The perceptron equation, given x1 and x2, is the equation of a line:

w1 x1 + w2 x2 + b = 0

(in standard linear format:  x2 = (−w1/w2) x1 + (−b/w2))

This line acts as a decision boundary:
0 if the input is on one side of the line
1 if it is on the other side of the line
 
Decision boundaries

[Figure: decision boundaries in the x1-x2 plane for a) x1 AND x2, b) x1 OR x2, c) x1 XOR x2; a single line separates the 0s from the 1s for AND and OR, but no single line can for XOR]

XOR is not a linearly separable function!
 
Solution to the XOR problem

XOR can't be calculated by a single perceptron.
XOR can be calculated by a layered network of units.

[Diagram: a 2-layer XOR network with ReLU hidden units h1 and h2: h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 − 1), output y = h1 − 2 h2]

The hidden representation h

(With learning: hidden layers will learn to form useful representations.)
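A minimal numpy sketch of this XOR network, with the weights from the diagram above:

import numpy as np

def relu(z):
    return np.maximum(z, 0)

W1 = np.array([[1, 1],    # h1 = ReLU(x1 + x2)
               [1, 1]])   # h2 = ReLU(x1 + x2 - 1)
b1 = np.array([0, -1])
W2 = np.array([1, -2])    # y = h1 - 2*h2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W1 @ np.array(x) + b1)   # hidden representation h
    y = W2 @ h                        # 0, 1, 1, 0: XOR!
    print(x, "->", y)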
 
Simple Neural Networks and Neural Language Models
Feedforward Neural Networks
 
 
Feedforward Neural Networks

Can also be called multi-layer perceptrons (or MLPs) for historical reasons
 
Binary Logistic Regression as a 1-layer Network

[Diagram: input layer, the vector x = x1 ... xn plus a +1 node; weight vector w = w1 ... wn and scalar bias b; output layer, a single σ node computing the scalar y = σ(w · x + b)]

(we don't count the input layer in counting layers!)
 
Multinomial Logistic Regression as a 1-layer Network

[Diagram: fully connected single-layer network. Input layer: scalars x1 ... xn plus a +1 node; W is a matrix and b is a vector; output layer: softmax nodes producing the vector y = y1 ... yn]

Reminder: softmax, a generalization of sigmoid

For a vector z of dimensionality k, the softmax is:

softmax(z_i) = e^(z_i) / Σ_{j=1}^{k} e^(z_j)    for 1 ≤ i ≤ k

Example: see the sketch below.
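A minimal softmax sketch (the input values here are just illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])  # illustrative values
print(softmax(z))        # each entry in (0, 1); the largest z gets the largest share
print(softmax(z).sum())  # 1.0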
 
Two-Layer Network with scalar output

[Diagram: input layer, the vector x = x1 ... xn plus a +1 node; weight matrix W and bias b into a layer of hidden units (σ nodes here, but they could be ReLU or tanh); weight vector U into an output layer with a single σ node; y is a scalar]

h = σ(W x + b)
y = σ(U h)

(W_ji is the weight from input unit i to hidden unit j)

Two-Layer Network with softmax output

[Diagram: the same architecture, but the output layer is a set of softmax nodes, so y is a vector]

h = σ(W x + b)
y = softmax(U h)
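A minimal forward-pass sketch of the two-layer network with softmax output (the layer sizes here are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # 3 inputs -> 4 hidden units
b = rng.normal(size=4)
U = rng.normal(size=(2, 4))   # 4 hidden units -> 2 output classes

x = np.array([0.5, 0.6, 0.1])
h = sigmoid(W @ x + b)   # hidden layer (could be ReLU or tanh)
y = softmax(U @ h)       # output distribution over the 2 classes
print(y, y.sum())        # probabilities summing to 1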
 
Multi-layer Notation

[Diagram: the same two-layer network with superscript layer indices: weights W[1] and bias b[1] into a ReLU hidden layer, then weights W[2] and bias b[2] into a sigmoid or softmax output layer]

z[1] = W[1] a[0] + b[1],   a[1] = ReLU(z[1])      (where a[0] = x)
z[2] = W[2] a[1] + b[2],   a[2] = sigmoid or softmax(z[2])
y = a[2]
 
Replacing the bias unit

Let's switch to a notation without the bias unit. Just a notational change:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So for the input layer a[0]_0 = 1, and a[1]_0 = 1, a[2]_0 = 1, ...

Instead of:
x = x1, x2, ..., x_n0

We'll do this:
x = x0, x1, x2, ..., x_n0

so that the column of weights multiplying the dummy x0 = 1 plays the role of the bias b.
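A minimal sketch of the trick (the shapes here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = np.array([0.5, 0.6, 0.1])

z_with_bias = W @ x + b                 # explicit bias term

x_aug = np.concatenate(([1.0], x))      # dummy node x0 = 1
W_aug = np.hstack([b[:, None], W])      # bias becomes the weights for x0
z_folded = W_aug @ x_aug

print(np.allclose(z_with_bias, z_folded))  # True: just a notational change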
 
Simple Neural Networks and Neural Language Models
Applying feedforward networks to NLP tasks
 
Use cases for feedforward networks

Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling

State-of-the-art systems use more powerful neural architectures, but simple models are useful to consider!
 
Classification: Sentiment Analysis

We could do exactly what we did with logistic regression:
The input layer holds binary features, as before
The output layer is a single σ node producing 0 or 1

Sentiment Features

[Figure: example hand-built sentiment features for the input layer]
Feedforward nets for simple classification

Just adding a hidden layer to logistic regression:
allows the network to use non-linear interactions between features,
which may (or may not) improve performance.

[Diagram: logistic regression (features feeding a σ output directly) next to a 2-layer feedforward network (features feeding a hidden layer, then a σ output)]
Even better: representation learning

The real power of deep learning comes from the ability to learn features from the data:
Instead of using hand-built human-engineered features for classification,
use learned representations like embeddings!

[Diagram: word embeddings e1, e2, ..., en feeding a hidden layer and a σ output]
Neural Net Classification with embeddings as input features!

[Diagram: the embeddings of 3 input words concatenated to form the input layer]

Issue: texts come in different sizes

This assumes a fixed input length (3 words)! Kind of unrealistic.
Some simple solutions (more sophisticated solutions later):
1. Make the input the length of the longest review
   If shorter, pad with zero embeddings
   Truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words:
   take the mean of all the word embeddings, or
   take the element-wise max of all the word embeddings (for each dimension, pick the max value from all words)
(see the pooling sketch below)
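A minimal pooling sketch (illustrative sizes: 4 words, 5-dimensional embeddings):

import numpy as np

rng = np.random.default_rng(0)
word_embs = rng.normal(size=(4, 5))     # one row per word

sent_mean = word_embs.mean(axis=0)      # mean of all word embeddings
sent_max = word_embs.max(axis=0)        # element-wise max
print(sent_mean.shape, sent_max.shape)  # both (5,): same size as one word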
 
Reminder: Multiclass Outputs

What if you have more than two output classes?
Add more output units (one for each class)
And use a "softmax layer"

[Diagram: inputs x1 ... xn, weights W into a hidden layer, weights U into a softmax output layer]
Neural Language Models (LMs)

Language Modeling: calculating the probability of the next word in a sequence given some history.
We've seen N-gram based LMs
But neural network LMs far outperform N-gram language models
State-of-the-art neural LMs are based on more powerful neural network technology like Transformers
But simple feedforward LMs can do almost as well!
Simple feedforward Neural Language Models

Task: predict the next word w_t given prior words w_(t-1), w_(t-2), w_(t-3), ...
Problem: now we're dealing with sequences of arbitrary length.
Solution: sliding windows (of fixed length)

Neural Language Model

[Diagram: a feedforward neural LM over a sliding window: the embeddings of the previous words are concatenated into the input layer, fed through a hidden layer, and a softmax over the vocabulary gives the probability of each possible next word]
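A minimal forward-pass sketch of such a feedforward LM (the vocabulary size, embedding size, window size, hidden size, and ReLU activation here are all illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, N, dh = 10, 8, 3, 16           # vocab, embedding dim, window, hidden
E = rng.normal(size=(V, d))          # embedding matrix, one row per word
W = rng.normal(size=(dh, N * d))
b = rng.normal(size=dh)
U = rng.normal(size=(V, dh))

context = [4, 7, 2]                          # ids of the 3 previous words
e = np.concatenate([E[w] for w in context])  # concatenated embeddings
h = np.maximum(W @ e + b, 0)                 # ReLU hidden layer
p = softmax(U @ h)                           # P(next word) over the vocab
print(p.argmax(), p.sum())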
Why Neural LMs work better than N-gram LMs

Training data:
We've seen:  "I have to make sure that the cat gets fed."
Never seen:  "dog gets fed"

Test data:
"I forgot to make sure that the dog gets ___"

An N-gram LM can't predict "fed"!
A neural LM can use the similarity of the "cat" and "dog" embeddings to generalize and predict "fed" after "dog".
 
 
 
 
 
 
Simple Neural Networks and Neural Language Models
Training Neural Nets: Overview
 
Intuition: training a 2-layer network

[Diagram: a 2-layer network (weights W and U) processing a training instance x1 ... xn]
Reminder: Loss Function for binary logistic regression

A measure of how far off the current answer is from the right answer.
Cross-entropy loss for logistic regression:

L_CE(ŷ, y) = −[ y log ŷ + (1 − y) log(1 − ŷ) ]
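A minimal sketch, plugging in the unit's output from the earlier example (ŷ ≈ 0.70):

import numpy as np

def cross_entropy(y_hat, y):
    # binary cross-entropy for one example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy(0.70, 1))  # ~0.36: right answer -> small loss
print(cross_entropy(0.70, 0))  # ~1.20: wrong answer -> large loss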
 
Reminder: gradient descent for weight updates

∂L_CE(ŷ, y)/∂w_j = (ŷ − y) x_j

Where did that derivative come from?

Using the chain rule!  f(x) = u(v(x))  ⇒  df/dx = du/dv · dv/dx

Intuition (see the text for details):

∂L/∂w_j = (derivative of the loss) × (derivative of the activation) × (derivative of the weighted sum)
 
How can I find that gradient for every weight in the network?

These derivatives on the prior slide only give the updates for one weight layer: the last one!
What about deeper networks?
Lots of layers, different activation functions?
Solution in the next lecture:
Even more use of the chain rule!!
Computation graphs and backward differentiation!
 
Simple Neural Networks and Neural Language Models
Computation Graphs and Backward Differentiation
 
 
Why Computation Graphs

For training, we need the derivative of the loss with respect to each weight in every layer of the network
But the loss is computed only at the very end of the network!
Solution: error backpropagation (Rumelhart, Hinton, Williams, 1986)
Backprop is a special case of backward differentiation
Which relies on computation graphs.
 
Computation Graphs

A computation graph represents the process of computing a mathematical expression

Example: L(a, b, c) = c (a + 2b)

Computations:
d = 2 b
e = a + d
L = c · e

[Figure: the computation graph, with input nodes a, b, c feeding intermediate nodes d and e and the output node L]

Example, forward pass with a = 3, b = 1, c = −2:

Computations:
d = 2 b = 2
e = a + d = 5
L = c · e = −10
Backwards differentiation in computation graphs

The importance of the computation graph comes from the backward pass.
This is used to compute the derivatives that we'll need for the weight update.

Example

We want: ∂L/∂a, ∂L/∂b, ∂L/∂c
The chain rule

Computing the derivative of a composite function:

f(x) = u(v(x))     ⇒  df/dx = du/dv · dv/dx

f(x) = u(v(w(x)))  ⇒  df/dx = du/dv · dv/dw · dw/dx
 
Example, backward pass:

∂L/∂c = e = 5
∂L/∂e = c = −2
∂L/∂d = ∂L/∂e · ∂e/∂d = c · 1 = −2
∂L/∂a = ∂L/∂e · ∂e/∂a = c · 1 = −2
∂L/∂b = ∂L/∂d · ∂d/∂b = −2 · 2 = −4
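A minimal sketch of the forward and backward passes on this graph, with the same values:

a, b, c = 3.0, 1.0, -2.0

# Forward pass
d = 2 * b        # 2
e = a + d        # 5
L = c * e        # -10

# Backward pass: chain rule from L back to the inputs
dL_dc = e              # 5
dL_de = c              # -2
dL_dd = dL_de * 1.0    # e = a + d, so de/dd = 1
dL_da = dL_de * 1.0    # de/da = 1
dL_db = dL_dd * 2.0    # d = 2b, so dd/db = 2
print(dL_da, dL_db, dL_dc)   # -2.0 -4.0 5.0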
Backward differentiation on a two-layer network

[Diagram: inputs x1, x2 (plus +1 nodes for the biases b[1] and b[2]); weights W[1] into a ReLU hidden layer; weights W[2] into a sigmoid output node producing y]

z[1] = W[1] x + b[1],   a[1] = ReLU(z[1])
z[2] = W[2] a[1] + b[2],   a[2] = σ(z[2]),   ŷ = a[2]
 
Summary

For training, we need the derivative of the loss with respect to weights in early layers of the network
But loss is computed only at the very end of the network!
Solution: backward differentiation
Given a computation graph and the derivatives of all the functions in it, we can automatically compute the derivative of the loss with respect to these early weights.
 


  1. Simple Neural Networks and Neural Language Models Units in Neural Networks

  2. This is in your brain By BruceBlaus - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=28761830 2

  3. Neural Network Unit This is not in your brain y Output value a Non-linear transform z Weighted sum bias Weights Input layer w1 w2 w3 b x1 x2 x3 +1 3

  4. 2 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 7.1 Units Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and producesanoutput. 2 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS At itsheart, aneural unit istaking aweighted sumof itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputsx1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sumzcan be represented as: z= b+ biasterm 7.1 Units X wixi (7.1) Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and i NEURAL NETWORKS AND NEURAL LANGUAGE MODELS producesanoutput. At itsheart, aneural unit istaking aweighted sumof itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputsx1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sumzcan be represented as: z= b+ z= w x+ b 2 CHAPTER 7 Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebrathat avector is, at heart, just alist or array of numbers. Thus vector we ll talk about zintermsof aweight vector w, ascalar biasb, andaninput vector x,andwe ll replacethesumwiththeconvenient dot product: 7.1 Units biasterm X Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and producesanoutput. At itsheart, aneural unit istaking aweighted sum of itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputs x1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sum zcan be represented as: X x,andwe ll replacethesumwiththeconvenient dot product: theactivation valuefor theunit, a. Sincewearejust modeling asingle unit, the activationforthenodeisinfactthefinal outputof thenetwork,whichwe ll generally call y. Sothevalueyisdefinedas: wixi (7.1) (7.2) i Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebrathat avector is, at heart, just alist or array of numbers. Thus we ll talk about zintermsof aweight vector w, ascalar biasb, andaninput vector apply anon-linear function f to z. Wewill refer to theoutput of this function as AsdefinedinEq.7.2,zisjust areal valuednumber. Finally, instead of using z, a linear function of x, as the output, neural units Neural unit biasterm vector Take weighted sum of inputs, plus a bias z= b+ wixi (7.1) activation i z= w x+ b (7.2) Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebra that avector is, at heart, just alist or array of numbers. Thus AsdefinedinEq. 7.2,zisjust areal valuednumber. y= a= f(z) vector Instead of just using z, we'll apply a nonlinear activation function f: we ll talk about zin termsof aweight vector w, ascalar biasb, and an input vector x, andwe ll replacethesumwiththeconvenient dot product: Finally, instead of using z, a linear function of x, as the output, neural units apply anon-linear function f to z. Wewill refer to theoutput of this function as theactivation valuefor theunit, a. Since wearejust modeling asingle unit, the activationforthenodeisinfactthefinal outputof thenetwork,whichwe ll generally call y. 
Sothevalueyisdefinedas: sigmoid functionsincewesawit inChapter 5: z= w x+ b (7.2) We ll discussthreepopular non-linear functions f() below (thesigmoid, thetanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the activation Asdefined inEq. 7.2, zisjust areal valued number. Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activationforthenodeisinfact thefinal output of thenetwork,whichwe ll generally call y. Sothevalueyisdefined as: We ll discussthreepopular non-linear functions f() below (thesigmoid, thetanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the sigmoid functionsincewesaw it inChapter 5: into therange[0,1], which isuseful in squashing outliers toward 0 or 1. And it s differentiable, whichaswesawinSection??will behandy for learning. sigmoid activation 1 y= a= f(z) y= s(z) = (7.3) 1+ e z Thesigmoid(showninFig. 7.1) hasanumber of advantages; it mapstheoutput y= a= f(z) sigmoid We ll discuss three popular non-linear functions f() below (the sigmoid, the tanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the sigmoid function sincewesaw it inChapter 5: y= s(z) = 1 (7.3) sigmoid 1+ e z 1 Thesigmoid(showninFig. 7.1) hasanumber of advantages; it mapstheoutput into therange[0,1], which isuseful in squashing outliers toward 0 or 1. And it s differentiable, whichaswesawinSection??will behandy for learning. y= s(z) = (7.3) 1+ e z Thesigmoid (showninFig. 7.1) hasanumber of advantages; it mapstheoutput into the range [0,1], which is useful in squashing outliers toward 0 or 1. And it s differentiable, which aswesaw inSection ??will behandy for learning. Figure7.1 nearly linear around0but outlier valuesget squashedtoward0or 1. The sigmoid function takesa real value and maps it to the range [0,1]. It is Figure7.1 nearly linear around 0but outlier valuesget squashed toward0or 1. nearly linear around0but outlier valuesget squashedtoward0or 1. The sigmoid function takes a real value and maps it to the range [0,1]. It is The sigmoid function takesa real value and maps it to the range [0,1]. It is Figure7.1

  5. 2 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 7.1 Units Thebuilding block of aneural network isasinglecomputational unit. A unit takes a set of real valued numbers as input, performs some computation on them, and producesanoutput. At itsheart, aneural unit istaking aweighted sum of itsinputs, with oneaddi- tional term in thesum called abiasterm. Given aset of inputs x1...xn, aunit has aset of corresponding weightsw1...wnand abiasb, so theweighted sum zcan be represented as: z= b+ biasterm X wixi (7.1) i Oftenit smoreconvenient toexpressthisweightedsumusingvector notation; recall from linear algebra that avector is, at heart, just alist or array of numbers. Thus we ll talk about zin termsof aweight vector w, ascalar biasb, and aninput vector x, andwe ll replacethesumwiththeconvenient dot product: vector z= w x+ b (7.2) Asdefined inEq. 7.2, zisjust areal valued number. Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activationforthenodeisinfactthefinal output of thenetwork,whichwe ll generally call y. Sothevalueyisdefinedas: activation Non-Linear Activation Functions y= a= f(z) We're already seen the sigmoid for logistic regression: We ll discuss three popular non-linear functions f() below (the sigmoid, thetanh, and the rectified linear ReLU) but it s pedagogically convenient to start with the sigmoid function sincewesaw it inChapter 5: sigmoid Sigmoid 1 y= s (z) = (7.3) 1+ e z Thesigmoid (showninFig. 7.1) hasanumber of advantages; it mapstheoutput into the range [0,1], which is useful in squashing outliers toward 0 or 1. And it s differentiable, whichaswesaw inSection ??will behandy for learning. 5 Figure7.1 nearly linear around0but outlier valuesget squashed toward0or 1. The sigmoid function takes a real value and maps it to the range [0,1]. It is

  6. Final function the unit is computing 7.1 UNITS 3 Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 y= s(w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 and1. Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] b = 0.5 What wouldthisunit dowith thefollowinginput vector: x = [0.5,0.6,0.1] Theresulting output ywouldbe: 1 1 1 y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: tanh y=ez e z ez+ e z (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: ReLU y= max(x,0) (7.6)

  7. Final unit again y Output value a Non-linear activation function z Weighted sum bias Weights Input layer w1 w2 w3 b x1 x2 x3 +1 7

  8. 7.1 UNITS 3 Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: 1 y= s (w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: w = [0.2,0.3,0.9] b = 0.5 An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with input x: x = [0.5,0.6,0.1] What would thisunit dowiththefollowinginput vector: x = [0.5,0.6,0.1] Theresulting output ywouldbe: 1 1 1 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: tanh y=ez e z ez+ e z (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: ReLU y= max(x,0) (7.6)

  9. 7.1 7.1 UNITS UNITS 3 3 Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 1 y= s (w x+ b) = y= s(w x+ b) = (7.4) (7.4) 1+ exp( (w x+ b)) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to Figure7.2 Let swalk through an examplejust to get an intuition. Let ssupposewehavea Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] b = 0.5 b = 0.5 An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with the following input x? x = [0.5,0.6,0.1] Theresulting output ywouldbe: What would thisunit dowiththefollowinginput vector: What wouldthisunit dowith thefollowing input vector: x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] Theresulting output ywouldbe: 1 1 1 1 1 1 y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly used asanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: y=ez e z ez+ e z tanh tanh y=ez e z ez+ e z (7.5) (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: y= max(x,0) ReLU ReLU (7.6) y= max(x,0) (7.6)

  10. 7.1 7.1 UNITS UNITS 3 3 Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 1 y= s (w x+ b) = y= s(w x+ b) = 7.1 (7.4) (7.4) 1+ exp( (w x+ b)) 1+ exp( (w x+ b)) UNITS 3 Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between 0 and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 y= s (w x+ b) = 1+ exp( (w x+ b)) Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: 1 (7.4) Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In this casetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa Figure7.2 Figure7.2 Let swalk through an examplejust to get an intuition. Let ssupposewehavea Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] b = 0.5 b = 0.5 Let swalk through an examplejust toget an intuition. 
Let ssupposewehavea unit with thefollowingweight vector andbias: An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with input x: x = [0.5,0.6,0.1] y= s (w x+ b) = w = [0.2,0.3,0.9] What would thisunit dowiththefollowinginput vector: What wouldthisunit dowith thefollowing input vector: b = 0.5 x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] What wouldthisunit dowith thefollowinginput vector: Theresulting output ywouldbe: Theresulting output ywouldbe: x = [0.5,0.6,0.1] 1 1 1 1 1 1 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 Theresulting output ywouldbe: y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 1+ e 0.87= .70 1 1 1 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= Inpractice, thesigmoid isnot commonly used asanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: y=ez e z ez+ e z y=ez e z ez+ e z tanh tanh Inpractice, thesigmoid isnot commonly usedasanactivation function. A function tanh y=ez e z ez+ e z (7.5) (7.5) (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: y= max(x,0) y= max(x,0) ReLU Thesimplest activation function, and perhapsthemost commonly used, istherec- ReLU ReLU (7.6) y= max(x,0) (7.6) (7.6)

  11. 7.1 UNITS 3 7.1 7.1 UNITS UNITS 3 3 7.1 UNITS 3 Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: Substituting Eq. 7.2 into Eq. 7.3givesustheoutput of aneural unit: y= s (w x+ b) = 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoid functiontoresult inanumber between 0 and1. 1 1 (7.4) 1 y= s (w x+ b) = y= s(w x+ b) = 7.1 (7.4) (7.4) 1+ exp( (w x+ b)) 1+ exp( (w x+ b)) UNITS 3 (7.4) 1 y= s (w x+ b) = Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between0 and1. and1. Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunctiontoresult inanumber between 0 and1. 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoidfunction toresult inanumber between 0 y= s (w x+ b) = 1+ exp( (w x+ b)) passestheresulting sumthroughasigmoid function toresult inanumber between 0 and 1. Substituting Eq. 7.2intoEq. 7.3givesustheoutput of aneural unit: Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valueby aweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen 1 (7.4) A neural unit, taking3inputsx1, x2, andx3(andabiasbthat werepresent asa A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to A neural unit, taking 3inputsx1, x2, andx3(andabiasbthat werepresent asa intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. 
In this casetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector and bias: Figure7.2 Figure7.2 A neural unit, taking 3inputsx1, x2, andx3(and abiasbthat werepresent asa Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In this casetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leaving aastheactivationof anindividual node. A neural unit, taking 3inputsx1, x2, andx3(and abiasb that werepresent asa weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient Figure7.2 Let swalk through an examplejust to get an intuition. Let ssupposewehavea Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: unit with thefollowingweight vector andbias: Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector andbias: w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] b = 0.5 b = 0.5 Let swalk through an examplejust toget an intuition. Let ssupposewehavea unit with thefollowingweight vector andbias: w = [0.2,0.3,0.9] b = 0.5 b = 0.5 An example Suppose a unit has: w = [0.2,0.3,0.9] b = 0.5 What happens with input x: x = [0.5,0.6,0.1] Theresulting output ywouldbe: x = [0.5,0.6,0.1] w = [0.2,0.3,0.9] w = [0.2,0.3,0.9] What would thisunit dowiththefollowinginput vector: What wouldthisunit dowith thefollowing input vector: b = 0.5 What would thisunit dowith thefollowing input vector: What wouldthisunit dowith thefollowing input vector: x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] What wouldthisunit dowith thefollowinginput vector: Theresulting output ywouldbe: x = [0.5,0.6,0.1] x = [0.5,0.6,0.1] 1 1 1 1 1 1 Theresulting output ywouldbe: Theresulting output ywould be: Theresulting output ywouldbe: y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1 1+ e 0.87= .70 1+ e 0.87= .70 y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 1+ e 0.87= .70 1 1 1 1 1 1 1 1 y= s(w x+ b) = y= s (w x+ b) = y= s(w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= Inpractice, thesigmoid isnot commonly used asanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to+1: tanh isavariant of thesigmoid that rangesfrom -1to+1: tanh isavariant of thesigmoid that rangesfrom-1to+1: Inpractice, thesigmoid isnot commonly usedasanactivation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanhisavariant of thesigmoid that rangesfrom-1to+1: y=ez e z ez+ e z y=ez e z ez+ e z ez+ e z ez+ e z tanh tanh Inpractice, thesigmoid isnot commonly usedasanactivation function. A function Inpractice, thesigmoid isnot commonly used asan activation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 
7.3a; that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; Inpractice, thesigmoid isnot commonly usedasanactivation function. A function tanh tanh tanh y=ez e z ez+ e z (7.5) (7.5) y=ez e z y=ez e z (7.5) (7.5) (7.5) Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and0otherwise: tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesame asx when xispositive, and 0otherwise: when xispositive, and0otherwise: Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx whenxispositive, and0otherwise: y= max(x,0) y= max(x,0) y= max(x,0) y= max(x,0) ReLU Thesimplest activation function, and perhapsthemost commonly used, istherec- Thesimplest activation function, and perhaps themost commonly used, istherec- Thesimplest activation function, and perhapsthemost commonly used, istherec- tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasx ReLU ReLU ReLU ReLU (7.6) y= max(x,0) (7.6) (7.6) (7.6) (7.6)

  12. 7.1 UNITS 3 Substituting Eq.7.2intoEq.7.3givesustheoutput of aneural unit: 1 y= s(w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2showsafinal schematic of abasic neural unit. Inthisexampletheunit takes3input valuesx1,x2, andx3, andcomputesaweighted sum, multiplying each valuebyaweight (w1,w2,andw3,respectively),addsthemtoabiastermb,andthen passestheresultingsumthroughasigmoidfunctiontoresult inanumber between0 and1. 7.1 UNITS 3 Substituting Eq. 7.2into Eq. 7.3givesustheoutput of aneural unit: 1 y= s (w x+ b) = (7.4) 1+ exp( (w x+ b)) Fig. 7.2 showsafinal schematic of abasic neural unit. In thisexampletheunit takes3 input valuesx1,x2, and x3, and computesaweighted sum, multiplying each valueby aweight (w1,w2,andw3,respectively), addsthemtoabiastermb,andthen passestheresulting sumthroughasigmoid function toresult inanumber between 0 and 1. intermediatevariables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networkswe ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. Figure7.2 weight for aninput clampedat +1) andproducing anoutput y. Weincludesomeconvenient A neural unit, taking3inputsx1, x2,andx3(andabiasbthat werepresent asa Let swalk throughanexamplejust toget anintuition. Let ssupposewehavea unit withthefollowingweight vector andbias: w = [0.2,0.3,0.9] b = 0.5 A neural unit, taking 3inputsx1, x2, andx3(and abiasbthat werepresent asa What wouldthisunit dowiththefollowinginput vector: Figure7.2 weight for an input clamped at +1) and producing an output y. Weincludesomeconvenient intermediate variables: theoutput of thesummation, z, and theoutput of thesigmoid, a. In thiscasetheoutput of theunit y isthesameasa, but in deeper networks we ll reservey to meanthefinal output of theentirenetwork, leavingaastheactivationof anindividual node. Theresultingoutput ywouldbe: x = [0.5,0.6,0.1] Let swalk through an examplejust to get an intuition. Let ssupposewehavea unit with thefollowing weight vector and bias: y= s(w x+ b) = 1 1 1 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= 1+ e 0.87= .70 w = [0.2,0.3,0.9] b = 0.5 Inpractice, thesigmoidisnot commonly usedasanactivationfunction. A function that isvery similar but almost alwaysbetter isthetanh functionshowninFig. 7.3a; tanhisavariant of thesigmoidthat rangesfrom-1to+1: What would thisunit dowith thefollowing input vector: tanh x = [0.5,0.6,0.1] y=ez e z ez+ e z Theresulting output ywould be: (7.5) 1 1 1 y= s (w x+ b) = 1+ e (w x+b)= 1+ e (.5 .2+.6 .3+.1 .9+.5)= Thesimplest activation function, and perhapsthemost commonly used, istherec- 1+ e 0.87= .70 Inpractice, thesigmoid isnot commonly used asan activation function. A function that isvery similar but almost alwaysbetter isthetanh function showninFig. 7.3a; tanh isavariant of thesigmoid that rangesfrom-1to +1: Non-Linear Activation Functions besides sigmoid tified linear unit, also called theReLU, shown in Fig. 7.3b. It sjust thesameasz whenzispositive,and0otherwise: Most Common: ReLU tanh y=ez e z ez+ e z y= max(z,0) (7.6) (7.5) Thesimplest activation function, and perhaps themost commonly used, istherec- tified linear unit, also called the ReLU, shown in Fig. 7.3b. It sjust the same as x when xispositive, and 0otherwise: ReLU y= max(x,0) (7.6) ReLU tanh Rectified Linear Unit 12

  13. Simple Neural Networks and Neural Language Models Units in Neural Networks

  14. Simple Neural Networks and Neural Language Models The XOR problem

  15. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. saturated vanishing gradient 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas The XOR problem the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are thetruth tablesfor thosefunctions: Minsky and Papert (1969) Can neural units compute simple functions of input? AND OR XOR x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 Thisexamplewasfirst shown for theperceptron, which isavery simpleneural unit that hasabinary output anddoesnot haveanon-linear activation function. The perceptron

  16. Perceptrons A very simple neural unit Binary output (0 or 1) No non-linear activation function

  17. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. valuesof zis1rather than very closeto 0. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high saturated saturated vanishing gradient vanishing gradient 7.2 TheXORproblem 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are Easy to build AND or OR with perceptrons works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. 
As areminder, here are thetruth tablesfor thosefunctions: thetruth tablesfor thosefunctions: AND OR XOR AND OR XOR x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 AND OR 1 1 Thisexamplewasfirst shown for theperceptron, which isavery simpleneural unit that hasabinary output anddoesnot haveanon-linear activation function. The unit that hasabinary output anddoesnot haveanon-linear activation function. The perceptron Thisexamplewasfirst shown for theperceptron, which isavery simpleneural perceptron

  18. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. valuesof zis1rather than very closeto 0. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high saturated saturated vanishing gradient vanishing gradient 7.2 TheXORproblem 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are Easy to build AND or OR with perceptrons works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. 
As areminder, here are thetruth tablesfor thosefunctions: thetruth tablesfor thosefunctions: AND OR XOR AND OR XOR x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 x1 x2 y 0 0 0 1 1 0 1 1 x1 x2 y 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 AND OR 1 1 Thisexamplewasfirst shown for theperceptron, which isavery simpleneural unit that hasabinary output anddoesnot haveanon-linear activation function. The unit that hasabinary output anddoesnot haveanon-linear activation function. The perceptron Thisexamplewasfirst shown for theperceptron, which isavery simpleneural perceptron

  19. 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS 4 CHAPTER 7 NEURAL NETWORKS AND NEURAL LANGUAGE MODELS (a) (b) (a) (b) Figure7.3 Thetanh and ReLU activationfunctions. Figure7.3 Thetanh and ReLU activationfunctions. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high valuesof zis1rather than very closeto 0. valuesof zis1rather than very closeto 0. These activation functions have different properties that make them useful for different languageapplicationsornetwork architectures. Forexample, thetanhfunc- tion has the nice properties of being smoothly differentiable and mapping outlier valuestowardthemean. Therectifier function, ontheother handhasniceproperties that result from it being very close to linear. In thesigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and havederivativesvery close to 0. Zero derivativescause problems for learning, because as we ll see in Section 7.4, we ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradientsthat arealmost 0causetheerror signal toget smaller andsmaller until it istoo small tobeused for training, aproblem called thevanishing gradient problem. Rectifiers don t havethis problem, since thederivativeof ReLU for high saturated saturated vanishing gradient vanishing gradient 7.2 TheXORproblem 7.2 TheXORproblem Early in thehistory of neural networksit wasrealized that thepower of neural net- Early in thehistory of neural networksit wasrealized that thepower of neural net- works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. As areminder, here are Easy to build AND or OR with perceptrons works, as with the real neurons that inspired them, comes from combining these unitsinto larger networks. Oneof themost clever demonstrations of theneed for multi-layer networkswas the proof by Minsky and Papert (1969) that a single neural unit cannot compute somevery simplefunctionsof itsinput. Consider thetask of computing elementary logical functions of two inputs, likeAND, OR, and XOR. 

  20. Not possible to capture XOR with perceptrons
Pause the lecture and try for yourself!

  21. Why? Perceptrons are linear classifiers
The perceptron equation, given x1 and x2, is the equation of a line:
w1x1 + w2x2 + b = 0
(in standard linear format: x2 = (-w1/w2)x1 + (-b/w2))
This line acts as a decision boundary:
0 if input is on one side of the line
1 if on the other side of the line
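To make the linearity concrete, here is a minimal sketch in Python. The weight choice w1 = w2 = 1, b = -1.5 is illustrative (not from the slides); with it, the decision line x1 + x2 - 1.5 = 0 makes the perceptron compute AND:

```python
# A perceptron is just a thresholded line; these weights implement AND.
def perceptron(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    # Output 1 on the positive side of the decision line, 0 on the other side.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron(x1, x2))  # matches the AND truth table
```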

  22. Decision boundaries
[Figure: decision boundaries in the (x1, x2) plane for a) x1 AND x2, b) x1 OR x2, c) x1 XOR x2. For AND and OR a single line separates the 0s from the 1s; for XOR no such line exists (marked "?").]
XOR is not a linearly separable function!

  23. Solution to the XOR problem
XOR can't be calculated by a single perceptron
XOR can be calculated by a layered network of units.
[Figure: a two-layer network computing XOR. Inputs x1, x2 (plus a +1 bias node) feed two ReLU hidden units h1 and h2; all input weights are 1, with biases 0 (h1) and -1 (h2). The output unit y1 has weights 1 (from h1) and -2 (from h2) and bias 0.]
XOR truth table: (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0


  25. The hidden representation h
[Figure: the same XOR network beside two plots: a) the original x space, where the XOR points are not linearly separable, and b) the new (linearly separable) h space, where the inputs map to h = (0,0), (1,0), and (2,1) and a single line separates the classes.]
(With learning: hidden layers will learn to form useful representations)
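A minimal sketch that reads the weights off the network diagram above and checks the XOR truth table (NumPy is used for the matrix arithmetic):

```python
import numpy as np

W = np.array([[1.0, 1.0],    # weights into h1
              [1.0, 1.0]])   # weights into h2
b = np.array([0.0, -1.0])    # hidden biases
u = np.array([1.0, -2.0])    # output weights (output bias is 0)

def xor_net(x):
    h = np.maximum(0.0, W @ x + b)   # ReLU hidden layer: the representation h
    return u @ h                     # linear output unit y1

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_net(np.array(x, float)))   # prints 0.0, 1.0, 1.0, 0.0
```

Note how the hidden layer maps the four inputs onto only three points in h space, (0,0), (1,0), and (2,1), which is what makes the problem linearly separable there.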

  26. Simple Neural Networks and Neural Language Models The XOR problem

  27. Simple Neural Networks and Neural Language Models Feedforward Neural Networks

  28. Feedforward Neural Networks Can also be called multi-layer perceptrons (or MLPs) for historical reasons

  29. Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in counting layers!)
y = σ(w · x + b)
Output layer (σ node): y is a scalar
Weights w (vector), bias b (scalar)
Input layer: vector x = x1 ... xn, plus a +1 bias node
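A minimal sketch of this 1-layer binary network, y = σ(w · x + b); the particular weights and input here are illustrative values, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 0.3])   # weight vector
b = 0.1                          # scalar bias
x = np.array([1.0, 2.0, 0.0])    # input vector

y = sigmoid(w @ x + b)           # a single scalar in (0, 1)
print(y)
```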

  30. Multinomial Logistic Regression as a 1-layer Network
Fully connected single layer network
y = softmax(Wx + b)
Output layer (softmax nodes): y is a vector y1 ... yn
W is a matrix, b is a vector
Input layer: scalars x1 ... xn, plus a +1 bias node
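The same idea as a sketch, y = softmax(Wx + b), with 3 classes and 4 input features; all weight and input values are illustrative:

```python
import numpy as np

W = np.array([[ 0.2, -0.5,  0.1,  0.0],
              [ 0.4,  0.3, -0.2,  0.1],
              [-0.1,  0.2,  0.3, -0.4]])
b = np.array([0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 2.0, 0.5])

z = W @ x + b
y = np.exp(z - z.max())
y /= y.sum()                 # softmax: y is a probability vector
print(y, y.sum())            # the three entries sum to 1
```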

  31. Reminder: softmax: a generalization of sigmoid
For a vector z of dimensionality k, the softmax is:
softmax(z_i) = exp(z_i) / Σ_{j=1}^{k} exp(z_j)   for 1 ≤ i ≤ k
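A minimal sketch of this formula; shifting by max(z) is a standard numerical-stability trick that doesn't change the result, and the input vector is an illustrative stand-in for the slide's example:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max(z) to avoid overflow
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))               # k probabilities that sum to 1
```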

  32. Two-Layer Network with scalar output
y = σ(z), y is a scalar: Output layer (σ node)
z = u · h
Hidden units (σ node): h = g(Wx + b), where g could be ReLU or tanh
Input layer (vector x), plus a +1 bias node

  33. Two-Layer Network with scalar output
y = σ(z), y is a scalar: Output layer (σ node)
z = u · h
Hidden units h_j (σ nodes), connected to input unit i by weight W_ji, with bias vector b
Input layer (vector x), plus a +1 bias node


  35. Two-Layer Network with softmax output
y = softmax(z), y is a vector: Output layer (softmax node)
z = U h
Hidden units (σ node): h = g(Wx + b), where g could be ReLU or tanh
Input layer (vector x), plus a +1 bias node
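Putting the last few slides together, a sketch of the full two-layer forward pass h = g(Wx + b), z = Uh, y = softmax(z); the layer sizes and random weights are illustrative, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2
W = rng.normal(size=(n_hidden, n_in))
b = rng.normal(size=n_hidden)
U = rng.normal(size=(n_out, n_hidden))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x):
    h = np.maximum(0.0, W @ x + b)   # ReLU hidden layer (could be tanh)
    z = U @ h
    return softmax(z)                # y: vector of class probabilities

print(forward(rng.normal(size=n_in)))
```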

  36. Multi-layer Notation
y = a[2]
a[2] = g2(z[2])   (g2: sigmoid or softmax)
z[2] = W[2] a[1] + b[2]
a[1] = g1(z[1])   (g1: ReLU)
z[1] = W[1] a[0] + b[1]
a[0] = x   (the input x1 ... xn, plus a +1 bias node)
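This bracketed notation maps directly onto a loop over layers: z[i] = W[i] a[i-1] + b[i] and a[i] = g_i(z[i]), with a[0] = x. A minimal sketch, with an illustrative 2-layer instance (ReLU hidden layer, sigmoid output, random weights):

```python
import numpy as np

def forward(x, weights, biases, activations):
    a = x                                   # a[0] is the input
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b                       # z[i]
        a = g(z)                            # a[i]
    return a                                # y = a[last layer]

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = forward(rng.normal(size=4),
            [rng.normal(size=(3, 4)), rng.normal(size=(1, 3))],
            [rng.normal(size=3), rng.normal(size=1)],
            [relu, sigmoid])
print(y)
```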

  37. Multi-layer Notation
[Figure: a single unit with inputs x1, x2, x3 (plus a +1 bias node), weights w1, w2, w3 and bias b, computing the weighted sum z and activation a to produce output y.]

  38. Replacing the bias unit
Let's switch to a notation without the bias unit. Just a notational change:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So input layer a[0]_0 = 1, and a[1]_0 = 1, a[2]_0 = 1, ...

  39. Replacing the bias unit
Instead of:     x = x1, x2, ..., x_n0
We'll do this:  x = x0, x1, x2, ..., x_n0

  40. Replacing the bias unit
[Figure: the same network drawn two ways. Left: inputs x1 ... x_n0 plus a separate +1 node feeding hidden units h1 ... h_n1 through W and bias b, then outputs y1 ... y_n2 through U. Right: the bias is absorbed into W by adding the dummy input x0 = 1.]
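A minimal sketch of the bias-folding trick: prepend a dummy input x0 = 1 and make the bias the 0th column of W, so the augmented product equals Wx + b exactly. The weight and input values are illustrative:

```python
import numpy as np

W = np.array([[0.2, 0.3],
              [0.5, -0.1]])
b = np.array([0.4, -0.2])
x = np.array([1.0, 2.0])

W_aug = np.hstack([b[:, None], W])       # bias becomes column 0 of W
x_aug = np.concatenate([[1.0], x])       # dummy node x0 = 1

print(W @ x + b)        # [1.2  0.1]
print(W_aug @ x_aug)    # identical result, no separate bias term
```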

  41. Simple Neural Networks and Neural Language Models Feedforward Neural Networks

  42. Simple Neural Networks and Neural Language Models Applying feedforward networks to NLP tasks

  43. Use cases for feedforward networks
Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling
State of the art systems use more powerful neural architectures, but simple models are useful to consider!

  44. Classification: Sentiment Analysis
We could do exactly what we did with logistic regression:
Input layer: binary features, as before
Output layer: 0 or 1

  45. Sentiment Features

  46. Feedforward nets for simple classification
[Figure: logistic regression (features f1, f2, ..., fn feeding one output through W) beside a 2-layer feedforward network (the same features feeding a hidden layer through W, then the output through U).]
Just adding a hidden layer to logistic regression allows the network to use non-linear interactions between features, which may (or may not) improve performance.
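A sketch contrasting the two diagrams: logistic regression is one weighted sum of the features, while the 2-layer net inserts a ReLU hidden layer that can capture feature interactions. All weights here are illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=5)                        # feature vector f1..fn
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w, b = rng.normal(size=5), 0.0                # logistic regression
y_lr = sigmoid(w @ f + b)

W, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)   # hidden layer
u, b2 = rng.normal(size=3), 0.0                       # output layer
y_ff = sigmoid(u @ np.maximum(0.0, W @ f + b1) + b2)

print(y_lr, y_ff)   # both are probabilities in (0, 1)
```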

  47. Even better: representation learning
The real power of deep learning comes from the ability to learn features from the data:
Instead of using hand-built, human-engineered features for classification,
use learned representations like embeddings!
[Figure: embeddings e1, e2, ..., en as the input layer, feeding the network through W and U.]

  48. Neural Net Classification with embeddings as input features!

  49. Issue: texts come in different sizes
This assumes a fixed input length (3)! Kind of unrealistic.
Some simple solutions (more sophisticated solutions later):
1. Make the input the length of the longest review
   If shorter, pad with zero embeddings
   Truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words (see the sketch below)
   Take the mean of all the word embeddings
   Take the element-wise max of all the word embeddings: for each dimension, pick the max value from all words
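A sketch of option 2 above: collapse a variable number of word embeddings into one fixed-size "sentence embedding". The embeddings here are random stand-ins for real word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
words = rng.normal(size=(7, 50))      # 7 words, 50-dimensional embeddings

mean_pooled = words.mean(axis=0)      # average of all the word embeddings
max_pooled = words.max(axis=0)        # element-wise max per dimension

print(mean_pooled.shape, max_pooled.shape)   # both (50,) for any text length
```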

  50. Reminder: Multiclass Outputs
What if you have more than two output classes?
Add more output units (one for each class)
And use a softmax layer
[Figure: a network with inputs x1 ... xn, weights W into a hidden layer, then U into multiple softmax output units.]
