A Deep Dive into Neural Network Units and Language Models
Explore the fundamentals of neural network units in language models, discussing computation, weights, biases, and activations. Understand the essence of weighted sums in neural networks and the application of non-linear activation functions like sigmoid, tanh, and ReLU. Dive into the heart of neural networks to grasp the building blocks and principles behind them.
Simple Neural Networks and Neural Language Models: Units in Neural Networks
This is in your brain. (Image: BruceBlaus, Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=28761830)
Neural Network Unit (this is not in your brain)

(Slide diagram: a single unit with input layer x1, x2, x3 plus a +1 node, weights w1, w2, w3 and bias b, a weighted sum z, a non-linear transform producing a, and an output value y.)
7.1 Units

The building block of a neural network is a single computational unit. A unit takes a set of real-valued numbers as input, performs some computation on them, and produces an output.

At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term. Given a set of inputs x1 ... xn, a unit has a set of corresponding weights w1 ... wn and a bias b, so the weighted sum z can be represented as:

z = b + Σᵢ wᵢxᵢ    (7.1)

Often it's more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is, at heart, just a list or array of numbers. Thus we'll talk about z in terms of a weight vector w, a scalar bias b, and an input vector x, and we'll replace the sum with the convenient dot product:

z = w · x + b    (7.2)

As defined in Eq. 7.2, z is just a real-valued number. Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activation for the node is in fact the final output of the network, which we'll generally call y. So the value y is defined as:

y = a = f(z)

We'll discuss three popular non-linear functions f() below (the sigmoid, the tanh, and the rectified linear ReLU), but it's pedagogically convenient to start with the sigmoid function since we saw it in Chapter 5:

y = σ(z) = 1 / (1 + e^(−z))    (7.3)

The sigmoid (shown in Fig. 7.1) has a number of advantages: it maps the output into the range [0, 1], which is useful in squashing outliers toward 0 or 1, and it's differentiable, which will be handy for learning.

Figure 7.1: The sigmoid function takes a real value and maps it to the range [0, 1]. It is nearly linear around 0, but outlier values get squashed toward 0 or 1.
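To make Eqs. 7.1 and 7.2 concrete, here is a minimal NumPy sketch of a single sigmoid unit; the particular input, weight, and bias values are arbitrary illustrations, not taken from the text:

```python
import numpy as np

def weighted_sum_loop(x, w, b):
    # Eq. 7.1: z = b + sum_i w_i * x_i
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

def weighted_sum_dot(x, w, b):
    # Eq. 7.2: the same quantity written as a dot product, z = w . x + b
    return np.dot(w, x) + b

def sigmoid(z):
    # Eq. 7.3: squashes any real z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # arbitrary example inputs
w = np.array([0.5, -0.2, 0.1])   # arbitrary example weights
b = 0.3                          # arbitrary example bias

z = weighted_sum_dot(x, w, b)
assert np.isclose(z, weighted_sum_loop(x, w, b))  # Eq. 7.1 and Eq. 7.2 agree
y = sigmoid(z)                   # activation a, which is also the output y here
print(z, y)
```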
Non-Linear Activation Functions

We've already seen the sigmoid for logistic regression:

y = σ(z) = 1 / (1 + e^(−z))    (7.3)
Final function the unit is computing

Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:

y = σ(w · x + b) = 1 / (1 + exp(−(w · x + b)))    (7.4)

Fig. 7.2 shows a final schematic of a basic neural unit. In this example the unit takes 3 input values x1, x2, and x3, computes a weighted sum, multiplying each value by a weight (w1, w2, and w3, respectively), adds them to a bias term b, and then passes the resulting sum through a sigmoid function to result in a number between 0 and 1.

Figure 7.2: A neural unit, taking 3 inputs x1, x2, and x3 (and a bias b that we represent as a weight for an input clamped at +1) and producing an output y. We include some convenient intermediate variables: the output of the summation, z, and the output of the sigmoid, a. In this case the output of the unit y is the same as a, but in deeper networks we'll reserve y to mean the final output of the entire network, leaving a as the activation of an individual node.

Let's walk through an example just to get an intuition. Let's suppose we have a unit with the following weight vector and bias:

w = [0.2, 0.3, 0.9]
b = 0.5

What would this unit do with the following input vector?

x = [0.5, 0.6, 0.1]

The resulting output y would be:

y = σ(w · x + b) = 1 / (1 + e^(−(w·x+b))) = 1 / (1 + e^(−(.5·.2 + .6·.3 + .1·.9 + .5))) = 1 / (1 + e^(−0.87)) = .70

In practice, the sigmoid is not commonly used as an activation function. A function that is very similar but almost always better is the tanh function shown in Fig. 7.3a; tanh is a variant of the sigmoid that ranges from −1 to +1:

y = (e^z − e^(−z)) / (e^z + e^(−z))    (7.5)

The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It's just the same as z when z is positive, and 0 otherwise:

y = max(z, 0)    (7.6)
Final unit again

(Slide diagram repeats the unit schematic: input layer x1, x2, x3 and +1, weights w1, w2, w3 and bias b, weighted sum z, non-linear activation function giving a, output value y.)
An example

Suppose a unit has:
w = [0.2, 0.3, 0.9]
b = 0.5

What happens with the input x = [0.5, 0.6, 0.1]?

y = σ(w · x + b) = 1 / (1 + e^(−(.5·.2 + .6·.3 + .1·.9 + .5))) = 1 / (1 + e^(−0.87)) = .70
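A quick NumPy check of this worked example (a sketch; the numbers are exactly those above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])
b = 0.5
x = np.array([0.5, 0.6, 0.1])

z = np.dot(w, x) + b   # 0.10 + 0.18 + 0.09 + 0.5 = 0.87
y = sigmoid(z)         # about 0.70
print(z, y)
```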
Non-Linear Activation Functions besides sigmoid

tanh: y = (e^z − e^(−z)) / (e^z + e^(−z))    (7.5)

Most common: ReLU (Rectified Linear Unit): y = max(z, 0)    (7.6)
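A minimal NumPy sketch of the three activation functions, evaluated at a few sample points to show their output ranges (the sample values of z are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # (e^z - e^-z) / (e^z + e^-z); maps z into (-1, 1)
    return np.tanh(z)

def relu(z):
    # max(z, 0): identity for positive z, 0 otherwise
    return np.maximum(z, 0.0)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))  # values squashed toward 0 or 1
print(tanh(z))     # values squashed toward -1 or +1
print(relu(z))     # [0. 0. 0. 1. 5.]
```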
Simple Neural Networks and Neural Language Models: The XOR problem
Figure 7.3: The tanh and ReLU activation functions.

These activation functions have different properties that make them useful for different language applications or network architectures. For example, the tanh function has the nice properties of being smoothly differentiable and mapping outlier values toward the mean. The rectifier function, on the other hand, has nice properties that result from it being very close to linear. In the sigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and have derivatives very close to 0. Zero derivatives cause problems for learning, because as we'll see in Section 7.4, we'll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradients that are almost 0 cause the error signal to get smaller and smaller until it is too small to be used for training, a problem called the vanishing gradient problem. Rectifiers don't have this problem, since the derivative of ReLU for high values of z is 1 rather than very close to 0.

7.2 The XOR problem

Can neural units compute simple functions of input?

Early in the history of neural networks it was realized that the power of neural networks, as with the real neurons that inspired them, comes from combining these units into larger networks. One of the most clever demonstrations of the need for multi-layer networks was the proof by Minsky and Papert (1969) that a single neural unit cannot compute some very simple functions of its input. Consider the task of computing elementary logical functions of two inputs, like AND, OR, and XOR. As a reminder, here are the truth tables for those functions:

AND          OR           XOR
x1 x2 | y    x1 x2 | y    x1 x2 | y
 0  0 | 0     0  0 | 0     0  0 | 0
 0  1 | 0     0  1 | 1     0  1 | 1
 1  0 | 0     1  0 | 1     1  0 | 1
 1  1 | 1     1  1 | 1     1  1 | 0

This example was first shown for the perceptron, which is a very simple neural unit that has a binary output and does not have a non-linear activation function.
Perceptrons: a very simple neural unit, with a binary output (0 or 1) and no non-linear activation function.
Easy to build AND or OR with perceptrons; see the sketch below.
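For instance, here is a small sketch of a perceptron (a thresholded weighted sum) computing AND and OR; the particular weights and biases are just one workable choice, not taken from the slide:

```python
import numpy as np

def perceptron(x, w, b):
    # binary output: 1 if the weighted sum is positive, else 0
    return int(np.dot(w, x) + b > 0)

# One workable choice of weights/biases (assumed, not from the slide):
# AND fires only when both inputs are 1; OR fires when at least one input is 1.
AND = dict(w=np.array([1.0, 1.0]), b=-1.5)
OR  = dict(w=np.array([1.0, 1.0]), b=-0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        print(x1, x2, perceptron(x, **AND), perceptron(x, **OR))
# The last two columns match the AND and OR truth tables above.
```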
Not possible to capture XOR with perceptrons. Pause the lecture and try for yourself!
Why? Perceptrons are linear classifiers.

The perceptron equation, given x1 and x2, is the equation of a line:

w1·x1 + w2·x2 + b = 0

(in standard linear format: x2 = (−w1/w2)·x1 + (−b/w2))

This line acts as a decision boundary:
0 if the input is on one side of the line
1 if it is on the other side of the line
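As an illustration (with arbitrary example values for w and b, not from the slide), the boundary line and the resulting 0/1 decision can be computed directly:

```python
import numpy as np

# Arbitrary example weights and bias (assumed for illustration).
w1, w2, b = 1.0, 1.0, -0.5

# The decision boundary w1*x1 + w2*x2 + b = 0 in slope-intercept form:
slope, intercept = -w1 / w2, -b / w2
print(f"boundary: x2 = {slope}*x1 + {intercept}")   # x2 = -1.0*x1 + 0.5

def classify(x1, x2):
    # 1 on one side of the line, 0 on the other
    return int(w1 * x1 + w2 * x2 + b > 0)

print(classify(0, 0), classify(1, 1))   # 0 1  (the two points lie on opposite sides)
```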
Decision boundaries

(Slide plots: a) x1 AND x2, b) x1 OR x2, c) x1 XOR x2, showing the four input points in the (x1, x2) plane; a single line can separate the 0 outputs from the 1 outputs for AND and OR, but not for XOR.)

XOR is not a linearly separable function!
Solution to the XOR problem

XOR can't be calculated by a single perceptron. XOR can be calculated by a layered network of units.

(Slide diagram: a two-layer network with ReLU hidden units h1 and h2. From the inputs x1, x2 and the +1 bias node, the weights into h1 are 1, 1, 0 and the weights into h2 are 1, 1, -1; the output y1 combines h1 and h2 with weights 1 and -2 and a bias of 0.)

XOR:
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0
The hidden representation h

(Slide diagram: the XOR network from the previous slide, next to two plots: a) the original x space with the four input points, b) the new, linearly separable h space they are mapped to, where h1 takes the values 0, 1, and 2.)

With learning: hidden layers will learn to form useful representations. A worked sketch of this network follows below.
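Here is a minimal NumPy sketch of that two-layer ReLU network, using the weights shown in the diagram; it prints the hidden representation h and the output y for each input, reproducing the XOR truth table:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hidden layer: rows are the weights into h1 and h2 (from the diagram).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])

# Output layer: weights on h1 and h2, plus a zero bias.
u = np.array([1.0, -2.0])
b_out = 0.0

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        h = relu(W @ x + b)          # hidden representation
        y = relu(u @ h + b_out)      # output unit (ReLU; values here are already 0 or 1)
        print(x1, x2, h, int(y))     # last column matches XOR
```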
Simple Neural Networks and Neural Language Models: Feedforward Neural Networks
Feedforward Neural Networks: can also be called multi-layer perceptrons (or MLPs) for historical reasons.
Binary Logistic Regression as a 1-layer Network

(We don't count the input layer in counting layers!)

y = σ(w · x + b)

Output layer: a single σ node; y is a scalar. Parameters: a weight vector w (w1 … wn) and a scalar bias b. Input layer: the vector x (x1 … xn), plus a +1 node for the bias.
Multinomial Logistic Regression as a 1-layer Network

A fully connected single-layer network:

y = softmax(Wx + b)

Output layer: softmax nodes y1 … yn; y is a vector. Parameters: a weight matrix W and a bias vector b. Input layer: scalars x1 … xn, plus a +1 node for the bias.
Reminder: softmax, a generalization of sigmoid

For a vector z of dimensionality k, the softmax is:

softmax(zi) = exp(zi) / Σⱼ exp(zⱼ)   (the sum runs over j = 1 … k)

Example: see the sketch below.
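A small sketch of the softmax computation on an arbitrary example vector (the values are illustrative, not from the slide):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])  # arbitrary example scores
p = softmax(z)
print(p)         # every value lies in (0, 1)
print(p.sum())   # 1.0 -- the outputs form a probability distribution
```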
Two-Layer Network with scalar output

h = σ(Wx + b)   (hidden units; the activation could instead be ReLU or tanh)
z = U · h
y = σ(z)   (y is a scalar)

Output layer: a single σ node. Hidden layer: units h, computed from the input vector x with weight matrix W (Wji connects input unit i to hidden unit j) and bias vector b. Input layer: the vector x, plus a +1 node.
Two-Layer Network with softmax output

h = σ(Wx + b)   (hidden units; could be ReLU or tanh)
z = U h
y = softmax(z)   (y is a vector)

Output layer: softmax nodes. Hidden layer and input layer as before, with weights W and bias b.
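A minimal NumPy sketch of this forward pass, with small randomly initialized weights standing in for trained parameters (the dimensions are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n_input, n_hidden, n_output = 4, 3, 2       # arbitrary sizes

W = rng.normal(size=(n_hidden, n_input))    # input -> hidden weights
b = np.zeros(n_hidden)
U = rng.normal(size=(n_output, n_hidden))   # hidden -> output weights

x = rng.normal(size=n_input)                # an arbitrary input vector
h = relu(W @ x + b)                         # hidden layer (using ReLU instead of sigmoid)
z = U @ h
y = softmax(z)                              # y is a vector that sums to 1
print(y, y.sum())
```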
Multi-layer Notation

z[1] = W[1] a[0] + b[1]
a[1] = g[1](z[1])   (e.g., ReLU)
z[2] = W[2] a[1] + b[2]
a[2] = g[2](z[2])   (sigmoid or softmax)
y = a[2]

Here a[0] is the input vector x (x1 … xn, plus a +1 node), W[n] and b[n] are the weight matrix and bias vector for layer n, and g[n] is that layer's activation function.
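In code, this layer-indexed notation corresponds to a simple loop over (W, b, activation) triples; a sketch with arbitrary small dimensions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# layers[n-1] holds (W[n], b[n], g[n]) in the slide's notation
layers = [
    (rng.normal(size=(3, 4)), np.zeros(3), relu),     # layer 1: 4 -> 3, ReLU
    (rng.normal(size=(1, 3)), np.zeros(1), sigmoid),  # layer 2: 3 -> 1, sigmoid
]

a = rng.normal(size=4)        # a[0] = x, an arbitrary input
for W, b, g in layers:
    z = W @ a + b             # z[n] = W[n] a[n-1] + b[n]
    a = g(z)                  # a[n] = g[n](z[n])
y = a                         # y = a[2]
print(y)
```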
Multi-layer Notation

(Slide diagram: the single-unit schematic again, with inputs x1, x2, x3 and +1, weights w1, w2, w3 and bias b, sum z, activation a, and output y, relabeled with the layer-indexed notation.)
Replacing the bias unit

Let's switch to a notation without the bias unit. It's just a notational change:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So for the input layer a[0]0 = 1, and a[1]0 = 1, a[2]0 = 1, …
Replacing the bias unit

Instead of: x = x1, x2, …, xn0
We'll do this: x = x0, x1, x2, …, xn0   (with x0 = 1)
Replacing the bias unit

(Slide diagrams: on the left, a network with inputs x1, x2, …, xn0, a separate +1 bias node with bias b, hidden units h1, h2, h3, …, hn1 with weights W, and outputs y1, y2, …, yn2 with weights U; on the right, the same network where the bias node is replaced by a dummy input x0 = 1 and the bias is folded into W.)
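A quick sketch checking that the two notations compute the same thing: folding b into W as the weights of a dummy input x0 = 1 (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 4))    # weights: hidden size 3, input size 4
b = rng.normal(size=3)         # separate bias vector
x = rng.normal(size=4)

# With an explicit bias:
h_with_bias = W @ x + b

# Without a bias unit: prepend x0 = 1 and make b the first column of W.
W_aug = np.hstack([b[:, None], W])
x_aug = np.concatenate([[1.0], x])
h_no_bias = W_aug @ x_aug

print(np.allclose(h_with_bias, h_no_bias))   # True: just a notational change
```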
Simple Neural Networks and Neural Language Models: Applying feedforward networks to NLP tasks
Use cases for feedforward networks

Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling

State-of-the-art systems use more powerful neural architectures, but simple models are useful to consider!
Classification: Sentiment Analysis

We could do exactly what we did with logistic regression: the input layer is binary features, as before, and the output layer is 0 or 1. (Slide diagram: inputs x1 … xn feed through weights W and U to the output.)
Feedforward nets for simple classification

(Slide diagram: logistic regression over features f1, f2, …, fn with weights W, shown next to a 2-layer feedforward network over the same features with weights W and U.)

Just adding a hidden layer to logistic regression allows the network to use non-linear interactions between features, which may (or may not) improve performance.
Even better: representation learning

The real power of deep learning comes from the ability to learn features from the data. Instead of using hand-built, human-engineered features for classification, use learned representations like embeddings! (Slide diagram: embeddings e1, e2, …, en feed through weights W and U to the output.)
Neural Net Classification with embeddings as input features!
Issue: texts come in different sizes

This assumes a fixed size length (3)! Kind of unrealistic.

Some simple solutions (more sophisticated solutions later):
1. Make the input the length of the longest review
   - If shorter, then pad with zero embeddings
   - Truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words (see the sketch below)
   - Take the mean of all the word embeddings
   - Take the element-wise max of all the word embeddings (for each dimension, pick the max value from all words)
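A minimal sketch of option 2, pooling a variable number of word embeddings into one fixed-size sentence embedding (the embeddings here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
emb_dim = 5
# Random stand-ins for the word embeddings of a 7-word review.
word_embeddings = rng.normal(size=(7, emb_dim))

mean_pooled = word_embeddings.mean(axis=0)   # mean of all word embeddings
max_pooled = word_embeddings.max(axis=0)     # element-wise max per dimension

print(mean_pooled.shape, max_pooled.shape)   # both (5,): fixed size regardless of text length
```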
Reminder: Multiclass Outputs

What if you have more than two output classes? Add more output units (one for each class) and use a softmax layer.