Understanding Recurrent Neural Networks: RNNs and LSTMs

Dive into the world of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) models for modeling time in neural networks. Explore the concepts of simple recurrent networks (Elman nets) and forward inference in simple RNNs. Learn about the inherent temporal nature of language and how RNNs represent time through cycles within network connections.

  • RNNs
  • LSTMs
  • Neural Networks
  • Time Modeling


Presentation Transcript


  1. Simple Recurrent Networks (RNNs or Elman Nets) RNNs and LSTMs

  2. Modeling Time in Neural Networks. Language is inherently temporal. Yet the simple NLP classifiers we've seen (for example, for sentiment analysis) mostly ignore time. (Feedforward neural LMs, and the transformers we'll see later, use a "moving window" approach to time.) Here we introduce a deep learning architecture with a different way of representing time: RNNs and their variants like LSTMs.

  3. Recurrent Neural Networks (RNNs) Any network that contains a cycle within its network connections. The value of some unit is directly, or indirectly, dependent on its own earlier outputs as an input.

  4. Simple Recurrent Nets (Elman nets). The hidden layer has a recurrence as part of its input: the activation value ht depends on xt but also on ht-1!

  5. Forward inference in simple RNNs Very similar to the feedforward networks we've seen!

  6. Forward inference and the RNN equations (textbook excerpt, Section 8.1).
     Figure 8.2: Simple recurrent neural network illustrated as a feedforward network. The hidden layer ht-1 from the prior time step is multiplied by weight matrix U and then added to the feedforward component from the current time step.
     ht = g(U ht-1 + W xt)   (8.1)
     yt = f(V ht)            (8.2)
     Let's refer to the input, hidden and output layer dimensions as din, dh, and dout respectively. Given this, our three parameter matrices are: W ∈ R^(dh x din), U ∈ R^(dh x dh), and V ∈ R^(dout x dh). We compute yt via a softmax computation that gives a probability distribution over the possible output classes:
     yt = softmax(V ht)      (8.3)
     The fact that the computation at time t requires the value of the hidden layer from time t-1 mandates an incremental inference algorithm that proceeds from the start of the sequence to the end, as illustrated in Fig. 8.3. The sequential nature of simple recurrent networks can also be seen by unrolling the network in time as shown in Fig. 8.4. In this figure, the various layers of units are copied for each time step to illustrate that they will have differing values over time. However, the various weight matrices are shared across time.
     Figure 8.3: Forward inference in a simple recurrent network. The matrices U, V and W are shared across time, while new values for h and y are calculated with each time step.
       function FORWARDRNN(x, network) returns output sequence y
         h0 <- 0
         for i <- 1 to LENGTH(x) do
           hi <- g(U hi-1 + W xi)
           yi <- f(V hi)
         return y
     8.1.2 Training. As with feedforward networks, we'll use a training set, a loss function, and backpropagation to obtain the gradients needed to adjust the weights in these recurrent networks. As shown in Fig. 8.2, we now have 3 sets of weights to update: W, U, and V (detailed on slide 8).
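
A minimal NumPy sketch of the forward pass in Fig. 8.3, assuming tanh for the hidden activation g and softmax for the output f; the matrix names U, W, V and the dimensions din, dh, dout follow the equations above, while the toy sizes and random weights are placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward_rnn(x_seq, U, W, V):
    """Forward inference in a simple RNN (Fig. 8.3).
    x_seq: list of input vectors of shape (d_in,). Returns y_1..y_n."""
    h = np.zeros(U.shape[0])       # h_0 = 0
    ys = []
    for x in x_seq:
        h = np.tanh(U @ h + W @ x)     # h_t = g(U h_{t-1} + W x_t)   (8.1)
        ys.append(softmax(V @ h))      # y_t = softmax(V h_t)         (8.3)
    return ys

# Toy example with arbitrary sizes: d_in=4, d_h=3, d_out=2
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), rng.normal(size=(2, 3))
outputs = forward_rnn([rng.normal(size=4) for _ in range(5)], U, W, V)
print(outputs[-1])                 # a probability distribution over the 2 output classes
```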

  7. Inference has to be incremental. Computing h at time t requires that we first computed h at the previous time step! Because the computation at time t needs the hidden layer from time t-1, forward inference must proceed incrementally from the start of the sequence to the end, using the FORWARDRNN algorithm of Fig. 8.3 above. The matrices U, V and W are shared across time, while new values for h and y are calculated with each time step.

  8. Training in simple RNNs. Just like feedforward training: a training set, a loss function, backpropagation. Weights that need to be updated: W, the weights from the input layer to the hidden layer; U, the weights from the previous hidden layer to the current hidden layer; and V, the weights from the hidden layer to the output layer.

  9. Training in simple RNNs: unrolling in time. Unlike feedforward networks: 1. To compute the loss function for the output at time t we need the hidden layer from time t-1. 2. The hidden layer at time t influences both the output at time t and the hidden layer at time t+1 (and hence the output and loss at t+1). So: to measure the error accruing to ht, we need to know its influence on both the current output as well as the ones that follow.

  10. Unrolling in time (2). We unroll a recurrent network into a feedforward computational graph, eliminating recurrence: 1. Given an input sequence, 2. Generate an unrolled feedforward network specific to that input, 3. Use the graph to train the weights directly via ordinary backprop (or to do forward inference).

  11. Simple Recurrent Networks (RNNs or Elman Nets) RNNs and LSTMs

  12. RNNs as Language Models RNNs and LSTMs

  13. Reminder: Language Modeling

  14. The size of the conditioning context for different LMs. The n-gram LM: context is the n-1 prior words we condition on. The feedforward LM: context is the window size. The RNN LM: no fixed context size; ht-1 represents the entire history.

  15. FFN LMs vs RNN LMs. [Figure: (a) a feedforward LM, whose context is a fixed window of embeddings et-2, et-1, et fed through W; (b) an RNN LM, whose hidden state ht-1 carries the prior context through U.]

  16. Forward inference in the RNN LM. Given an input X of N tokens represented as one-hot vectors: use the embedding matrix E to get the embedding et for the current token xt, combine it with the previous hidden state to compute ht, and apply a softmax over V ht to get a distribution over the next word.
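
A hedged sketch of one time step of RNN LM forward inference, assuming (as in the shapes slide and the textbook excerpt below) that the embedding and hidden dimensions are both d, so E is [d x |V|]. Token ids stand in for one-hot vectors, since E @ one_hot(w) is just column w of E; the sizes and random weights are placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_step(token_id, h_prev, E, W, U, V):
    """One step of RNN LM inference: embed the current token, combine it with
    the previous hidden state, and return a distribution over the vocabulary."""
    e_t = E[:, token_id]                    # e_t = E x_t (column lookup for a one-hot x_t)
    h_t = np.tanh(U @ h_prev + W @ e_t)     # h_t = g(U h_{t-1} + W e_t)
    y_t = softmax(V @ h_t)                  # y_t = softmax(V h_t), shape (|V|,)
    return y_t, h_t

# Toy sizes: model dimension d=8, vocabulary |V|=20 (arbitrary)
d, vocab = 8, 20
rng = np.random.default_rng(1)
E, W, U, V = (rng.normal(size=s) for s in [(d, vocab), (d, d), (d, d), (vocab, d)])
y, h = rnn_lm_step(token_id=3, h_prev=np.zeros(d), E=E, W=W, U=U, V=V)
print(y.sum())   # 1.0: a probability distribution over the next word
```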

  17. Shapes. ^yt: |V| x 1. V: |V| x d. ht (and ht-1, ht-2): d x 1. U: d x d. W: d x d. et (and et-1, et-2): d x 1.

  18. RNN LM shapes, sequence probability, and training (textbook excerpt, Section 8.2).
     Figure 8.5: Simplified sketch of two LM architectures moving through a text, showing a schematic context of three tokens: (a) a feedforward neural language model which has a fixed context input to the weight matrix W, (b) an RNN language model, in which the hidden state ht-1 summarizes the prior context.
     When we do language modeling with RNNs (and we'll see this again in Chapter 9 with transformers), it's convenient to make the assumption that the embedding dimension de and the hidden dimension dh are the same. So we'll just call both of these the model dimension d. So the embedding matrix E is of shape [d x |V|], and xt is a one-hot vector of shape [|V| x 1]. The product et is thus of shape [d x 1]. W and U are of shape [d x d], so ht is also of shape [d x 1]. V is of shape [|V| x d], so the result of V h is a vector of shape [|V| x 1]. This vector can be thought of as a set of scores over the vocabulary given the evidence provided in h. Passing these scores through the softmax normalizes the scores into a probability distribution. The probability that a particular word k in the vocabulary is the next word is represented by yt[k], the kth component of yt:
     P(wt+1 = k | w1,...,wt) = yt[k]   (8.7)
     The probability of an entire sequence is just the product of the probabilities of each item in the sequence, where we'll use yi[wi] to mean the probability of the true word wi at time step i:
     P(w1:n) = Π(i=1..n) P(wi | w1:i-1)   (8.8)
             = Π(i=1..n) yi[wi]           (8.9)
     8.2.2 Training an RNN language model. To train an RNN as a language model, we use the same self-supervision (or self-training) algorithm we saw in Section ??: we take a corpus of text as training material and at each time step t ask the model to predict the next word. We call such a model self-supervised because we don't have to add any special gold labels to the data; the natural sequence of words is its own supervision! We simply train the model to minimize the error in predicting the true next word in the training sequence, using cross-entropy as the loss function. Recall that the cross-entropy loss measures the difference between a predicted probability distribution and the correct distribution.

  19. Training the RNN LM: self-supervision. Take a corpus of text as training material; at each time step t, ask the model to predict the next word. Why it's called self-supervised: we don't need human labels; the text is its own supervision signal. We train the model to minimize the error in predicting the true next word in the training sequence, using cross-entropy as the loss function.

  20. Cross-entropy loss for LMs, teacher forcing, and weight tying (textbook excerpt, Section 8.2).
     Figure 8.6: Training RNNs as language models.
     Cross-entropy loss: the difference between a predicted probability distribution and the correct distribution:
     LCE = - Σ(w ∈ V) yt[w] log ŷt[w]   (8.10)
     CE loss for LMs is simpler! In the case of language modeling, the correct distribution yt comes from knowing the next word. This is represented as a one-hot vector corresponding to the vocabulary, where the entry for the actual next word is 1 and all the other entries are 0. So the CE loss for LMs is determined only by the probability the model assigns to the correct next word. At time t, the CE loss is the negative log probability the model assigns to the next word in the training sequence:
     LCE(ŷt, yt) = - log ŷt[wt+1]   (8.11)
     Thus at each word position t of the input, the model takes as input the correct word wt together with ht-1, encoding information from the preceding w1:t-1, and uses them to compute a probability distribution over possible next words so as to compute the model's loss for the next token wt+1. Then we move to the next word; we ignore what the model predicted for the next word and instead use the correct word wt+1 along with the prior history encoded to estimate the probability of token wt+2. This idea that we always give the model the correct history sequence to predict the next word (rather than feeding the model its best guess from the previous time step) is called teacher forcing. The weights in the network are adjusted to minimize the average CE loss over the training sequence via gradient descent. Fig. 8.6 illustrates this training regimen.
     8.2.3 Weight Tying. Careful readers may have noticed that the input embedding matrix E and the final layer matrix V, which feeds the output softmax, are quite similar. The columns of E represent the word embeddings for each word in the vocabulary, learned during the training process with the goal that words that have similar meaning and function will have similar embeddings. And, since when we use RNNs for language modeling we make the assumption that the embedding dimension and the hidden dimension are the same (the model dimension d), E and V have transposed shapes, which is what makes tying them possible (see slide 22).

  21. Teacher forcing. We always give the model the correct history to predict the next word (rather than feeding the model its possibly buggy guess from the prior time step). This is called teacher forcing (in training we force the context to be correct, based on the gold words). What teacher forcing looks like: at word position t the model takes as input the correct word wt together with ht-1 and computes a probability distribution over possible next words. That gives the loss for the next token wt+1. Then we move on to the next word, ignore what the model predicted for the next word, and instead use the correct word wt+1 along with the prior history encoded to estimate the probability of token wt+2.
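
A forward-only sketch of teacher-forced training on one sequence, assuming the same toy RNN LM pieces as above. The per-step loss is LCE = -log ŷt[wt+1]; in practice the gradients for E, W, U, V would be obtained by backpropagation through time with an autodiff library rather than written by hand.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def teacher_forced_loss(token_ids, E, W, U, V):
    """Average cross-entropy loss over a training sequence with teacher forcing:
    at each step the *gold* token is fed in, regardless of what the model predicted."""
    h = np.zeros(U.shape[0])
    losses = []
    for w_t, w_next in zip(token_ids[:-1], token_ids[1:]):
        e_t = E[:, w_t]                        # embed the correct current word w_t
        h = np.tanh(U @ h + W @ e_t)
        y_hat = softmax(V @ h)
        losses.append(-np.log(y_hat[w_next]))  # L_CE = -log yhat_t[w_{t+1}]
    return np.mean(losses)

d, vocab = 8, 20
rng = np.random.default_rng(2)
E, W, U, V = (rng.normal(size=s) for s in [(d, vocab), (d, d), (d, d), (vocab, d)])
print(teacher_forced_loss([2, 7, 7, 1, 0], E, W, U, V))
```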

  22. Weight tying. The input embedding matrix E and the final layer matrix V are similar. The columns of E represent the word embeddings for each word in the vocab; E is [d x |V|]. The final layer matrix V gives a score (logit) for each word in the vocab; V is [|V| x d]. Instead of having separate E and V, we just tie them together, using E^T instead of V: yt = softmax(E^T ht).
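
A small sketch of weight tying under the same assumed toy setup: instead of a separate output matrix V of shape [|V| x d], the logits are computed with E.T, which has exactly that shape when E is [d x |V|].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, vocab = 8, 20
rng = np.random.default_rng(3)
E = rng.normal(size=(d, vocab))    # embedding matrix, [d x |V|]
h_t = rng.normal(size=d)           # some hidden state, [d x 1]

# Untied: y_t = softmax(V h_t) with a separate V of shape [|V| x d].
# Tied:   reuse the embeddings as the output layer, i.e. use E^T in place of V.
y_t = softmax(E.T @ h_t)           # shape (|V|,): one parameter matrix instead of two
print(y_t.shape)
```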

  23. RNNs as Language Models RNNs and LSTMs

  24. RNNs for Sequences RNNs and LSTMs

  25. RNNs for sequence labeling. Assign a label to each element of a sequence, e.g. part-of-speech tagging. [Figure: the words "Janet will back the bill" pass through an embedding layer e and RNN layer(s) h; a softmax over tags (V h) followed by an argmax produces the tag sequence NNP MD VB DT NN.]
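
A sketch of RNN sequence labeling under assumed toy shapes: run the RNN over the token embeddings, apply a softmax over the tag set at each position, and take the argmax tag per token. A real tagger would of course be trained; the weights and embeddings here are random placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_tagger(embeddings, U, W, V, tagset):
    """Assign one tag per input token: softmax over tags at every time step."""
    h = np.zeros(U.shape[0])
    tags = []
    for e_t in embeddings:
        h = np.tanh(U @ h + W @ e_t)
        scores = softmax(V @ h)                # distribution over the tag set
        tags.append(tagset[int(np.argmax(scores))])
    return tags

tagset = ["NNP", "MD", "VB", "DT", "NN"]
d = 6
rng = np.random.default_rng(4)
U, W, V = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(len(tagset), d))
sentence = ["Janet", "will", "back", "the", "bill"]
embeddings = [rng.normal(size=d) for _ in sentence]   # stand-ins for learned embeddings
print(list(zip(sentence, rnn_tagger(embeddings, U, W, V, tagset))))
```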

  26. RNNs for sequence classification and generation (textbook excerpt, Section 8.3).
     Text classification. Figure 8.8: Sequence classification using a simple RNN combined with a feedforward network. The final hidden state from the RNN is used as the input to a feedforward network that performs the classification.
     Instead of taking the last state, we could use some pooling function of all the output states, like mean pooling, which pools all the n hidden states by taking their element-wise mean:
     hmean = (1/n) Σ(i=1..n) hi   (8.15)
     Or we can take the element-wise max; the element-wise max of a set of n vectors is a new vector whose kth element is the max of the kth elements of all the n vectors.
     The long contexts of RNNs make it quite difficult to successfully backpropagate error all the way through the entire input; we'll talk about this problem, and some standard solutions, in Section 8.5.
     8.3.3 Generation with RNN-Based Language Models. RNN-based language models can also be used to generate text. Text generation is of enormous practical importance, part of tasks like question answering, machine translation, text summarization, grammar correction, story generation, and conversational dialogue; any task where a system needs to produce text conditioned on some other text. This use of a language model to generate text is one of the areas in which the impact of neural language models on NLP has been the largest. Text generation, along with image generation and code generation, constitutes a new area of AI that is often called generative AI.
     Recall back in Chapter 3 we saw how to generate text from an n-gram language model by adapting a sampling technique suggested at about the same time by Claude Shannon (Shannon, 1951) and the psychologists George Miller and Jennifer Selfridge (Miller and Selfridge, 1950). We first randomly sample a word to begin a sequence based on its suitability as the start of a sequence. We then continue to sample words conditioned on our previous choices until we reach a pre-determined length, or an end-of-sequence token is generated.
     Today, this approach of using a language model to incrementally generate words by repeatedly sampling the next word conditioned on our previous choices is called autoregressive generation or causal LM generation. The procedure is basically the same as that described on page ??, but adapted to a neural context: sample a word in the output from the softmax distribution that results from using the beginning-of-sentence marker, <s>, as the first input.
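
A sketch of sequence classification with mean pooling (Eq. 8.15 above): collect the hidden states, average them element-wise (or take the element-wise max), and pass the pooled vector to a small feedforward classifier. All weights here are random placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sequence(embeddings, U, W, W_ffn, pool="mean"):
    """Sequence classification: pool the RNN hidden states, then classify."""
    h = np.zeros(U.shape[0])
    states = []
    for e_t in embeddings:
        h = np.tanh(U @ h + W @ e_t)
        states.append(h)
    H = np.stack(states)                                            # (n, d_h)
    pooled = H.mean(axis=0) if pool == "mean" else H.max(axis=0)    # h_mean or element-wise max
    return softmax(W_ffn @ pooled)                                  # distribution over classes

d, n_classes = 6, 3
rng = np.random.default_rng(5)
U, W, W_ffn = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(n_classes, d))
print(classify_sequence([rng.normal(size=d) for _ in range(7)], U, W, W_ffn))
```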

  27. Autoregressive generation. [Figure: starting from <s>, each input word is embedded, passed through the RNN, and a word is sampled from the softmax ("So", "long", "and", ...); each sampled word is fed back in as the next input word.]
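
A sketch of autoregressive generation with the toy RNN LM from above: start from an assumed <s> token id, sample each next word from the softmax distribution, feed the sampled word back in as the next input, and stop at an assumed </s> id or a length limit. Weights and token ids are placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(E, W, U, V, bos_id, eos_id, max_len=20, seed=0):
    """Autoregressive generation: each sampled word becomes the next input."""
    rng = np.random.default_rng(seed)
    h = np.zeros(U.shape[0])
    w = bos_id
    out = []
    for _ in range(max_len):
        h = np.tanh(U @ h + W @ E[:, w])       # condition on the last generated word
        y = softmax(V @ h)
        w = int(rng.choice(len(y), p=y))       # sample the next word from y_t
        if w == eos_id:                        # stop at the end-of-sequence token
            break
        out.append(w)
    return out

d, vocab = 8, 20
rng = np.random.default_rng(6)
E, W, U, V = (rng.normal(size=s) for s in [(d, vocab), (d, d), (d, d), (vocab, d)])
print(generate(E, W, U, V, bos_id=0, eos_id=1))   # a list of sampled token ids
```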

  28. Stacked RNNs. [Figure: inputs x1...xn feed RNN 1; its outputs feed RNN 2; RNN 2's outputs feed RNN 3, which produces the final outputs y1...yn.]

  29. Stacked and bidirectional RNNs (textbook excerpt, Section 8.4).
     Stacked RNNs use the entire sequence of outputs from one RNN as an input sequence to another one. Stacked RNNs consist of multiple networks where the output of one layer serves as the input to a subsequent layer, as shown in Fig. 8.10.
     Figure 8.10: Stacked recurrent networks. The output of a lower level serves as the input to higher levels, with the output of the last network serving as the final output.
     Stacked RNNs generally outperform single-layer networks. One reason for this success seems to be that the network induces representations at differing levels of abstraction across layers. Just as the early stages of the human visual system detect edges that are then used for finding larger regions and shapes, the initial layers of stacked networks can induce representations that serve as useful abstractions for further layers, representations that might prove difficult to induce in a single RNN. The optimal number of stacked RNNs is specific to each application and to each training set. However, as the number of stacks is increased the training costs rise quickly.
     8.4.2 Bidirectional RNNs. The RNN uses information from the left (prior) context to make its predictions at time t. But in many applications we have access to the entire input sequence; in those cases we would like to use words from the context to the right of t. One way to do this is to run two separate RNNs, one left-to-right and one right-to-left, and concatenate their representations.
     In the left-to-right RNNs we've discussed so far, the hidden state at a given time t represents everything the network knows about the sequence up to that point. The state is a function of the inputs x1,...,xt and represents the context of the network to the left of the current time:
     hf_t = RNNforward(x1,...,xt)   (8.16)
     This new notation hf_t simply corresponds to the normal hidden state at time t, representing everything the network has gleaned from the sequence so far.
     To take advantage of context to the right of the current input, we can train an RNN on a reversed input sequence. With this approach, the hidden state at time t represents information about the sequence to the right of the current input:
     hb_t = RNNbackward(xt,...,xn)   (8.17)
     Here, the hidden state hb_t represents all the information we have discerned about the sequence from t to the end of the sequence.
     A bidirectional RNN (Schuster and Paliwal, 1997) combines two independent RNNs, one where the input is processed from the start to the end, and the other from the end to the start. We then concatenate the two representations computed by the networks into a single vector that captures both the left and right contexts of an input at each point in time. Here we use either the semicolon ";" or the equivalent symbol ⊕ to mean vector concatenation:
     ht = [hf_t ; hb_t] = hf_t ⊕ hb_t   (8.18)
     Figure 8.11: A bidirectional RNN. Separate models are trained in the forward and backward directions, with the output of each model at each time point concatenated to represent the bidirectional state at that time point.
     Fig. 8.11 illustrates such a bidirectional network that concatenates the outputs of the forward and backward pass. Other simple ways to combine the forward and backward contexts include element-wise addition or multiplication. The output at each step in time thus captures information to the left and to the right of the current input. In sequence labeling applications, these concatenated outputs can serve as the basis for a local labeling decision.
     Bidirectional RNNs have also proven to be quite effective for sequence classification. Recall from Fig. 8.8 that for sequence classification we used the final hidden state of the RNN as the input to a subsequent feedforward classifier. A difficulty with this approach is that the final state naturally reflects more information about the end of the sentence than its beginning. Bidirectional RNNs provide a simple solution to this problem; as shown in Fig. 8.12, we simply combine the final hidden states from the forward and backward passes (for example by concatenation) and use that as input for follow-on processing.
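
A sketch of Eq. 8.16-8.18: one RNN pass left-to-right, one right-to-left over the reversed input, then concatenate hf_t and hb_t at each position. Separate (random, untrained) parameters are assumed for the two directions.

```python
import numpy as np

def run_rnn(embeddings, U, W):
    """Return the list of hidden states h_1..h_n for one direction."""
    h = np.zeros(U.shape[0])
    states = []
    for e_t in embeddings:
        h = np.tanh(U @ h + W @ e_t)
        states.append(h)
    return states

def bidirectional_states(embeddings, params_fwd, params_bwd):
    hf = run_rnn(embeddings, *params_fwd)                    # h^f_t = RNNforward(x_1..x_t)
    hb = run_rnn(embeddings[::-1], *params_bwd)[::-1]        # h^b_t = RNNbackward(x_t..x_n)
    return [np.concatenate([f, b]) for f, b in zip(hf, hb)]  # h_t = [h^f_t ; h^b_t]

d = 6
rng = np.random.default_rng(7)
params_fwd = (rng.normal(size=(d, d)), rng.normal(size=(d, d)))
params_bwd = (rng.normal(size=(d, d)), rng.normal(size=(d, d)))
states = bidirectional_states([rng.normal(size=d) for _ in range(5)], params_fwd, params_bwd)
print(states[0].shape)    # (12,): left and right context concatenated at each position
```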

  30. Bidirectional RNNs for classification. [Figure: RNN 1 runs left-to-right and RNN 2 right-to-left over x1...xn; the final forward state hn and the final backward state h1 are combined and passed to a feedforward network and softmax.]

  31. RNNs for Sequences RNNs and LSTMs

  32. The LSTM RNNs and LSTMs

  33. Motivating the LSTM: dealing with distance. It's hard to assign probabilities accurately when the relevant context is very far away: "The flights the airline was canceling were full." Hidden layers are being forced to do two things at once: provide information useful for the current decision, and update and carry forward information required for future decisions. Another problem: during backprop we have to repeatedly multiply gradients through time and through many h's, giving the "vanishing gradient" problem.

  34. The LSTM: Long short-term memory network. LSTMs divide the context-management problem into two subproblems: removing information no longer needed from the context, and adding information likely to be needed for later decision making. LSTMs add an explicit context layer, and neural circuits with gates to control the flow of information.

  35. Forget gate Deletes information from the context that is no longer needed.

  36. Regular passing of information

  37. Add gate. Selects information to add to the current context. We add this to the modified context vector to get our new context vector.
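
For reference, a hedged sketch of the standard forget-gate and add-gate equations (the usual LSTM formulation, with bias terms omitted to match the style of the output-gate equations on the next slide); sigma is the sigmoid and the circled dot is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(U_f h_{t-1} + W_f x_t) && \text{forget gate}\\
k_t &= c_{t-1} \odot f_t             && \text{delete stale information from the context}\\
g_t &= \tanh(U_g h_{t-1} + W_g x_t)  && \text{candidate information from the current input}\\
i_t &= \sigma(U_i h_{t-1} + W_i x_t) && \text{add gate}\\
j_t &= g_t \odot i_t                 && \text{select what to add}\\
c_t &= j_t + k_t                     && \text{new context vector}
\end{aligned}
```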

  38. The output gate and the full LSTM unit (textbook excerpt, Section 8.5).
     Figure 8.13: A single LSTM unit displayed as a computation graph. The inputs to each unit consist of the current input, x, the previous hidden state, ht-1, and the previous context, ct-1. The outputs are a new hidden state, ht, and an updated context, ct.
     Output gate: the final gate we'll use is the output gate, which is used to decide what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions):
     ot = σ(Uo ht-1 + Wo xt)   (8.26)
     ht = ot ⊙ tanh(ct)        (8.27)
     Fig. 8.13 illustrates the complete computation for a single LSTM unit. Given the appropriate weights for the various gates, an LSTM accepts as input the context layer and hidden layer from the previous time step, along with the current input vector. It then generates updated context and hidden vectors as output. It is the hidden state, ht, that provides the output for the LSTM at each time step. This output can be used as the input to subsequent layers in a stacked RNN, or at the final layer of a network ht can be used to provide the final output of the LSTM.
     8.5.1 Gated Units, Layers and Networks. The neural units used in LSTMs are obviously much more complex than those used in basic feedforward networks. Fortunately, this complexity is encapsulated within the basic processing units, allowing us to maintain modularity and to easily experiment with different architectures. To see this, consider Fig. 8.14, which illustrates the inputs and outputs associated with each kind of unit. At the far left, (a) is the basic feedforward unit, where a single set of weights and a single activation function determine its output, and when arranged in a layer there are no connections among the units in the layer. Next, (b) represents the unit in a simple recurrent network. Now there are two inputs and an additional set of weights to go with it. However, there is still a single activation function and output. The increased complexity of the LSTM units is encapsulated within the unit itself. The only additional external complexity for the LSTM over the basic recurrent unit (b) is the presence of the additional context vector as an input and output. This modularity is key to the power and widespread applicability of LSTM units. LSTM units (or other varieties, like GRUs) can be substituted into any of the network architectures described in Section 8.4. And, as with simple RNNs, multi-layered networks making use of gated units can be unrolled into deep feedforward networks.

  39. The LSTM. [Figure: a single LSTM unit, showing how xt, ht-1, and ct-1 flow through the forget gate f, the add gate (i and g), and the output gate o, via σ, tanh, ⊙, and + operations, to produce the new context ct and hidden state ht.]
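
A NumPy sketch of one LSTM step combining the three gates above, assuming the standard formulation with bias terms omitted; the parameter dictionary, sizes, and random weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM unit step: returns the new hidden state h_t and context c_t.
    P maps gate names ('f', 'i', 'g', 'o') to (U, W) weight pairs."""
    f = sigmoid(P["f"][0] @ h_prev + P["f"][1] @ x_t)   # forget gate
    k = c_prev * f                                      # drop stale context
    g = np.tanh(P["g"][0] @ h_prev + P["g"][1] @ x_t)   # candidate information
    i = sigmoid(P["i"][0] @ h_prev + P["i"][1] @ x_t)   # add gate
    c_t = g * i + k                                     # updated context vector
    o = sigmoid(P["o"][0] @ h_prev + P["o"][1] @ x_t)   # output gate
    h_t = o * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

d_in, d_h = 4, 3
rng = np.random.default_rng(8)
P = {k: (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))) for k in "figo"}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, P)
print(h, c)
```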

  40. Units. [Figure: comparison of (a) a feedforward unit, (b) a simple recurrent network (SRN) unit, and (c) an LSTM unit, showing the inputs and outputs of each.]

  41. The LSTM RNNs and LSTMs

  42. The LSTM Encoder-Decoder Architecture RNNs and LSTMs

  43. Four architectures for NLP tasks with RNNs

  44. 3 components of an encoder-decoder. 1. An encoder that accepts an input sequence, x1:n, and generates a corresponding sequence of contextualized representations, h1:n. 2. A context vector, c, which is a function of h1:n and conveys the essence of the input to the decoder. 3. A decoder, which accepts c as input and generates an arbitrary-length sequence of hidden states h1:m, from which a corresponding sequence of output states y1:m can be obtained.

  45. Encoder-decoder

  46. Encoder-decoder for translation Regular language modeling

  47. Encoder-decoder for translation. Let x be the source text plus a separator token <s>, and y the target. Let x = "The green witch arrived <s>". Let y = "llegó la bruja verde".

  48. Encoder-decoder simplified. [Figure: the source text "the green witch arrived", the separator <s>, and the target text "llegó la bruja verde </s>" flow through the embedding layer, hidden layer(s), and softmax; the output over the source portion is ignored, and the hidden state hn at the separator starts generation of the target.]

  49. The RNN encoder-decoder for translation (textbook excerpt, Section 8.7).
     Figure 8.17: Translating a single sentence (inference time) in the basic RNN version of the encoder-decoder approach to machine translation. Source and target sentences are concatenated with a separator token in between, and the decoder uses context information from the encoder's last hidden state.
     Fig. 8.17 shows an English source text ("the green witch arrived"), a sentence separator token (<s>), and a Spanish target text ("llegó la bruja verde"). To translate a source text, we run it through the network performing forward inference to generate hidden states until we get to the end of the source. Then we begin autoregressive generation, asking for a word in the context of the hidden layer from the end of the source input as well as the end-of-sentence marker. Subsequent words are conditioned on the previous hidden state and the embedding for the last word generated.
     Let's formalize and generalize this model a bit in Fig. 8.18. (To help keep things straight, we'll use the superscripts e and d where needed to distinguish the hidden states of the encoder and the decoder.) The elements of the network on the left process the input sequence x and comprise the encoder. While our simplified figure shows only a single network layer for the encoder, stacked architectures are the norm, where the output states from the top layer of the stack are taken as the final representation, and the encoder consists of stacked biLSTMs where the hidden states from the top layers of the forward and backward passes are concatenated to provide the contextualized representations for each time step.
     The entire purpose of the encoder is to generate a contextualized representation of the input. This representation is embodied in the final hidden state of the encoder, he_n. This representation, also called c for context, is then passed to the decoder. The simplest version of the decoder network would take this state and use it just to initialize the first hidden state of the decoder; the first decoder RNN cell would use c as its prior hidden state hd_0. The decoder would then autoregressively generate a sequence of outputs, an element at a time, until an end-of-sequence marker is generated. Each hidden state is conditioned on the previous hidden state and the output generated in the previous state.
     Encoder-decoder showing context: as Fig. 8.18 shows, we do something more complex. We make the context vector c available to more than just the first decoder hidden state, to ensure that the influence of the context vector c doesn't wane as the output sequence is generated. We do this by adding c as a parameter to the computation of the current hidden state, using the following equation:
     hd_t = g(ŷt-1, hd_t-1, c)   (8.32)

  50. Encoder-decoder equations and training (textbook excerpt, Section 8.7).
     Figure 8.18: A more formal version of translating a sentence at inference time in the basic RNN-based encoder-decoder architecture. The final hidden state of the encoder RNN, he_n, serves as the context for the decoder in its role as hd_0 in the decoder RNN, and is also made available to each decoder hidden state.
     Encoder-decoder equations: now we're ready to see the full equations for this version of the decoder in the basic encoder-decoder model, with context available at each decoding time step. Recall that g is a stand-in for some flavor of RNN and ŷt-1 is the embedding for the output sampled from the softmax at the previous step:
     c = he_n
     hd_0 = c
     hd_t = g(ŷt-1, hd_t-1, c)
     yt = softmax(hd_t)   (8.33)
     Thus yt is a vector of probabilities over the vocabulary, representing the probability of each word occurring at time t. To generate text, we sample from this distribution yt. For example, the greedy choice is simply to choose the most probable word to generate at each time step. We'll introduce more sophisticated sampling methods in Section ??.
     8.7.1 Training the Encoder-Decoder Model. Encoder-decoder architectures are trained end-to-end. Each training example is a tuple of paired strings, a source and a target. Concatenated with a separator token, these source-target pairs can now serve as training data. For MT, the training data typically consists of sets of sentences and their translations. These can be drawn from standard datasets of aligned sentence pairs, as we'll discuss in Section ??. Once we have a training set, the training itself proceeds as with any RNN-based language model. The network is given the source text and then, starting with the separator token, is trained autoregressively to predict the next word, as shown in Fig. 8.19.
     Note the differences between training (Fig. 8.19) and inference (Fig. 8.17) with respect to the outputs at each time step. The decoder during inference uses its own estimated output ŷt as the input for the next time step xt+1. Thus the decoder will tend to deviate more and more from the gold target sentence as it keeps generating more tokens. In training, therefore, it is more common to use teacher forcing in the decoder: we force the system to use the gold target token from training as the next input, rather than the decoder's own (possibly erroneous) prediction.
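
A sketch of the decoding equations above (8.32-8.33) with a simple RNN standing in for g: the encoder's final hidden state becomes c and hd_0, and c is also fed into every decoder step. A projection V to the vocabulary, an extra context matrix C_d, greedy decoding, and all sizes and random weights are assumptions for this toy version.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(src_embeddings, U_e, W_e):
    """Run the encoder RNN; the final hidden state is the context c = h^e_n."""
    h = np.zeros(U_e.shape[0])
    for e_t in src_embeddings:
        h = np.tanh(U_e @ h + W_e @ e_t)
    return h

def decode(c, E, U_d, W_d, C_d, V, bos_id, eos_id, max_len=20):
    """Greedy decoding: h^d_t = g(yhat_{t-1}, h^d_{t-1}, c); y_t = softmax(.)."""
    h = c                                                 # h^d_0 = c
    w = bos_id
    out = []
    for _ in range(max_len):
        h = np.tanh(U_d @ h + W_d @ E[:, w] + C_d @ c)    # context c fed at every step
        y = softmax(V @ h)
        w = int(np.argmax(y))                             # greedy choice of the most probable word
        if w == eos_id:
            break
        out.append(w)
    return out

d, vocab = 8, 20
rng = np.random.default_rng(9)
U_e, W_e = rng.normal(size=(d, d)), rng.normal(size=(d, d))
E = rng.normal(size=(d, vocab))
U_d, W_d, C_d, V = (rng.normal(size=s) for s in [(d, d), (d, d), (d, d), (vocab, d)])
c = encode([rng.normal(size=d) for _ in range(4)], U_e, W_e)
print(decode(c, E, U_d, W_d, C_d, V, bos_id=0, eos_id=1))   # generated token ids
```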
