Recurrent Neural Networks (RNNs) and LSTM Variants

Some RNN Variants
Arun Mallya
Best viewed with Computer Modern fonts installed
Outline

Why Recurrent Neural Networks (RNNs)?
The Vanilla RNN unit
The RNN forward pass
Backpropagation refresher
The RNN backward pass
Issues with the Vanilla RNN
The Long Short-Term Memory (LSTM) unit
The LSTM Forward & Backward pass
LSTM variants and tips
  Peephole LSTM
  GRU
The Vanilla RNN Cell

h_t = tanh( W [x_t ; h_{t-1}] )
The Vanilla RNN Forward

Unrolled over time steps t = 1, 2, 3, ...:
h_t = tanh( W [x_t ; h_{t-1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

The same weight matrix W is shared across all time steps.
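To make the forward pass concrete, here is a minimal NumPy sketch of the recurrence above. This is not the presenter's code; the function name, the shapes, and the omission of a bias term are illustrative assumptions.

```python
import numpy as np

# Minimal vanilla RNN forward pass: h_t = tanh(W [x_t; h_{t-1}]).
# Shapes and the missing bias term are illustrative assumptions.
def rnn_forward(xs, h0, W):
    """xs: list of input vectors (length T), h0: initial state (n,), W: (n, d+n)."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W @ np.concatenate([x, h]))  # new hidden state
        hs.append(h)
    return hs

# Illustrative usage with random data.
d, n, T = 4, 3, 5
W = 0.1 * np.random.randn(n, d + n)
hs = rnn_forward([np.random.randn(d) for _ in range(T)], np.zeros(n), W)
```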
The Vanilla RNN Backward

Gradients of the per-step losses C_t are propagated back through the unrolled network (backpropagation through time) and accumulate into the shared weights W at every step.
The Popular LSTM Cell

The cell state c_t is controlled by three gates, each computed from the current input x_t and the previous output h_{t-1} (⊙ denotes elementwise multiplication):

f_t = σ( W_f [x_t ; h_{t-1}] + b_f )   (Forget Gate)
Similarly for i_t (Input Gate) and o_t (Output Gate)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh( W [x_t ; h_{t-1}] )
h_t = o_t ⊙ tanh( c_t )

* Dashed line in the original figure indicates a time-lag (the use of c_{t-1}).
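For concreteness, here is a minimal NumPy sketch of one LSTM step following the equations above. It is a sketch, not the presenter's code: the function and parameter names are illustrative, and explicit biases for i_t and o_t are assumed by analogy with b_f.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step following the slide's equations; every weight matrix acts on the
# stacked vector [x_t; h_{t-1}]. Biases for i_t and o_t are assumed, mirroring b_f.
def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, W, bf, bi, bo, b):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + bf)   # forget gate
    i = sigmoid(Wi @ z + bi)   # input gate
    o = sigmoid(Wo @ z + bo)   # output gate
    g = np.tanh(W @ z + b)     # candidate cell content
    c = f * c_prev + i * g     # c_t = f_t * c_{t-1} + i_t * tanh(W [x_t; h_{t-1}])
    h = o * np.tanh(c)         # h_t = o_t * tanh(c_t)
    return h, c
```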
LSTM – Forward/Backward

Go to: Illustrated LSTM Forward and Backward Pass
Class Exercise

Consider the problem of translating English to French.
E.g. "What is your name" → "Comment tu t'appelles"
Is an architecture that emits one French word per English word, time step by time step, suitable for this problem?

No: sentences might be of different lengths and the words might not align. The model needs to see the entire sentence before translating.

Adapted from http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
Class Exercise

Since sentences can differ in length and the words need not align, the model must read the entire English sentence before producing the French one. A suitable architecture first encodes the whole input sequence and only then decodes the output sequence.
The input-output nature depends on the structure of the problem at hand.

Sequence to Sequence Learning with Neural Networks, Sutskever et al., 2014
Multi-layer RNNs

We can of course design RNNs with multiple hidden layers: the hidden states of one layer serve as the input sequence of the next.
Think exotic: skip connections across layers, across time, ...
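A small NumPy sketch of the stacking idea, assuming each layer is a vanilla RNN whose hidden states feed the layer above; names and shapes are illustrative, not from the slides.

```python
import numpy as np

# Stacked (multi-layer) RNN sketch: the hidden-state sequence produced by one
# layer becomes the input sequence of the next layer. Vanilla RNN cells assumed.
def stacked_rnn_forward(xs, h0s, Ws):
    seq = xs
    for h0, W in zip(h0s, Ws):          # one (initial state, weights) pair per layer
        h, outputs = h0, []
        for x in seq:
            h = np.tanh(W @ np.concatenate([x, h]))
            outputs.append(h)
        seq = outputs                   # feed this layer's states to the next layer
    return seq                          # hidden states of the top layer
```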
Bi-directional RNNs

RNNs can process the input sequence in the forward and in the reverse direction, producing an output y_t at every time step from the hidden states of both directions.

Popular in speech recognition
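A minimal NumPy sketch of the bi-directional idea, assuming a vanilla RNN cell in each direction and concatenation of the two per-step hidden states; the combination rule and all names are illustrative assumptions.

```python
import numpy as np

# Bi-directional RNN sketch: one vanilla RNN reads the sequence left-to-right,
# another reads it right-to-left; per-step outputs concatenate both states.
def birnn_forward(xs, h0_fwd, h0_bwd, W_fwd, W_bwd):
    def run(seq, h, W):
        hs = []
        for x in seq:
            h = np.tanh(W @ np.concatenate([x, h]))
            hs.append(h)
        return hs

    hs_fwd = run(xs, h0_fwd, W_fwd)
    hs_bwd = run(xs[::-1], h0_bwd, W_bwd)[::-1]   # restore original time order
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_fwd, hs_bwd)]
```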
Recap

RNNs allow for processing of variable-length inputs and outputs by maintaining state information across time steps.
Various input-output scenarios are possible (single/multiple inputs and outputs).
RNNs can be stacked, or made bi-directional.
Vanilla RNNs are improved upon by LSTMs, which address the vanishing gradient problem through the CEC (constant error carousel).
Exploding gradients are handled by gradient clipping: rescale the gradient whenever its norm exceeds a threshold (a minimal sketch follows).
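The sketch below shows one common form of gradient clipping, clipping by global norm; the threshold value and the function name are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Gradient clipping by global norm: if the combined gradient norm exceeds a
# threshold, rescale all gradients so the norm equals the threshold.
def clip_gradients(grads, max_norm=5.0):
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```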
The Popular LSTM Cell (recap)

Repeated here for comparison with the peephole extension below. All gates see only x_t and h_{t-1}:
f_t = σ( W_f [x_t ; h_{t-1}] + b_f ), and similarly for i_t and o_t.

* Dashed line in the original figure indicates a time-lag.
Extension I: Peephole LSTM

The gates additionally peek at the cell state: i_t and f_t see c_{t-1}, while o_t sees the updated c_t.

f_t = σ( W_f [x_t ; h_{t-1} ; c_{t-1}] + b_f )
Similarly for i_t; o_t uses c_t instead of c_{t-1}.

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh( W [x_t ; h_{t-1}] )
h_t = o_t ⊙ tanh( c_t )

* Dashed line in the original figure indicates a time-lag.
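A minimal NumPy sketch of one peephole step, under the common assumption of diagonal (per-cell) peephole weight vectors pf, pi, po; all names and the placement of biases are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One peephole LSTM step: i_t and f_t additionally see c_{t-1}, o_t sees the
# updated c_t. Diagonal (per-cell) peephole weights pf, pi, po are assumed.
def peephole_lstm_step(x, h_prev, c_prev,
                       Wf, Wi, Wo, W, pf, pi, po, bf, bi, bo, b):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + pf * c_prev + bf)   # forget gate peeks at c_{t-1}
    i = sigmoid(Wi @ z + pi * c_prev + bi)   # input gate peeks at c_{t-1}
    g = np.tanh(W @ z + b)                   # candidate cell content
    c = f * c_prev + i * g
    o = sigmoid(Wo @ z + po * c + bo)        # output gate peeks at c_t
    h = o * np.tanh(c)
    return h, c
```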
Peephole LSTM

Without peepholes, the gates can only see the output from the previous time step, which is close to 0 if the output gate is closed, even though these gates control the CEC cell. Peephole connections let the gates observe the cell state directly.

This helped the LSTM learn better timing for the problems tested: spike timing and counting spike time delays.

Recurrent nets that time and count, Gers et al., 2000
Other minor variants

Coupled Input and Forget Gate: use a single gate for both, f_t = 1 - i_t.
Full Gate Recurrence: the gates additionally receive the previous gate activations i_{t-1}, f_{t-1}, o_{t-1} as inputs.
LSTM: A Search Space Odyssey

Tested the following variants, using the peephole LSTM as the standard:
1. No Input Gate (NIG)
2. No Forget Gate (NFG)
3. No Output Gate (NOG)
4. No Input Activation Function (NIAF)
5. No Output Activation Function (NOAF)
6. No Peepholes (NP)
7. Coupled Input and Forget Gate (CIFG)
8. Full Gate Recurrence (FGR)

On the tasks of:
TIMIT Speech Recognition: audio frame to 1 of 61 phonemes
IAM Online Handwriting Recognition: sketch to characters
JSB Chorales: next-step music frame prediction

LSTM: A Search Space Odyssey, Greff et al., 2015
LSTM: A Search Space Odyssey

The standard LSTM performed reasonably well on multiple datasets, and none of the modifications significantly improved performance.
Coupling the gates and removing peephole connections simplified the LSTM without hurting performance much.
The forget gate and the output activation function are crucial.
The interaction between learning rate and network size was found to be minimal, which indicates that calibration can be done using a small network first.

LSTM: A Search Space Odyssey, Greff et al., 2015
Gated Recurrent Unit (GRU)

A very simplified version of the LSTM:
Merges the forget and input gates into a single 'update' gate
Merges the cell state and hidden state

Has fewer parameters than an LSTM and has been shown to outperform the LSTM on some tasks.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., 2014
GRU

r_t = σ( W_r [x_t ; h_{t-1}] + b_r )   (Reset Gate)
z_t = σ( W_z [x_t ; h_{t-1}] + b_z )   (Update Gate)

h'_t = tanh( W [x_t ; r_t ⊙ h_{t-1}] )
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h'_t
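A minimal NumPy sketch of one GRU step following the equations above; the function and parameter names are illustrative, and the bias terms b_r, b_z, b are assumptions reconstructed from the slide's flattened formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step following the slide's equations. The reset gate r_t rescales the
# previous state before the candidate h'_t is computed; the update gate z_t
# interpolates between the old state and the candidate.
def gru_step(x, h_prev, Wr, Wz, W, br, bz, b):
    xh = np.concatenate([x, h_prev])
    r = sigmoid(Wr @ xh + br)                                   # reset gate
    z = sigmoid(Wz @ xh + bz)                                   # update gate
    h_cand = np.tanh(W @ np.concatenate([x, r * h_prev]) + b)   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                      # new hidden state
```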
An Empirical Exploration of Recurrent Network Architectures

Given the rather ad-hoc design of the LSTM, the authors try to determine if the architecture of the LSTM is optimal.
They use an evolutionary search for better architectures.

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015
Evolutionary Architecture Search

A list of the top-100 architectures found so far is maintained, initialized with the LSTM and the GRU.
The GRU is considered the baseline to beat.
New architectures are proposed and retained based on their performance ratio with the GRU.
All architectures are evaluated on 3 problems:
Arithmetic: compute the digits of the sum or difference of two numbers provided as input. Inputs contain distractor characters to increase difficulty, e.g. 3e36d9-h1h39f94eeh43keg3c encodes 3369 - 13994433 = -13991064.
XML Modeling: predict the next character in valid XML.
Penn Tree-Bank Language Modeling: predict distributions over words.

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015
Evolutionary Architecture Search

At each step:
- Select 1 architecture at random and evaluate it on 20 randomly chosen hyperparameter settings.
- Alternatively, propose a new architecture by mutating an existing one:
  Choose a probability p from [0,1] uniformly and apply a transformation to each node with probability p:
  - If the node is a non-linearity, replace it with one of {tanh(x), sigmoid(x), ReLU(x), Linear(0, x), Linear(1, x), Linear(0.9, x), Linear(1.1, x)}
  - If the node is an elementwise op, replace it with one of {multiplication, addition, subtraction}
  - Insert a random activation function between the node and one of its parents
  - Replace the node with one of its ancestors (i.e. remove the node)
  - Randomly select a node (node A); replace the current node with either the sum, product, or difference of a random ancestor of the current node and a random ancestor of A
- Add the architecture to the list based on its minimum relative accuracy w.r.t. the GRU on the 3 different tasks.

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015
Evolutionary Architecture Search

3 novel architectures are presented in the paper. They are very similar to the GRU, but slightly outperform it.
An LSTM initialized with a large positive forget gate bias outperformed both the basic LSTM and the GRU!

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015
LSTM initialized with a large positive forget gate bias?

Recall:
f_t = σ( W_f [x_t ; h_{t-1}] + b_f )
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh( W [x_t ; h_{t-1}] )
dc_{t-1} = dc_t ⊙ f_t

Gradients will vanish if f_t is close to 0. Using a large positive bias b_f ensures that f_t takes values close to 1, especially when training begins. This helps learn long-range dependencies.

Originally stated in Learning to forget: Continual prediction with LSTM, Gers et al., 2000, but forgotten over time.

An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015
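A minimal sketch of the initialization trick, assuming hypothetical layer sizes and a bias value around 1; the exact value used in the paper is not restated here.

```python
import numpy as np

# Forget-gate bias initialization sketch (sizes are illustrative): starting b_f
# at a large positive value makes f_t = sigmoid(...) close to 1 early in
# training, so dc_{t-1} = dc_t * f_t does not vanish across many time steps.
n, d = 128, 64
Wf = 0.1 * np.random.randn(n, d + n)   # forget-gate weights on [x_t; h_{t-1}]
bf = np.full(n, 1.0)                   # large positive bias (e.g. 1.0 or 2.0)
```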
Summary

LSTMs can be modified with peephole connections, full gate recurrence, etc., based on the specific task at hand.
Architectures like the GRU have fewer parameters than the LSTM and might perform better.
An LSTM with a large positive forget gate bias works best!
Other Useful Resources / References

http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, ICML 2013
S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997, 9(8), pp. 1735-1780
F.A. Gers and J. Schmidhuber, Recurrent nets that time and count, IJCNN 2000
K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems, 2016
K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, ACL 2014
R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent network architectures, JMLR 2015