Understanding GRU in Neural Networks


Explore the Gated Recurrent Unit (GRU) as a simpler variation of the LSTM model in neural networks. Learn about its advantages, how its properties compare with those of the traditional LSTM, and how it mimics processes in the brain.

  • GRU model
  • Neural networks
  • LSTM
  • Gated Recurrent Unit
  • Brain simulation


Presentation Transcript


  1. COMP4332/RMBI4310 GRU (Concept). Prepared by Raymond Wong. Presented by Raymond Wong (raywong@cse).

  2. RNN: 1. Basic RNN, 2. Traditional LSTM, 3. GRU.

  3. GRU: The GRU (Gated Recurrent Unit) is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM, but simpler. Before we introduce the GRU, let us review the properties of the traditional LSTM.

  4. GRU: Properties of the traditional LSTM. The traditional LSTM model has greater power to capture the properties of the data, so the results it generates are usually more accurate. Besides, it can remember (memorize) longer sequences.

  5. GRU: Since the structure of the GRU is simpler than that of the traditional LSTM model, it has the following advantages: the training time is shorter, and it requires fewer data points to capture the properties of the data.

  6. GRU: Unlike the traditional LSTM model, the GRU model does not have an internal state variable (i.e., the variable s_t) for storing the memory (i.e., a value). Instead, it regards the predicted target attribute value of the previous record (together with an internal operation called "reset") as the reference for storing the memory.

  7. GRU: Similarly, the GRU model simulates the brain process. Reset feature: it regards the predicted target attribute value of the previous record as a reference for storing the memory. Input feature: it decides the strength of the input to the model (i.e., via the activation function).

  8. GRU: Output feature: it combines a portion of the predicted target attribute value of the previous record with a portion of the processed input. The ratio of these two portions is determined by the update feature.

  9. GRU: Our brain includes the following steps, each of which corresponds to a gate in the GRU: the reset component (reset gate), the input activation component (input activation gate), the update component (update gate), and the final output component (final output gate).

  10. [Diagram: a basic RNN unrolled over timestamps t-1, t, and t+1; each step takes input x_t, passes the internal state s_t to the next step, and produces output y_t.]

  11. [Diagram: the same unrolled structure with a traditional LSTM unit at each timestamp, passing the internal state s_t between steps and producing output y_t.]

  12. [Diagram: the unrolled structure with a GRU at each timestamp; there is no internal state s_t, and only the output y_t is passed to the next step.]

  13. [Diagram: the unrolled GRU, with the GRU at timestamp t highlighted as the memory unit.]

  14. Reset gate: r_t = σ(W_r [x_t, y_{t-1}] + b_r), where σ is the sigmoid function. Example parameters: W_r = [0.7, 0.3, 0.4], b_r = 0.4. [Diagram: the reset gate inside the GRU at timestamp t.]

  15. Input activation gate: a_t = tanh(W_a [x_t, r_t · y_{t-1}] + b_a). Example parameters: W_a = [0.2, 0.3, 0.4], b_a = 0.3. [Diagram: the input activation gate, which multiplies y_{t-1} by r_t before applying tanh.]

  16. Update gate: u_t = σ(W_u [x_t, y_{t-1}] + b_u). Example parameters: W_u = [0.4, 0.2, 0.1], b_u = 0.5. [Diagram: the update gate inside the GRU at timestamp t.]

  17. Final output gate: y_t = (1 - u_t) · y_{t-1} + u_t · a_t. Together, the GRU at timestamp t computes: r_t = σ(W_r [x_t, y_{t-1}] + b_r); a_t = tanh(W_a [x_t, r_t · y_{t-1}] + b_a); u_t = σ(W_u [x_t, y_{t-1}] + b_u); y_t = (1 - u_t) · y_{t-1} + u_t · a_t. [Diagram: the final output gate combining y_{t-1} and a_t in the ratio (1 - u_t) : u_t.]

  18. [Diagram: the complete GRU cell at timestamp t, showing all four equations together: r_t = σ(W_r [x_t, y_{t-1}] + b_r), a_t = tanh(W_a [x_t, r_t · y_{t-1}] + b_a), u_t = σ(W_u [x_t, y_{t-1}] + b_u), y_t = (1 - u_t) · y_{t-1} + u_t · a_t.]
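A minimal NumPy sketch of one GRU step, following the formulation on these slides (a scalar predicted output y serves as the recurrent signal, and W_r, W_a, W_u are weight vectors over [x_t, y_{t-1}]); the helper names sigmoid and gru_step are ours for illustration, not part of the course code:

    import numpy as np

    def sigmoid(z):
        # Logistic function used by the reset and update gates.
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, y_prev, W_r, b_r, W_a, b_a, W_u, b_u):
        # One GRU step as on the slides: the recurrent signal is the
        # previous predicted output y_{t-1}, not an internal state s_t.
        v = np.append(x_t, y_prev)                  # [x_t, y_{t-1}]
        r_t = sigmoid(np.dot(W_r, v) + b_r)         # reset gate
        v_reset = np.append(x_t, r_t * y_prev)      # [x_t, r_t . y_{t-1}]
        a_t = np.tanh(np.dot(W_a, v_reset) + b_a)   # input activation gate
        u_t = sigmoid(np.dot(W_u, v) + b_u)         # update gate
        y_t = (1 - u_t) * y_prev + u_t * a_t        # final output gate
        return y_t, (r_t, a_t, u_t)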

  19. In the following, we want to compute the (weight) values in the GRU. Similar to a neural network, the GRU has two steps: Step 1 (input forward propagation) and Step 2 (error backward propagation). In the following, we focus on input forward propagation. In the GRU, error backward propagation can be handled by an existing optimization tool (as in a neural network).

  20. Consider this example with two timestamps (t = 1 and t = 2); we use the GRU to do the training.
      Time    t=1    t=2
      x_t,1   0.1    0.7
      x_t,2   0.4    0.9
      y       0.3    0.5

  21. When t = 1: [Diagram: the GRU at timestamp t = 1 takes input x_1 and the previous output y_0 = 0, computes r_1, a_1, and u_1, and produces y_1.]

  22. Step 1 (Input Forward Propagation), t = 1, with y_0 = 0 and parameters W_r = [0.7, 0.3, 0.4], b_r = 0.4; W_a = [0.2, 0.3, 0.4], b_a = 0.3; W_u = [0.4, 0.2, 0.1], b_u = 0.5:
      r_1 = σ(W_r [x_1, y_0] + b_r) = σ(0.7 · 0.1 + 0.3 · 0.4 + 0.4 · 0 + 0.4) = σ(0.59) = 0.6434

  23. a_1 = tanh(W_a [x_1, r_1 · y_0] + b_a) = tanh(0.2 · 0.1 + 0.3 · 0.4 + 0.4 · (0.6434 · 0) + 0.3) = tanh(0.44) = 0.4136

  24. u_1 = σ(W_u [x_1, y_0] + b_u) = σ(0.4 · 0.1 + 0.2 · 0.4 + 0.1 · 0 + 0.5) = σ(0.62) = 0.6502

  25. y_1 = (1 - u_1) · y_0 + u_1 · a_1 = (1 - 0.6502) · 0 + 0.6502 · 0.4136 = 0.2690

  26. Error at t = 1: y_1 - y = 0.2690 - 0.3 = -0.0310

  27. The computed output y_1 = 0.2690 is carried forward as the previous output for timestamp t = 2.

  28. When t = 2, with y_1 = 0.2690:
      r_2 = σ(W_r [x_2, y_1] + b_r) = σ(0.7 · 0.7 + 0.3 · 0.9 + 0.4 · 0.2690 + 0.4) = σ(1.2676) = 0.7803

  29. a_2 = tanh(W_a [x_2, r_2 · y_1] + b_a) = tanh(0.2 · 0.7 + 0.3 · 0.9 + 0.4 · (0.7803 · 0.2690) + 0.3) = tanh(0.7940) = 0.6606

  30. u_2 = σ(W_u [x_2, y_1] + b_u) = σ(0.4 · 0.7 + 0.2 · 0.9 + 0.1 · 0.2690 + 0.5) = σ(0.9869) = 0.7285

  31. y_2 = (1 - u_2) · y_1 + u_2 · a_2 = (1 - 0.7285) · 0.2690 + 0.7285 · 0.6606 = 0.5543

  32. Error at t = 2: y_2 - y = 0.5543 - 0.5 = 0.0543
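The whole forward pass can be reproduced with the gru_step sketch given earlier; the array names and print format below are ours:

    # Parameters and data from the worked example above.
    W_r, b_r = np.array([0.7, 0.3, 0.4]), 0.4
    W_a, b_a = np.array([0.2, 0.3, 0.4]), 0.3
    W_u, b_u = np.array([0.4, 0.2, 0.1]), 0.5
    xs = [np.array([0.1, 0.4]), np.array([0.7, 0.9])]   # x_1, x_2
    targets = [0.3, 0.5]                                # target y at t = 1, 2

    y_prev = 0.0                                        # y_0 = 0
    for t, (x_t, target) in enumerate(zip(xs, targets), start=1):
        y_t, (r_t, a_t, u_t) = gru_step(x_t, y_prev, W_r, b_r, W_a, b_a, W_u, b_u)
        print(f"t={t}: r={r_t:.4f} a={a_t:.4f} u={u_t:.4f} "
              f"y={y_t:.4f} error={y_t - target:+.4f}")
        y_prev = y_t
    # Prints (matching the slides):
    # t=1: r=0.6434 a=0.4136 u=0.6502 y=0.2690 error=-0.0310
    # t=2: r=0.7803 a=0.6606 u=0.7285 y=0.5543 error=+0.0543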

  33. Similar to a neural network, the GRU model can also have multiple layers, with multiple memory units in each layer.

  34. [Diagram: Multi-layer RNN. Inputs x1 and x2 feed an RNN that produces output y; the RNN is then drawn as a memory unit between the inputs and the output.]

  35. [Diagram: Multi-layer RNN. The same network redrawn with inputs x1, x2 and output y.]

  36. [Diagram: Multi-layer RNN. The network redrawn again with inputs x1, x2 and output y, building up to the layered picture on the next slide.]

  37. [Diagram: Multi-layer RNN with an input layer (x1 to x5), a hidden layer, and an output layer (y1 to y4).]
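A sketch of such a multi-layer GRU model in Keras, under our own assumptions: input sequences with 2 timestamps and 2 features (as in the worked example) and illustrative layer sizes. Note that Keras's built-in GRU layer uses the standard hidden-state formulation rather than the slides' predicted-output recurrence.

    from tensorflow import keras
    from tensorflow.keras import layers

    # Two GRU layers (the hidden layers of memory units) followed by a
    # Dense output layer, mirroring the multi-layer picture above.
    model = keras.Sequential([
        layers.GRU(8, return_sequences=True, input_shape=(2, 2)),  # 8 memory units, pass the whole sequence on
        layers.GRU(4),                                             # 4 memory units, return only the last output
        layers.Dense(1),                                           # output layer: one predicted value
    ])
    model.summary()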

  38. Final Summary about Parameter Setting
      GRU model parameters: no. of layers; no. of memory units in each layer; connections between memory units from different layers.
      Optimization method: adam, SGD, rmsprop.
      Error function: binary cross entropy, mse, mae.
      Training (time) parameters: no. of epochs (e.g., we could set no. of epochs = 150 as a stopping condition); batch size (e.g., batch size = 10).
      Evaluation measurement: e.g., accuracy (or, in short, acc).
      Training/Validation/Test split: e.g., the percentage of the data used for the validation/test set.
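How these settings might be passed to the Keras model sketched above; the data arrays X and Y are placeholders, and the loss/metric choice depends on whether the target is numeric or binary:

    import numpy as np

    # Placeholder training data: 100 samples x 2 timestamps x 2 features.
    X = np.random.rand(100, 2, 2)
    Y = np.random.rand(100, 1)

    model.compile(optimizer='adam',          # or 'sgd' / 'rmsprop'
                  loss='mse',                # or 'mae'; 'binary_crossentropy' for a binary target
                  metrics=['mae'])           # use ['accuracy'] ('acc') for a classification target
    history = model.fit(X, Y,
                        epochs=150,          # stopping condition: no. of epochs = 150
                        batch_size=10,       # batch size = 10
                        validation_split=0.2)  # percentage of the data for the validation set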
