
Understanding GRU in Neural Networks
Explore the Gated Recurrent Unit (GRU), a simpler variation of the LSTM model in neural networks. Learn about its advantages, how its properties compare with the traditional LSTM, and how it mimics brain processes.
Presentation Transcript
COMP4332/RMBI4310: GRU (Concept). Prepared and presented by Raymond Wong (raywong@cse).
RNN outline:
1. Basic RNN
2. Traditional LSTM
3. GRU
GRU: The GRU (Gated Recurrent Unit) is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM, but simpler. Before we introduce the GRU, let us look at the properties of the traditional LSTM.
GRU: Properties of the traditional LSTM. The traditional LSTM model has greater power to capture the properties of the data, so the results it generates are usually more accurate. Besides, it can remember or memorize longer sequences.
GRU: Since the structure of the GRU is simpler than that of the traditional LSTM model, it has the following advantages: the training time is shorter, and it requires fewer data points to capture the properties of the data.
GRU: Unlike the traditional LSTM model, the GRU model does not have an internal state variable (i.e., the variable st) to store the memory (i.e., a value). Instead, it uses the predicted target attribute value of the previous record (together with an internal operation called reset) as a reference for storing the memory.
GRU: Similarly, the GRU model simulates the brain process.
- Reset feature: it regards the predicted target attribute value of the previous record as a reference for storing the memory.
- Input feature: it decides the strength of the input for the model (i.e., via the activation function).
GRU (continued):
- Output feature: it combines a portion of the predicted target attribute value of the previous record with a portion of the processed input variable; the ratio of these two portions is determined by the update feature.
GRU: Our brain includes the following steps, each modelled in the GRU by a corresponding gate:
- Reset component -> reset gate
- Input activation component -> input activation gate
- Update component -> update gate
- Final output component -> final output gate
[Figure: a basic RNN unrolled over timestamps t-1, t and t+1, passing the internal state st from one timestamp to the next.]
[Figure: the traditional LSTM unrolled over timestamps t-1, t and t+1, also passing the internal state st between timestamps.]
[Figure: the GRU unrolled over timestamps t-1, t and t+1; only the output yt is passed between timestamps (no internal state st).]
[Figure: the GRU unrolled over timestamps t-1, t and t+1, highlighting the memory unit at timestamp t.]
Inside the GRU memory unit at timestamp t, the gates are computed as follows (σ denotes the sigmoid function; the numeric weights shown are the example values used later):
Reset gate: rt = σ(Wr · [xt, yt-1] + br), with Wr = [0.7, 0.3, 0.4] and br = 0.4
Input activation gate: at = tanh(Wa · [xt, rt · yt-1] + ba), with Wa = [0.2, 0.3, 0.4] and ba = 0.3
Update gate: ut = σ(Wu · [xt, yt-1] + bu), with Wu = [0.4, 0.2, 0.1] and bu = 0.5
Final output gate: yt = (1 - ut) · yt-1 + ut · at
[Figure: the GRU memory unit at timestamp t, combining the reset gate, input activation gate, update gate and final output gate, connected to the units at timestamps t-1 and t+1.]
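To make these four equations concrete, here is a minimal NumPy sketch of a single GRU forward step as defined on these slides; the function name gru_step and the scalar treatment of the output are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of one GRU forward step (scalar output y, as on the slides).
import numpy as np

def sigmoid(z):
    # Logistic sigmoid, written as sigma in the gate equations above
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(xt, y_prev, Wr, br, Wa, ba, Wu, bu):
    """Compute (rt, at, ut, yt) for input vector xt and previous output y_prev."""
    rt = sigmoid(np.dot(Wr, np.append(xt, y_prev)) + br)        # reset gate
    at = np.tanh(np.dot(Wa, np.append(xt, rt * y_prev)) + ba)   # input activation gate
    ut = sigmoid(np.dot(Wu, np.append(xt, y_prev)) + bu)        # update gate
    yt = (1.0 - ut) * y_prev + ut * at                          # final output gate
    return rt, at, ut, yt
```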
In the following, we want to compute the (weight) values in the GRU. Similar to the neural network, the GRU has two steps: Step 1 (Input Forward Propagation) and Step 2 (Error Backward Propagation). In the following, we focus on Input Forward Propagation. In the GRU, Error Backward Propagation can be handled by an existing optimization tool (as in a neural network).
Consider this example with two timestamps (t = 1 and t = 2). We use the GRU to do the training.
Time    t=1    t=2
xt,1    0.1    0.7
xt,2    0.4    0.9
y       0.3    0.5
When t = 1: [Figure: the GRU memory unit at timestamp t = 1, taking x1 and y0 = 0 as inputs and producing r1, a1, u1 and the output y1.]
Step 1 (Input Forward Propagation) for t = 1, with y0 = 0:
r1 = σ(Wr · [x1, y0] + br) = σ(0.7 · 0.1 + 0.3 · 0.4 + 0.4 · 0 + 0.4) = σ(0.59) = 0.6434
a1 = tanh(Wa · [x1, r1 · y0] + ba) = tanh(0.2 · 0.1 + 0.3 · 0.4 + 0.4 · (0.6434 · 0) + 0.3) = tanh(0.44) = 0.4136
u1 = σ(Wu · [x1, y0] + bu) = σ(0.4 · 0.1 + 0.2 · 0.4 + 0.1 · 0 + 0.5) = σ(0.62) = 0.6502
y1 = (1 - u1) · y0 + u1 · a1 = (1 - 0.6502) · 0 + 0.6502 · 0.4136 = 0.2690
Error = y1 - y = 0.2690 - 0.3 = -0.0310
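As a sanity check, the t = 1 numbers above can be reproduced with a few lines of NumPy; the variable names (x1, target1, etc.) are chosen here for illustration only.

```python
# Reproducing the t = 1 forward propagation from the slides.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Wr, br = np.array([0.7, 0.3, 0.4]), 0.4   # reset gate weights
Wa, ba = np.array([0.2, 0.3, 0.4]), 0.3   # input activation gate weights
Wu, bu = np.array([0.4, 0.2, 0.1]), 0.5   # update gate weights

x1, y0, target1 = np.array([0.1, 0.4]), 0.0, 0.3

r1 = sigmoid(np.dot(Wr, np.append(x1, y0)) + br)        # sigmoid(0.59) ~= 0.6434
a1 = np.tanh(np.dot(Wa, np.append(x1, r1 * y0)) + ba)   # tanh(0.44)    ~= 0.4136
u1 = sigmoid(np.dot(Wu, np.append(x1, y0)) + bu)        # sigmoid(0.62) ~= 0.6502
y1 = (1.0 - u1) * y0 + u1 * a1                          # ~= 0.2690
print(y1, y1 - target1)                                 # ~= 0.2690, error ~= -0.0310
```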
The computed output y1 = 0.2690 is then carried forward as the memory input for timestamp t = 2.
Step 1 (Input Forward Propagation) for t = 2, with y1 = 0.2690:
r2 = σ(Wr · [x2, y1] + br) = σ(0.7 · 0.7 + 0.3 · 0.9 + 0.4 · 0.2690 + 0.4) = σ(1.2676) = 0.7803
a2 = tanh(Wa · [x2, r2 · y1] + ba) = tanh(0.2 · 0.7 + 0.3 · 0.9 + 0.4 · 0.2099 + 0.3) = tanh(0.7940) = 0.6606, where r2 · y1 = 0.7803 · 0.2690 = 0.2099
u2 = σ(Wu · [x2, y1] + bu) = σ(0.4 · 0.7 + 0.2 · 0.9 + 0.1 · 0.2690 + 0.5) = σ(0.9869) = 0.7285
y2 = (1 - u2) · y1 + u2 · a2 = (1 - 0.7285) · 0.2690 + 0.7285 · 0.6606 = 0.5543
Error = y2 - y = 0.5543 - 0.5 = 0.0543
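The t = 2 numbers can be checked the same way, feeding the computed y1 back in as the previous output; again, the variable names are illustrative.

```python
# Reproducing the t = 2 forward propagation, with y1 carried over from t = 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Wr, br = np.array([0.7, 0.3, 0.4]), 0.4
Wa, ba = np.array([0.2, 0.3, 0.4]), 0.3
Wu, bu = np.array([0.4, 0.2, 0.1]), 0.5

x2, y1, target2 = np.array([0.7, 0.9]), 0.2690, 0.5

r2 = sigmoid(np.dot(Wr, np.append(x2, y1)) + br)        # sigmoid(1.2676) ~= 0.7803
a2 = np.tanh(np.dot(Wa, np.append(x2, r2 * y1)) + ba)   # tanh(0.7940)    ~= 0.6606
u2 = sigmoid(np.dot(Wu, np.append(x2, y1)) + bu)        # sigmoid(0.9869) ~= 0.7285
y2 = (1.0 - u2) * y1 + u2 * a2                          # ~= 0.5543
print(y2, y2 - target2)                                 # ~= 0.5543, error ~= 0.0543
```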
Similar to a neural network, the GRU could also have multiple layers and multiple memory units in each layer.
[Figure: a multi-layer RNN, where a memory unit takes inputs x1 and x2 and produces output y, and several such memory units are stacked into layers.]
[Figure: a multi-layer network with an input layer (x1 to x5), a hidden layer of memory units, and an output layer (y1 to y4).]
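The slides do not prescribe a library, but as an illustration, a multi-layer GRU with several memory units per layer could be sketched in Keras roughly as follows; the layer sizes, input shape and output dimension here are arbitrary assumptions.

```python
# Hypothetical sketch of a multi-layer GRU model in Keras (sizes are arbitrary).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5, 2)),                    # 5 timestamps, 2 input attributes per record
    tf.keras.layers.GRU(8, return_sequences=True),   # first GRU layer: 8 memory units
    tf.keras.layers.GRU(4),                          # second GRU layer: 4 memory units
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer: 1 predicted target value
])
model.summary()
```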
Final Summary about Parameter Setting
GRU model parameters:
- No. of layers
- No. of memory units in each layer
- Connection between memory units from different layers
- Optimization method (e.g., adam, SGD, rmsprop)
- Error function (e.g., binary cross entropy, mse, mae)
Training (time) parameters:
- No. of epochs (e.g., we could set no. of epochs = 150 as a stopping condition)
- Batch size (e.g., we could set batch size = 10)
Evaluation:
- Measurement, e.g., accuracy (or in short, acc)
- Training/Validation/Test split, e.g., the percentage of the data used for the validation/test set
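As a hedged illustration of how these parameter settings might look in code, here is a minimal Keras sketch; the synthetic data, layer sizes and the 20% validation split are assumptions for demonstration only, not values from the slides.

```python
# Hypothetical end-to-end sketch: model parameters, training parameters, evaluation.
import numpy as np
import tensorflow as tf

# Synthetic data: 100 sequences, 5 timestamps, 2 input attributes, binary target.
X = np.random.rand(100, 5, 2).astype("float32")
y = np.random.randint(0, 2, size=(100, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5, 2)),
    tf.keras.layers.GRU(8, return_sequences=True),   # no. of layers / memory units per layer
    tf.keras.layers.GRU(4),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Optimization method and error function (e.g., adam + binary cross entropy),
# with accuracy ("acc") as the evaluation measurement.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training parameters: no. of epochs = 150, batch size = 10,
# and a percentage of the data held out for validation.
model.fit(X, y, epochs=150, batch_size=10, validation_split=0.2)
```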