Exploring RNNs and CNNs for Sequence Modelling: A Dive into Recent Trends and TCN Models

Today's presentation will delve into the comparison between RNNs and CNNs for various tasks, discuss a state-of-the-art approach for Sequence Modelling, and explore augmented RNN models. The discussion will include empirical evaluations, baseline model choices for tasks like text classification and music note prediction, recent trends in Sequence Modelling showcasing alternatives to RNNs, and insights into Temporal Convolutional Networks (TCNs) with features like dilated convolutions and residual connections.



Presentation Transcript


  1. Outline for today's presentation
  - We will see how RNNs and CNNs compare on a variety of tasks.
  - Then, we will go through a new approach for Sequence Modelling that has become state of the art.
  - Finally, we will look at a few augmented RNN models.

  2. RNNs vs CNNs: An Empirical Evaluation of Generic Networks for Sequence Modelling

  3. Let's say you are given a sequence modelling task such as text classification or music note prediction, and you are asked to develop a simple model. What would your baseline model be based on: RNNs or CNNs?

  4. Recent Trends in Sequence Modelling
  - Sequence modelling is widely considered RNNs' home turf, but recent research has shown otherwise:
  - Speech Synthesis: WaveNet uses dilated convolutions for synthesis.
  - Char-to-Char Machine Translation: ByteNet uses an encoder-decoder architecture and dilated convolutions; tested on an English-German dataset.
  - Word-to-Word Machine Translation: hybrid CNN-LSTM on English-Romanian and English-French datasets.
  - Character-level Language Modelling: ByteNet on the WikiText dataset.
  - Word-level Language Modelling: Gated CNNs on the WikiText dataset.

  5. Temporal Convolutional Network (TCN)
  - A model that uses best practices in convolutional network design.
  - Properties of the TCN:
    - Causal: there is no information leakage from the future to the past.
    - Memory: it can look very far into the past for prediction/synthesis.
    - Input: it can take a sequence of arbitrary length, with proper tuning to the particular task.
    - Simple: it uses no gating mechanism and no complex stacking mechanism, and each layer's output has the same length as its input.
  - Components of the TCN: 1-D dilated convolutions and residual connections (a minimal sketch of the causal convolution idea follows below).
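
  A minimal sketch of the causality property, assuming PyTorch; the class and variable names are illustrative, not from the slides or the paper. Padding only on the left of the time axis means the output at step t never depends on inputs after t, while the output keeps the same length as the input.

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class CausalConv1d(nn.Module):
      """1-D convolution that pads the past only, so no information leaks from the future."""
      def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
          super().__init__()
          # (kernel_size - 1) * dilation steps of left padding keep the output causal
          # and the same length as the input.
          self.left_pad = (kernel_size - 1) * dilation
          self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

      def forward(self, x):                                # x: (batch, channels, time)
          x = F.pad(x, (self.left_pad, 0))                 # pad the time axis on the left only
          return self.conv(x)

  # Usage: the output has the same time length as the input.
  y = CausalConv1d(1, 16, kernel_size=3, dilation=1)(torch.randn(8, 1, 100))
  print(y.shape)  # torch.Size([8, 16, 100])
  ```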

  6. TCN - Dilated Convolutions
  - [Figures: 1-D convolutions vs. 1-D dilated convolutions. Source: WaveNet]
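
  A sketch of why dilation matters, assuming PyTorch and WaveNet-style exponentially increasing dilations (1, 2, 4, ...); the function and variable names are illustrative. Each additional level doubles the dilation, so the history visible to the top layer grows exponentially with depth.

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  def dilated_causal_stack(x, convs):
      """x: (batch, channels, time); convs: Conv1d layers with increasing dilation."""
      for conv in convs:
          dilation = conv.dilation[0]
          left_pad = (conv.kernel_size[0] - 1) * dilation   # pad the past only: causal
          x = F.relu(conv(F.pad(x, (left_pad, 0))))
      return x

  channels, levels = 16, 4
  convs = nn.ModuleList(
      nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)   # dilations 1, 2, 4, 8
      for i in range(levels)
  )
  out = dilated_causal_stack(torch.randn(1, channels, 64), convs)
  print(out.shape)  # torch.Size([1, 16, 64]); each output step sees a 16-step history window
  ```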

  7. TCN - Residual Connections
  - [Figures: residual block of the TCN; example of a residual connection in the TCN]
  - Layers learn a modification to the identity mapping rather than the full transformation.
  - This has been shown to be very useful for very deep networks (see the sketch below).
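
  A minimal sketch of a TCN-style residual block, assuming PyTorch; the exact layer sizes, dropout rate, and names are illustrative, not the paper's configuration. Two dilated causal convolutions with ReLU and dropout are added to a skip path, and an optional 1x1 convolution matches channel counts on that path.

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class ResidualBlock(nn.Module):
      def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1, dropout=0.2):
          super().__init__()
          self.pad = (kernel_size - 1) * dilation            # left padding keeps causality
          self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
          self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
          self.dropout = nn.Dropout(dropout)
          # 1x1 convolution so the skip connection matches the output channel count.
          self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

      def forward(self, x):                                   # x: (batch, channels, time)
          out = self.dropout(F.relu(self.conv1(F.pad(x, (self.pad, 0)))))
          out = self.dropout(F.relu(self.conv2(F.pad(out, (self.pad, 0)))))
          # Residual sum: the block learns a modification to the identity mapping.
          return F.relu(out + self.downsample(x))

  block = ResidualBlock(in_ch=1, out_ch=32, kernel_size=3, dilation=2)
  print(block(torch.randn(4, 1, 128)).shape)  # torch.Size([4, 32, 128])
  ```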

  8. TCN - Weight Normalization
  - Shortcomings of Batch Normalization:
    - It needs two passes over the input: one to compute the batch statistics and one to normalize.
    - It takes a significant amount of time to compute for each batch.
    - It depends on the batch size, so it is not very useful when the batch size is small.
    - It cannot be used when training in an online setting.
  - Weight Normalization instead normalizes the weights rather than the activations, so it does not depend on batch statistics and can be applied per training example.
  - The main aim is to decouple the magnitude and the direction of the weight vector: w = (g / ||v||) * v, where g is a scalar magnitude and v is the direction vector.
  - It has been shown to be faster than Batch Norm (see the usage sketch below).
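
  A usage sketch, assuming PyTorch's torch.nn.utils.weight_norm; the layer sizes are illustrative. The wrapper reparametrizes the layer's weight as w = g * v / ||v||, so the magnitude g and direction v are trained separately.

  ```python
  import torch
  import torch.nn as nn
  from torch.nn.utils import weight_norm

  conv = weight_norm(nn.Conv1d(in_channels=16, out_channels=16, kernel_size=3))

  # The original 'weight' parameter is replaced by two trainable tensors: weight_g
  # (magnitude) and weight_v (direction); 'weight' is recomputed from them on each forward pass.
  print([name for name, _ in conv.named_parameters()])  # ['bias', 'weight_g', 'weight_v']

  y = conv(torch.randn(2, 16, 50))
  print(y.shape)  # torch.Size([2, 16, 48])  (no padding applied in this standalone example)
  ```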

  9. TCN - Advantages/Disadvantages
  - [+] Parallelism: each layer of a CNN can be computed in parallel across time steps.
  - [+] Receptive field size: can be easily increased by increasing the filter length, the dilation factor, or the depth (see the calculation below).
  - [+] Stable gradients: uses residual connections and dropout.
  - [+] Storage (train): the memory footprint is smaller than that of RNNs.
  - [+] Sequence length: can easily be adapted to variable input lengths.
  - [-] Storage (test): during testing, it requires more memory, since a TCN must keep the full effective history rather than a single hidden state.
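
  A small worked calculation of the receptive field, assuming one dilated causal convolution per level with dilations 1, 2, 4, ..., 2^(n-1); a real implementation may stack the layers differently (e.g. two convolutions per residual block), so the exact numbers are illustrative. Each layer adds (k - 1) * dilation steps of history, so the field grows exponentially with depth and linearly with kernel size.

  ```python
  def receptive_field(kernel_size: int, num_levels: int) -> int:
      """Receptive field of a stack of causal convolutions with dilations 1, 2, 4, ..."""
      field = 1
      for level in range(num_levels):
          field += (kernel_size - 1) * (2 ** level)
      return field

  for k, n in [(3, 4), (3, 8), (7, 8)]:
      print(f"kernel={k}, levels={n} -> receptive field {receptive_field(k, n)}")
  # kernel=3, levels=4 -> receptive field 31
  # kernel=3, levels=8 -> receptive field 511
  # kernel=7, levels=8 -> receptive field 1531
  ```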

  10. Experimental Setup
  - TCN: the filter size, dilation factor, and number of layers are chosen so that the receptive field covers the entire input sequence.
  - Vanilla RNN/LSTM/GRU: the hidden size and number of layers are chosen to give roughly the same number of parameters as the TCN.
  - For both models, a hyperparameter search was used over (a setup sketch follows below):
    - Gradient clipping in [0.3, 1]
    - Dropout in [0, 0.5]
    - Optimizers: SGD / RMSProp / AdaGrad / Adam
    - Weight initialization: Gaussian N(0, 0.01)
    - Exponential dilation (for the TCN)
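
  A sketch of this kind of training setup, assuming PyTorch: Gaussian N(0, 0.01) weight initialization, one optimizer from the listed set, and gradient clipping. The stand-in model, learning rate, and clip value are illustrative placeholders, not the values reported in the paper.

  ```python
  import torch
  import torch.nn as nn

  model = nn.Conv1d(1, 16, kernel_size=3)          # stand-in for a TCN or RNN baseline

  # Gaussian N(0, 0.01) initialization of the weight matrices.
  for p in model.parameters():
      if p.dim() > 1:
          nn.init.normal_(p, mean=0.0, std=0.01)

  optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # one of SGD/RMSProp/AdaGrad/Adam

  x, target = torch.randn(8, 1, 100), torch.randn(8, 16, 98)
  loss = nn.functional.mse_loss(model(x), target)
  loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)  # clipping in [0.3, 1]
  optimizer.step()
  ```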

  11. Datasets
  - Adding Problem
    - Serves as a stress test for sequence models.
    - Consists of an input of length n and depth 2: the first dimension holds values sampled uniformly from [0, 1], and the second dimension has 1s at exactly two positions; the goal is to output the sum of the two marked values (a data-generation sketch follows below).
  - Sequential MNIST and P-MNIST
    - Tests the ability to remember the distant past.
    - Consists of a 784x1 flattened MNIST digit image for digit classification; P-MNIST has the pixel values permuted.
  - Copy Memory
    - Tests the memory capacity of the model.
    - Consists of an input of length n+20: the first 10 digits are randomly selected from 1 to 8, the last 10 digits are 9, and everything else is 0.
    - The goal is to copy the first 10 values to the last ten placeholder positions.
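
  A sketch of generating one adding-problem example as described above, assuming NumPy; the function and variable names are illustrative, not from the paper's code.

  ```python
  import numpy as np

  def adding_problem_example(n: int, rng: np.random.Generator):
      values = rng.uniform(0.0, 1.0, size=n)         # first dimension: random values in [0, 1]
      markers = np.zeros(n)                           # second dimension: 1s at two positions only
      i, j = rng.choice(n, size=2, replace=False)
      markers[i] = markers[j] = 1.0
      x = np.stack([values, markers])                 # input of shape (2, n)
      y = values[i] + values[j]                       # target: sum of the two marked values
      return x, y

  rng = np.random.default_rng(0)
  x, y = adding_problem_example(n=20, rng=rng)
  print(x.shape, round(float(y), 3))                  # (2, 20) and a scalar target
  ```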

  12. Datasets (continued)
  - Polyphonic Music
    - Consists of sequences whose elements are 88-dimensional vectors over the piano keys.
    - The goal is to predict the next element (the keys pressed) in the sequence.
  - Penn Treebank (PTB)
    - A small language-modelling dataset for both word- and character-level modelling.
    - Consists of 5059K characters or 888K words for training.
  - WikiText-103
    - Consists of 28K Wikipedia articles (103M words of training data) for word-level language modelling.
  - LAMBADA
    - Tests the ability to capture longer and broader contexts.
    - Consists of 10K passages extracted from novels and serves as a QnA-style dataset.

  13. Results

  14. Performance Analysis: TCN vs. TCN with Gating Mechanism

  15. Inferences
  Inferences are made in the following categories:
  1. Memory
    - The copy memory task was designed to check the propagation of information: TCNs achieve almost 100% accuracy, whereas RNNs fail at higher sequence lengths.
    - The LAMBADA dataset was designed to test local and broader contexts: TCNs again outperform all of their recurrent counterparts.
  2. Convergence
    - In almost all of the tasks, TCNs converged faster than RNNs; the extent of parallelism possible is one explanation.
  - Conclusion: given enough research, TCNs can outperform state-of-the-art RNN models.

  16. To build a simple Sequence Modelling network, what would you choose: RNNs or CNNs?
