Easy Data Augmentation for Language Models
Data augmentation plays a crucial role in enhancing model performance, especially for tasks like sentiment analysis, topic labeling, and language detection. By generating more training data and reducing overfitting, techniques like Synonym Replacement, Random Insertion, Random Swap, and Random Deletion can significantly boost text classification accuracy. This approach is particularly beneficial for smaller datasets and tasks such as text clustering. Despite some challenges and limitations, language models trained with data augmentation show promising results in improving accuracy.
Data Augmentation
Benefits: (1) produces more training data, which is especially valuable when the dataset is small; (2) reduces overfitting.
Text tasks: sentiment analysis, topic labeling, language detection, intent detection.
EDA: Easy Data Augmentation for Boosting Performance on Text Classification
Four operations: Synonym Replacement (SR), Random Insertion (RI), Random Swap (RS), Random Deletion (RD). For a sentence of length l, the number of words to change is n = αl, where α is the fraction of words altered per operation (see the sketch below).
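As a rough illustration, here is a minimal Python sketch of the four operations. The SYNONYMS table is a toy placeholder of our own; the EDA paper looks up synonyms in WordNet (e.g. via nltk.corpus.wordnet).

```python
import random

# Toy synonym table for illustration only; EDA uses WordNet lookups.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "film": ["movie", "picture"],
}

def synonym_replacement(words, n):
    """SR: replace up to n words that have a synonym with a random synonym."""
    new_words = words[:]
    candidates = [i for i, w in enumerate(new_words) if w in SYNONYMS]
    random.shuffle(candidates)
    for i in candidates[:n]:
        new_words[i] = random.choice(SYNONYMS[new_words[i]])
    return new_words

def random_insertion(words, n):
    """RI: insert a synonym of a random word at a random position, n times."""
    new_words = words[:]
    for _ in range(n):
        candidates = [w for w in new_words if w in SYNONYMS]
        if not candidates:
            break
        syn = random.choice(SYNONYMS[random.choice(candidates)])
        new_words.insert(random.randrange(len(new_words) + 1), syn)
    return new_words

def random_swap(words, n):
    """RS: swap the positions of two randomly chosen words, n times."""
    new_words = words[:]
    for _ in range(n):
        i, j = random.randrange(len(new_words)), random.randrange(len(new_words))
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p):
    """RD: delete each word independently with probability p."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]  # keep at least one word

sentence = "the quick film made me happy".split()
alpha = 0.1
n = max(1, int(alpha * len(sentence)))  # n = alpha * l
print(synonym_replacement(sentence, n))
print(random_deletion(sentence, alpha))
```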
Benchmarks: SST-2, CR, SUBJ, TREC, PC. We hypothesize that EDA is more helpful for smaller datasets. Our experiments use SST-2 and a text-clustering dataset of 200 different texts related to games and sports.
Our Approach
Synonym Replacement; Random Insertion; Random Swap; Random Deletion; Random Deletion + Synonym Replacement; Random "not" + opposite.
Parameters: α = 0.05, num_aug = 16 (augmented sentences generated per original).
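To show how these parameters drive augmentation, here is a sketch of a wrapper (our own construction, reusing the EDA functions sketched above): α sets how many words each operation touches, and num_aug sets how many augmented sentences come out of each original.

```python
import random

def eda(sentence, alpha=0.05, num_aug=16):
    """Generate num_aug augmented variants of one sentence.

    Depends on synonym_replacement, random_insertion, random_swap,
    and random_deletion from the sketch above."""
    words = sentence.split()
    n = max(1, int(alpha * len(words)))  # n = alpha * l
    ops = [
        lambda w: synonym_replacement(w, n),
        lambda w: random_insertion(w, n),
        lambda w: random_swap(w, n),
        lambda w: random_deletion(w, alpha),
    ]
    # Apply a randomly chosen operation num_aug times.
    return [" ".join(random.choice(ops)(words)) for _ in range(num_aug)]

augmented = eda("the quick film made me happy")
```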
Language Model
Language modeling is a core problem behind many natural language processing tasks, such as speech-to-text, conversational systems, and text summarization. A trained language model learns the likelihood of a word occurring given the sequence of words that precedes it.
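In standard notation (our addition; the slide states this only in words), a language model factorizes the probability of a sentence of T words into per-word conditional probabilities:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

Training typically minimizes the negative log of this product (cross-entropy) over the corpus.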
Language Modeling
We used an LSTM to predict the next word and trained it on two datasets, comparing accuracy with and without augmentation. The model's predictions were reasonable considering that we ran it for only 5 epochs.
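The slides do not name a framework, so the following is a minimal next-word LSTM sketch assuming PyTorch; the layer sizes, batch shape, and vocab_size are illustrative placeholders, not values from the project.

```python
import torch
import torch.nn as nn

class NextWordLSTM(nn.Module):
    """Minimal LSTM language model: embeds tokens, runs an LSTM,
    and projects hidden states back to vocabulary logits."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)  # (batch, seq_len, vocab_size)

# One training step: predict token t+1 from tokens 0..t.
vocab_size = 10_000  # placeholder vocabulary size
model = NextWordLSTM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, vocab_size, (8, 20))  # dummy token ids
inputs, targets = batch[:, :-1], batch[:, 1:]
optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```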
Language Model after 5 epochs
SST-2 | Text-Clustering of games data
3.8 | 6.1
3.5 | 6.2
4.0 | 6.3
Challenges
Our augmentation approach improves accuracy, but only modestly. As for limitations, we used only part of our datasets, and we ran the LM code for just 5 epochs because it takes a long time to execute.