Text Mining


Learn about text mining and its challenges, explore different techniques and tools, and see a case study using Python for sentiment analysis and text classification.


Uploaded on Dec 21, 2023 | 0 Views





Presentation Transcript


  1. datascience-project.eu
     Text Mining
     By [partner]
     The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

  2. Index
     Unit 1: Introduction
       1. What is text mining?
       2. Text mining challenges
       3. Text mining processing flow
     Unit 2: Text mining techniques
       1. Typical text mining techniques
       2. Techniques for data preparation and transformation
       3. Text representation
       4. Text classification
       5. Introduction to topic models and BERT
       6. Sentiment analysis and opinion mining
     Unit 3: Case Study with Python
       1. Common Python libraries for text mining tasks
       2. Using the NLTK library for text mining
       3. Sentiment analysis exemplified using the bag-of-words method and the NLTK library
       4. Text classification using Naïve Bayes

  3. Unit 1: Introduction
     What is Text Mining?
     Text mining is a confluence of natural language processing, data mining, machine learning, and statistics used to mine knowledge from unstructured text.
     Generally speaking, text mining can be classified into two types:
     - The user's questions are clear and specific, but the user does not know the answers.
     - The user knows only the general aim and has no specific, well-defined questions.

  4. Unit 1: Introduction
     Text Mining Challenges
     - Natural language text is unstructured. Most data mining methods handle structured or semi-structured data, so analyzing and modeling unstructured natural language text is challenging.
     - Text data mining is de facto an integrated technology of natural language processing, pattern classification, and machine learning.
     - The theoretical system of natural language processing has not yet been fully established.
     The main difficulties in text mining arise from:
     - noise and ill-formed expressions,
     - ambiguous expressions in the text,
     - the difficulty of collecting and annotating samples to train machine learning methods,
     - the difficulty of expressing the purpose and requirements of text mining.

  5. Unit 1: Introduction
     Text Mining Processing Flow
     Text mining performs some general tasks to effectively mine texts, documents, books, and comments:
     - Gather data
     - Data preparation and transformation
     - Index data
     - Text mining analysis
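
The "index data" task can be illustrated with an inverted index, which maps each term to the documents that contain it. The slide does not say how indexing is done, so this is only a minimal sketch under that assumption; the function names and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents.
docs = {
    1: "Text mining extracts knowledge from text.",
    2: "Data mining handles structured data.",
    3: "Sentiment analysis is a text mining task.",
}
index = build_inverted_index(docs)
print(search(index, "text mining"))  # [1, 3]
```

Looking up a query then only touches the posting lists of its terms instead of scanning every document.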

  6. Unit 2: Text mining techniques
     Typical text mining techniques
     Text mining is a research field crossing multiple technologies and techniques:
     - Text classification methods assign a given text to predefined text types.
     - Text clustering techniques divide a given text set into different categories.
     - Topic models are statistical models used to mine the topics and concepts hidden behind words in text.
     - Text sentiment analysis (text opinion mining) reveals the subjective information expressed by a text's author, that is, the author's viewpoint and attitude. The text is classified based on the attitudes it expresses or on judgments of its positive or negative polarity.

  7. Unit 2: Text mining techniques
     Typical text mining techniques (2)
     - Topic detection refers to the mining and screening of text topics (hot topics); it is used in public opinion analysis, social media computing, and personalized information services.
     - Information extraction refers to the extraction of factual information, such as entities, entity attributes, relationships between entities, and events, from unstructured and semi-structured natural language text, and its conversion into structured data output.
     - Automatic text summarization automatically generates summaries using natural language processing methods.

  8. Unit 2: Text mining techniques
     Techniques for Data Preparation and Transformation
     - Tokenization is the process of segmenting a given text into lexical units.
     - Removing stop words: stop words are mainly function words, including auxiliary words, prepositions, conjunctions, modal words, and other high-frequency words.
     - Word form normalization improves the efficiency of text processing. It includes two basic concepts:
       - Lemmatization: restoring inflected words to their original (dictionary) forms, which express complete semantics,
       - Stemming: removing affixes to obtain roots.
     - Data annotation is an essential stage of supervised machine learning methods. The larger the annotated data set, the higher its quality, and the broader its coverage, the better the trained model will perform.
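
The three preparation steps above can be sketched in plain Python. This is an illustration only: the stop word list is a tiny hypothetical sample, and `toy_stem` is a deliberately naive suffix stripper, not the Porter algorithm used by real libraries.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "with"}

# Suffixes stripped by the toy stemmer, longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The texts are processed with text mining methods."))
# ['text', 'process', 'text', 'min', 'method']
```

The stem `min` for "mining" shows that stemming only removes affixes and need not return a dictionary word, which is the difference from lemmatization.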

  9. Unit 2: Text mining techniques
     Basics of Text Representation
     The vector space model is the simplest text representation method. Related basic concepts:
     - Text is a sequence of characters of a certain granularity, such as a phrase, a sentence, a paragraph, or a whole document.
     - Term is the smallest inseparable language unit; it can denote characters, words, phrases, etc.
     - Term weight is the weight assigned to a term according to certain principles, indicating that term's importance and relevance in the text.
     The vector space model assumes that a text conforms to the following two requirements: (1) each term t_i is unique, (2) the terms have no order.
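
A minimal sketch of the vector space model follows. The slide leaves the weighting principle open, so TF-IDF is used here as one common choice (an assumption), and cosine similarity is used to compare the resulting vectors. The function names and sample texts are hypothetical, and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return the sorted vocabulary and one TF-IDF weight vector per document."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "text mining finds patterns in text".split(),
    "text classification assigns labels".split(),
    "stock prices fell sharply".split(),
]
vocab, vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Each position of a vector corresponds to one unique term, and the positions carry no word-order information, which matches the two requirements stated on the slide.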

  10. Unit 2: Text mining techniques
      Text Representation
      - The goal of deep learning for text representation is to learn low-dimensional dense vectors of text at different granularities through machine learning.
      - The bag-of-words model is the most popular text representation method in text data mining tasks such as text classification and sentiment analysis.
      - The goal of text representation is to construct a good representation suitable for specific natural language processing tasks:
        - for sentiment analysis, the representation must capture more emotional attributes,
        - for topic detection and tracking, it must embed more event description information.
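
As a small illustration of the bag-of-words model, the sketch below counts vocabulary terms and shows that two sentences with different meanings can receive the same vector because word order is discarded. The vocabulary and sentences are hypothetical examples.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["bites", "dog", "man"]
print(bag_of_words("Dog bites man", vocabulary))  # [1, 1, 1]
print(bag_of_words("Man bites dog", vocabulary))  # [1, 1, 1]
```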

  11. Unit 2: Text mining techniques
      Text Classification
      In text classification, a document must be correctly and efficiently represented for classification algorithms. The selection of a text representation method depends on the choice of classification algorithm.
      [Figure: The main components of text classification based on traditional machine learning]

  12. Unit 2: Text mining techniques
      Basic Machine Learning Algorithms for Text Classification
      Text classification algorithms:
      - Naïve Bayes is a family of classifiers based on Bayes' theorem. Naïve Bayes models the joint distribution p(x, y) of the observation x and its class y.
      - Maximum entropy (ME) assigns a joint probability to observation-label pairs (x, y) based on a log-linear model:
        $p(x, y) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{x', y'} \exp(\theta \cdot f(x', y'))}$
        where $\theta$ is a vector of weights and f is a function that maps pairs (x, y) to a binary-valued feature vector.
      - Support vector machines (SVM) are a supervised discriminative learning algorithm for binary classification.
      - Ensemble methods combine multiple learning algorithms to obtain better predictive performance than any of the base learning algorithms alone.
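
To make the Naïve Bayes description concrete, here is a minimal from-scratch sketch of a multinomial variant that scores log p(x, y) = log p(y) + Σ log p(word | y). Add-one smoothing, whitespace tokenization, and the four training sentences are assumptions made for the illustration, not part of the slides.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_joint(self, doc, label):
        """Return log p(x, y) = log p(y) + sum of log p(word | y)."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in doc.lower().split():
            if word not in self.vocab:
                continue  # ignore words never seen in training
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, doc):
        """Pick the class with the highest joint log-probability."""
        return max(self.class_counts, key=lambda label: self.log_joint(doc, label))


# Hypothetical toy training data.
train_docs = [
    "great product love it",
    "really good and great value",
    "terrible product hate it",
    "bad quality really awful",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict("great value"))    # pos
print(model.predict("awful quality"))  # neg
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on longer texts.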

  13. Unit 2: Text mining techniques
      Introduction to Topic Models
      Topic models:
      - provide a concept representation method that transforms the high-dimensional sparse vectors of the traditional vector space model into low-dimensional dense vectors to alleviate the curse of dimensionality,
      - can better capture polysemy and synonymy and mine implicit topics (also called concepts) in texts.
      Basic topic models:
      - Latent semantic analysis (LSA) represents a piece of text by a set of implicit semantic concepts rather than the explicit terms of the vector space model. LSA reduces the dimension of the text representation by selecting k latent topics instead of m explicit terms as the basis, using a truncated singular value decomposition of the term-document matrix X: $X \approx U_k \Sigma_k V_k^{\top}$, where $\Sigma_k$ holds the k largest singular values and $U_k$, $V_k$ hold the corresponding singular vectors.
      - Probabilistic latent semantic analysis (PLSA) extends the algebraic framework of latent semantic analysis to include probability.
      - Latent Dirichlet allocation (LDA) introduces a Dirichlet distribution to the document-conditional topic distribution and the topic-conditional term distribution.

  14. Unit 2: Text mining techniques
      BERT: Bidirectional Encoder Representations from Transformers
      BERT is a pretraining and fine-tuning model that employs the bidirectional encoder of the Transformer. The representation $h_j$ of each input token is learned by attending to both the left-side context $x_1, \ldots, x_{j-1}$ and the right-side context $x_{j+1}, \ldots, x_n$. Bidirectional contexts are crucial in tasks like sequence labeling and question answering.
      The contributions of BERT:
      - BERT employs a much deeper model than GPT; the bidirectional encoder consists of up to 24 layers with 340 million network parameters.
      - BERT designs two unsupervised objective functions: the masked language model and next sentence prediction.
      - BERT is pretrained on even larger text datasets.
      [Figure: The architecture of BERT]
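
The masked language model objective can be illustrated by the input corruption step alone. The sketch below selects positions to predict and corrupts them; the 15% selection rate and the 80/10/10 split come from the original BERT paper rather than from the slides, and the function name, signature, and sample sentence are hypothetical. No model is trained here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, {position: original token}) for MLM training."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets


tokens = "the model predicts the hidden words from both sides".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
print(corrupted)
print(targets)
```

The model is then trained to recover `targets` from `corrupted`, which forces each prediction to use context from both the left and the right of the position.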

  15. Unit 2: Text mining techniques
      Sentiment Analysis and Opinion Mining
      The main tasks of sentiment analysis and opinion mining include the extraction, classification, and inference of subjective information in texts, such as sentiment, opinion, attitude, emotion, and stance.
      Sentiment analysis techniques fall into two categories:
      - rule-based methods perform sentiment analysis at different granularities of text based on the sentiment orientation of the words provided by a sentiment lexicon,
      - machine learning-based methods focus on effective feature engineering for text representation and on machine learning.
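
A rule-based method can be sketched with a tiny sentiment lexicon and a single negation rule. The lexicon entries, their scores, and the rule itself are hypothetical; real lexicons such as the one behind VADER are far larger and use more rules.

```python
import re

# Hypothetical mini-lexicon: word -> sentiment orientation.
LEXICON = {"best": 2, "happy": 2, "good": 1, "bad": -1, "worst": -2, "sad": -2}
NEGATIONS = {"not", "no", "never"}


def lexicon_score(text):
    """Sum word orientations, flipping the sign of a word that follows a negation."""
    score = 0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        score += -value if negate else value
        negate = False
    return score


def polarity(text):
    """Map the summed score to a polarity label."""
    score = lexicon_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(polarity("This is the best result. I am very happy!"))  # positive
print(polarity("This is not good."))                          # negative
```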

  16. Unit 3: Case Study with Python
      Common Python libraries for Text Mining
      - NLTK (Natural Language Toolkit) includes powerful libraries for symbolic and statistical natural language processing that can be combined with different ML techniques.
      - spaCy is an open-source library for NLP in Python designed for information extraction and general-purpose natural language processing.
      - TextBlob provides a simple API for NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
      - Stanford NLP contains tools that can be used in a pipeline to convert a string of human language text into lists of sentences and words, to generate the base forms of those words, their parts of speech and morphological features, and to produce a syntactic dependency parse, which is designed to be parallel across more than 70 languages.

  17. Unit 3: Case Study with Python
      1. Using NLTK Libraries for Text Mining

          # Install the library (notebook syntax)
          !pip install nltk

          import nltk
          nltk.download()  # install all models

          # Count word frequency
          text = "This text is process using ntlk libraries!"
          tokens = text.split()
          freq = nltk.FreqDist(tokens)
          freq
          # FreqDist({'This': 1, 'text': 1, 'is': 1, 'process': 1, 'using': 1, 'ntlk': 1, 'libraries!': 1})

          # Import stop words
          nltk.download('stopwords')
          # [nltk_data] Downloading package stopwords to /root/nltk_data...
          from nltk.corpus import stopwords
          english_stopwords = stopwords.words('english')

          # Remove English stop words.
          # The stop word list is lowercase, so the capitalized 'This' is kept;
          # compare t.lower() instead to remove it as well.
          tokens = [t for t in text.split() if t not in english_stopwords]
          tokens
          # ['This', 'text', 'process', 'using', 'ntlk', 'libraries!']

  18. Unit 3: Case Study with Python
      1. Using NLTK Libraries for Text Mining (2)

          # Tokenization into words
          nltk.word_tokenize("This is a text processed with text mining methods. How is it")
          # ['This', 'is', 'a', 'text', 'processed', 'with', 'text', 'mining', 'methods', '.', 'How', 'is', 'it']

          # Tokenization into sentences
          nltk.sent_tokenize("This is a text processed with text mining methods. How is it")
          # ['This is a text processed with text mining methods.', 'How is it']

          # Stemming and lemmatization
          porter_stemmer = nltk.PorterStemmer()
          # The stemmer treats its argument as a single word, so only the ending is stripped.
          porter_stemmer.stem("machine learning")
          # 'machine learn'

          nltk.download('wordnet')  # corpus required by WordNetLemmatizer
          nltk.download('omw-1.4')
          # [nltk_data] Downloading package omw-1.4 to /root/nltk_data...
          # [nltk_data] Package omw-1.4 is already up-to-date!
          wl = nltk.WordNetLemmatizer()
          # lemmatize() expects a single word, so a whole sentence is returned unchanged;
          # apply it to each token to lemmatize a sentence.
          wl.lemmatize("This is a text processed with text mining methods. How is it?")
          # 'This is a text processed with text mining methods. How is it?'

  19. Unit 3: Case Study with Python
      2. Sentiment analysis exemplified using the bag-of-words method and the NLTK library

          # VADER sentiment scoring
          # This uses a "bag of words" approach:
          # stop words are removed, then each word is scored and the scores are combined into a total score.
          import nltk
          from nltk.sentiment import SentimentIntensityAnalyzer
          from tqdm.notebook import tqdm

          nltk.download('vader_lexicon')
          sia = SentimentIntensityAnalyzer()

          sia.polarity_scores('This is the best text mining sentiment analysis. I am very happy!')
          # {'neg': 0.0, 'neu': 0.515, 'pos': 0.485, 'compound': 0.8585}

          sia.polarity_scores('This is the worst result.')
          # {'neg': 0.506, 'neu': 0.494, 'pos': 0.0, 'compound': -0.6249}

  20. Unit 3: Case Study with Python
      2. Sentiment analysis exemplified using the bag-of-words method and the NLTK library (2)

          # Import libraries
          import pandas as pd
          import numpy as np
          import nltk
          import matplotlib.pyplot as plt
          import seaborn as sns
          from nltk.sentiment import SentimentIntensityAnalyzer
          from tqdm.notebook import tqdm

          # Analyzer from the previous slide
          nltk.download('vader_lexicon')
          sia = SentimentIntensityAnalyzer()

          # Read data from the CSV file
          df = pd.read_csv('Reviews.csv')
          df = df.head(500)

          # Run the polarity score on a dataset with comments gathered from Twitter and read from a CSV file
          res = {}
          for i, row in tqdm(df.iterrows(), total=len(df)):
              text = row['Text']
              myid = row['Id']
              res[myid] = sia.polarity_scores(text)

          # Collect the scores in a DataFrame and join them with the original data
          vaders = pd.DataFrame(res).T
          vaders = vaders.reset_index().rename(columns={'index': 'Id'})
          vaders = vaders.merge(df, how='left')

          # Tokenize, tag parts of speech, and chunk named entities for one example
          nltk.download('punkt')
          nltk.download('averaged_perceptron_tagger')
          nltk.download('maxent_ne_chunker')
          nltk.download('words')
          example = df['Text'][50]
          tokens = nltk.word_tokenize(example)
          tagged = nltk.pos_tag(tokens)
          entities = nltk.chunk.ne_chunk(tagged)

          # Plot aggregated sentiment analysis for the entire dataset
          fig, axs = plt.subplots(1, 2, figsize=(10, 2))
          sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
          sns.barplot(data=vaders, x='Score', y='neg', ax=axs[1])
          axs[0].set_title('Positive sentiment')
          axs[1].set_title('Negative sentiment')
          plt.show()

  21. Unit 3: Case Study with Python
      3. Text Classification using Naïve Bayes
      Predict the sentiment of a given review using a Naïve Bayes machine learning model.

          # Import libraries
          import pandas as pd
          import numpy as np
          from sklearn.model_selection import train_test_split
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.naive_bayes import MultinomialNB

          # Read the CSV file
          data = pd.read_csv("sentiment.csv")

          # Split the data into training and testing data
          train, test = train_test_split(data)

          # Initialize the TF-IDF vectorizer
          vectorizer = TfidfVectorizer()

          # Vectorize the training data and prepare x_train, y_train
          x_train = vectorizer.fit_transform(train["review"])
          y_train = train["sentiment"]

          # Initialize the Naive Bayes machine learning model
          model = MultinomialNB()

          # Train the model
          model.fit(x_train, y_train)

          # Evaluate the trained model on the test data,
          # reusing the vocabulary fitted on the training data
          x_test = vectorizer.transform(test["review"])
          y_test = test["sentiment"]
          print("accuracy:", model.score(x_test, y_test))

          # Use the trained model on a new test case
          tc = ['']
          x_tc = vectorizer.transform(tc)
          print("predicted:", model.predict(x_tc))

          # accuracy: 0.8312
          # predicted: [1]
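
The case study reports accuracy only. As a complementary sketch, precision, recall, and F1 for the positive class can be computed from the true and predicted labels, which is informative when the sentiment classes are imbalanced. The helper name and the label lists below are hypothetical and are not results from the slides.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```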

  22. Summing up
      - Forecasting
      - Preprocess/clean the text data
      - Feature extraction
      - Improve the model's performance
      - Text mining
      - Text representation

  23. Self-assessment test
      1. Typical text mining tasks include __________:
         A. Text categorization
         B. Text clustering
         C. Modeling entity relation
         D. All of the above
      2. Is stemming the process of separating the prefixes and suffixes from words to derive the root word form and meaning?
         A. TRUE
         B. FALSE
         C. TRUE or FALSE
      3. Which one of the following word embeddings can be custom trained for a specific subject in NLP?
         A. Word2Vec
         B. BERT
         C. GloVe
         D. All of the above

  24. Self-assessment test: Answers
      1. Typical text mining tasks include __________:
         A. Text categorization
         B. Text clustering
         C. Modeling entity relation
         D. All of the above
      2. Is stemming the process of separating the prefixes and suffixes from words to derive the root word form and meaning?
         A. TRUE
         B. FALSE
         C. TRUE or FALSE
      3. Which one of the following word embeddings can be custom trained for a specific subject in NLP?
         A. Word2Vec
         B. BERT
         C. GloVe
         D. All of the above

  25. datascience-project.eu
      Thank you!
