Text Mining

By [partner]
Index
Unit 1: Introduction
1. What is text mining?
2. Text mining challenges
3. Text mining processing flow
Unit 2: Text mining techniques
1. Typical text mining techniques
2. Techniques for data preparation and transformation
3. Text representation
4. Text classification
5. Introduction to topic models and BERT
6. Sentiment analysis and opinion mining
Unit 3: Case Study with Python
1. Common Python libraries for text mining tasks
2. Using the NLTK library for text mining
3. Sentiment analysis exemplified using the bag-of-words method and the NLTK library
4. Text classification using Naïve Bayes
Unit 1: Introduction
What is Text Mining?
Text mining is a confluence of natural language processing, data mining, machine learning, and
statistics used to mine knowledge from unstructured text.
Generally speaking, text mining can be classified into two types:
The user’s questions are very clear and specific, but they do not know the answer to the
questions.
The user only knows the general aim but does not have specific and definite questions.
 
Text Mining Challenges
Natural language text is unstructured.
Most data mining methods handle structured or semi-structured data, so the analysis and modeling of unstructured natural language text is challenging.
Text data mining is de facto an integrated technology of natural language processing, pattern
classification, and machine learning.
The theoretical system of natural language processing has not yet been fully established.
The main difficulties confronted in text mining are caused by:
The occurrence of noise or ill-formed expressions,
Ambiguous expressions in the text,
The difficulty of collecting and annotating samples to train machine learning methods,
The difficulty of expressing the purpose and requirements of text mining.
 
Text Mining Processing Flow
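Text mining performs a series of general tasks to mine texts, documents, books, and comments effectively:
[Figure: text mining processing flow, with stages labeled gather data, data preparation and transformation, index data, text mining, and analysis]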
 
Unit 2: Text mining techniques
Typical text mining techniques
Text mining is a research field that draws on multiple technologies and techniques:
Text classification methods assign a given text to one of a set of predefined text types.
Text clustering techniques divide a given set of texts into categories that are not defined in advance.
Topic models are statistical models used to mine the topics and concepts hidden behind the words in a text.
Text sentiment analysis (text opinion mining) reveals the subjective information expressed by a text’s author, that is, the author’s viewpoint and attitude. The text is classified based on the attitudes expressed in it or on judgments of its positive or negative polarity.
Topic detection refers to the mining and screening of text topics (hot topics), which is useful for public opinion analysis, social media computing, and personalized information services.
Information extraction refers to the extraction of factual information such as entities, entity attributes, relationships between entities, and events from unstructured and semi-structured natural language text, which it turns into structured data output.
Automatic text summarization automatically generates summaries using natural language processing methods.
Techniques for Data Preparation and Transformation
Tokenization refers to the process of segmenting a given text into lexical units.
Removing stop words: stop words are mainly function words, including auxiliary words, prepositions, conjunctions, modal words, and other high-frequency words.
Word form normalization improves the efficiency of text processing. It includes two basic concepts:
Lemmatization - the restoration of inflected words to their original (base) forms, so that they express complete semantics,
Stemming - the process of removing affixes to obtain roots.
Data annotation is an essential stage of supervised machine learning methods. In general, the larger the scale of the annotated data, the higher its quality, and the broader its coverage, the better the trained model will perform.
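The preparation steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the stop-word list is a tiny hypothetical sample, and `toy_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("The texts are processed with text mining methods.")
print(tokens)  # ['text', 'process', 'text', 'min', 'method']
```

Note how "mining" is reduced to "min": stemming only removes affixes and may produce strings that are not words, which is the practical difference from lemmatization.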
Basics of Text Representation
Related basic concepts:
Text is a sequence of characters at a certain granularity, such as a phrase, a sentence, a paragraph, or a whole document.
Term is the smallest inseparable language unit; it can denote a character, a word, a phrase, etc.
Term weight is the weight assigned to a term according to certain principles, indicating that term’s importance and relevance in the text.
The vector space model is the simplest text representation method.
It assumes that a text conforms to the following two requirements: (1) each term t_i is unique, (2) the terms have no order.
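As a rough illustration of terms, term weights, and the vector space model, the sketch below weights each term by TF-IDF and compares documents by cosine similarity. TF-IDF is one common weighting principle chosen here as an assumption (the slides do not name a specific one), and the three documents are made up for the example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; each document is tokenized by whitespace.
docs = [
    "text mining finds patterns in text",
    "machine learning finds patterns in data",
    "sentiment analysis of text",
]
tokenized = [d.split() for d in docs]

# The vocabulary fixes one vector dimension per unique term.
vocab = sorted({t for doc in tokenized for t in doc})

def idf(term):
    """Inverse document frequency: rarer terms get larger weights."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(doc):
    """Represent a document as a vector of term weights (TF * IDF)."""
    counts = Counter(doc)
    return [counts[t] / len(doc) * idf(t) for t in vocab]

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = [tfidf_vector(doc) for doc in tokenized]
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

Each vector has one component per vocabulary term, and a term that does not occur in a document simply gets weight zero there.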
Text Representation
The goal of text representation is to construct a good representation suited to specific natural language processing tasks:
For the sentiment analysis task, the representation needs to embody more emotional attributes,
For topic detection and tracking tasks, more event description information must be embedded.
The goal of deep learning for text representation is to learn low-dimensional dense vectors of text at different granularities through machine learning.
The bag-of-words model is the most popular text representation method in text data mining tasks such as text classification and sentiment analysis.
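A minimal sketch of the bag-of-words idea, assuming a small hand-picked vocabulary: each text becomes a vector of term counts, and word order is discarded, so two sentences with opposite meanings can receive the same representation.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (order is ignored)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary chosen for the example.
vocabulary = ["dog", "bites", "man", "happy"]

v1 = bag_of_words("dog bites man", vocabulary)
v2 = bag_of_words("man bites dog", vocabulary)
print(v1, v2)  # [1, 1, 1, 0] [1, 1, 1, 0]
```

The identical vectors show both the simplicity of the model and its main limitation for tasks where word order matters.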
Text Classification
In text classification, a document must be correctly and efficiently represented for classification algorithms.
The selection of a text representation method depends on the choice of classification algorithm.
[Figure: the main components of text classification based on traditional machine learning]
Basic Machine Learning Algorithms for Text Classification
Text classification algorithms:
Naïve Bayes is a family of classifiers based on Bayes’ theorem. Naïve Bayes models the joint distribution p(x, y) of the observation x and its class y.
Maximum entropy (ME) assigns the joint probability to observation and label pairs (x, y) based on a log-linear model:
$p_\theta(x, y) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{x', y'} \exp(\theta \cdot f(x', y'))}$
where θ is a vector of weights and f is a function that maps pairs (x, y) to a binary-valued feature vector.
Support vector machines (SVM) are a supervised discriminative learning algorithm for binary classification.
Ensemble methods combine multiple learning algorithms to obtain better predictive performance than any of the base learning algorithms alone.
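To make the "joint distribution p(x, y)" view of Naïve Bayes concrete, here is a small from-scratch sketch of a multinomial Naïve Bayes classifier with add-one smoothing. The training sentences and labels are invented for illustration, and the function names are hypothetical; a library implementation would normally be used in practice.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_docs = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    return class_docs, word_counts, vocab

def log_joint(tokens, label, model):
    """log p(x, y) = log p(y) + sum_i log p(x_i | y), with add-one smoothing."""
    class_docs, word_counts, vocab = model
    score = math.log(class_docs[label] / sum(class_docs.values()))
    total_words = sum(word_counts[label].values())
    for t in tokens:
        if t not in vocab:
            continue  # ignore words never seen in training
        score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
    return score

def predict(tokens, model):
    """Choose the class with the highest joint probability."""
    return max(model[0], key=lambda label: log_joint(tokens, label, model))

# Invented toy training data.
train = [
    ("great film loved it".split(), "pos"),
    ("wonderful great acting".split(), "pos"),
    ("terrible film hated it".split(), "neg"),
    ("boring terrible plot".split(), "neg"),
]
model = train_nb(train)
prediction = predict("great acting".split(), model)
print(prediction)  # pos
```

The "naïve" part is the product over words in `log_joint`: each word is treated as independent of the others given the class.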
Introduction to Topic Models
Basic topic models:
Latent semantic analysis (LSA) represents a piece of text by a set of implicit semantic concepts rather than by the explicit terms of the vector space model. LSA reduces the dimension of the text representation by selecting k latent topics instead of m explicit terms as the basis for text representation, using a truncated singular value decomposition of the term-document matrix X: $X \approx U_k \Sigma_k V_k^{\top}$
Probabilistic latent semantic analysis (PLSA) extends the algebraic framework of latent semantic analysis to include probability.
Latent Dirichlet allocation (LDA) introduces a Dirichlet distribution to the document-conditional topic distribution and the topic-conditional term distribution.
Topic models provide a concept representation method that transforms the high-dimensional sparse vectors of the traditional vector space model into low-dimensional dense vectors to alleviate the curse of dimensionality. They can better capture polysemy and synonymy and mine implicit topics (also called concepts) in texts.
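The dimension reduction performed by LSA can be sketched with NumPy's singular value decomposition. The term-document counts below are made up (two documents about pets, two about finance), and k = 2 is an arbitrary choice for the example.

```python
import numpy as np

# Hypothetical term-document matrix X: rows are m = 5 terms, columns are n = 4 documents.
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2  # number of latent topics to keep

# Full SVD, then keep only the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document is now a k-dimensional dense vector instead of an m-dimensional one.
doc_vectors = (np.diag(s_k) @ Vt_k).T

# Rank-k approximation of the original matrix.
X_k = U_k @ np.diag(s_k) @ Vt_k

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc_vectors.shape)                          # (4, 2)
print(cosine(doc_vectors[0], doc_vectors[1]))     # close to 1: same latent topic
print(cosine(doc_vectors[0], doc_vectors[2]))     # close to 0: different latent topics
```

Documents 0 and 1 use the terms with different frequencies, yet they land on the same latent dimension, which is the sense in which LSA captures concepts rather than explicit terms.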
BERT: Bidirectional Encoder Representations from Transformers
BERT is a pretraining and fine-tuning model that employs the bidirectional encoder of the Transformer.
The representation h_j of each input token is learned by attending to both the left-side context x_1, ..., x_{j-1} and the right-side context x_{j+1}, ..., x_n.
Bidirectional contexts are crucial in tasks like sequence labeling and question answering.
[Figure: the architecture of BERT]
The contributions of BERT:
BERT employs a much deeper model than GPT; the bidirectional encoder consists of up to 24 layers with 340 million network parameters.
BERT designs two unsupervised objective functions: the masked language model and next sentence prediction.
BERT is pretrained on even larger text datasets.
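The masked language model objective can be illustrated by how its training inputs are prepared. The sketch below is not BERT itself: it only corrupts a token sequence and records which original tokens a model would have to predict. The 15% selection rate and the 80/10/10 replacement split follow the proportions described in the BERT paper rather than anything stated in these slides, and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # usually replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # sometimes a random token
            else:
                inputs.append(tok)                # sometimes leave it unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

sentence = "the cat sat on the mat and looked at the dog".split()
inputs, labels = mask_tokens(sentence, vocab=sorted(set(sentence)), seed=1)
print(inputs)
print(labels)
```

Because the prediction targets can sit anywhere in the sequence, the encoder is free to use both left-side and right-side context for each of them.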
Sentiment Analysis and Opinion Mining
The main tasks of sentiment analysis and opinion mining include the extraction, classification, and inference of subjective information in texts, such as sentiment, opinion, attitude, emotion, and stance.
Sentiment analysis techniques fall into two categories:
rule-based methods perform sentiment analysis at different granularities of text, based on the sentiment orientation of the words provided by a sentiment lexicon,
machine learning-based methods focus on effective feature engineering for text representation and on machine learning.
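A rule-based method can be sketched with a sentiment lexicon and one simple negation rule. The lexicon below is a tiny hypothetical sample with made-up scores; real lexicons contain thousands of entries and more elaborate rules.

```python
import re

# Hypothetical mini-lexicon: word -> sentiment orientation score.
LEXICON = {"best": 2, "happy": 2, "good": 1, "bad": -1, "worst": -2, "sad": -2}
NEGATIONS = {"not", "no", "never"}

def lexicon_score(text):
    """Sum the lexicon scores of the words, flipping the sign after a negation word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True  # applies to the next sentiment word that follows
            continue
        value = LEXICON.get(tok, 0)
        if value:
            score += -value if negate else value
            negate = False
    return score

def polarity(text):
    """Map the total score to a polarity label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("This is the best result. I am very happy!"))  # positive
print(polarity("This is the worst result."))                  # negative
print(polarity("This is not good"))                           # negative
```

No training data is needed here, which is the main practical difference from the machine learning-based methods.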
Unit 3: Case Study with Python
Common Python libraries for Text Mining
NLTK (Natural Language Toolkit) includes powerful libraries for symbolic and statistical natural language processing that can be used with different ML techniques.
spaCy is an open-source library for NLP in Python, designed for information extraction and general-purpose natural language processing.
TextBlob provides a simple API for NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Stanford NLP contains tools that can be used in a pipeline to convert a string containing human language text into lists of sentences and words, to generate the base forms of those words, their parts of speech and morphological features, and to give a syntactic dependency parse, which is designed to be parallel among more than 70 languages.
1. Using NLTK Libraries for Text Mining
# Install the library
!pip install nltk
import nltk
nltk.download()  # install all models (opens the interactive NLTK downloader)
# Count word frequency
text = "This text is process using ntlk libraries!"
tokens = [t for t in text.split()]
freq = nltk.FreqDist(tokens)
freq
FreqDist({'This': 1, 'text': 1, 'is': 1, 'process': 1, 'using': 1, 'ntlk': 1, 'libraries!': 1})
# Import stop words
nltk.download('stopwords')
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
# Remove English stop words (the list is lowercase, so 'This' is kept)
tokens = [t for t in text.split() if t not in english_stopwords]
tokens
[nltk_data] Downloading package stopwords to /root/nltk_data...!
['This', 'text', 'process', 'using', 'ntlk', 'libraries!']
# Tokenization
nltk.word_tokenize("This is a text processed with text mining methods. How is it")
['This', 'is', 'a', 'text', 'processed', 'with', 'text', 'mining', 'methods', '.', 'How', 'is', 'it']
nltk.sent_tokenize("This is a text processed with text mining methods. How is it")
['This is a text processed with text mining methods.', 'How is it']
# Stemming and lemmatization
porter_stemmer = nltk.PorterStemmer()
# stem() treats its argument as a single word, so only the final suffix is stripped
porter_stemmer.stem("machine learning")
'machine learn'
nltk.download('omw-1.4')
wl = nltk.WordNetLemmatizer()
# lemmatize() expects a single word, so a whole sentence is returned unchanged
wl.lemmatize("This is a text processed with text mining methods. How is it?")
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
'This is a text processed with text mining methods. How is it?'
2. Sentiment analysis exemplified using the bag-of-words method and the NLTK library
# VADER sentiment scoring
# This uses a "bag of words" approach:
# stop words are removed, then each word is scored and the scores are combined into a total score.
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
sia.polarity_scores('This is the best text mining sentiment analysis. I am very happy!')
{'neg': 0.0, 'neu': 0.515, 'pos': 0.485, 'compound': 0.8585}
sia.polarity_scores('This is the worst result.')
{'neg': 0.506, 'neu': 0.494, 'pos': 0.0, 'compound': -0.6249}
# Import libraries
import pandas as pd
import numpy as np
import nltk
# Read data from the CSV file and keep the first 500 rows
df = pd.read_csv('Reviews.csv')
df = df.head(500)
# Tokenize one example review, tag its parts of speech, and chunk named entities
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
example = df['Text'][50]
tokens = nltk.word_tokenize(example)
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)
# Run the polarity score on a dataset with comments gathered from Twitter and read from a CSV file
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)
# Build one row per Id with the VADER scores as columns, then join the original data
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'Id'})
vaders = vaders.merge(df, how='left')
# Plot aggregated sentiment analysis for the entire dataset
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(1, 2, figsize=(10, 2))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[1])
axs[0].set_title('Positive sentiment')
axs[1].set_title('Negative sentiment')
plt.show()
3. Text Classification using Naïve Bayes
Predict the sentiment of a given review using a Naïve Bayes machine learning model.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Read the CSV file
data = pd.read_csv("sentiment.csv")
# Split the data into training and testing sets
train, test = train_test_split(data)
# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Vectorize the training data and prepare x_train
x_train = vectorizer.fit_transform(train["review"])
y_train = train["sentiment"]
# Initialize the Naïve Bayes machine learning model
model = MultinomialNB()
# Train the model
model.fit(x_train, y_train)
# Prepare the test data with the vectorizer fitted on the training data
x_test = vectorizer.transform(test["review"])
y_test = test["sentiment"]
# Evaluate the model score (mean accuracy on the test set)
print("accuracy:", model.score(x_test, y_test))
# Use the trained model on a new review
tc = ['']
x_tc = vectorizer.transform(tc)
print("predicted:", model.predict(x_tc))
accuracy: 0.8312
predicted: [1]
Summing up
Preprocess/clean the text data
Text representation
Feature extraction
Text mining
Forecasting
Improve the model’s performance
Self-assessment test
1. Typical text mining tasks include __________:
A  Text categorization
B  Text clustering
C  Modeling entity relation
D  All of the above
2. Is stemming the process of separating the prefixes and suffixes from words to derive the root word form and meaning?
A  TRUE
B  FALSE
C  TRUE or FALSE
3. Which one of the following word embeddings can be custom trained for a specific subject in NLP?
A  Word2Vec
B  BERT
C  GloVe
D  All the above
Thank you!

An overview of text mining and its challenges, the main techniques and tools, and a Python case study on sentiment analysis and text classification.

  • text mining
  • natural language processing
  • data mining
  • machine learning
  • text classification
  • sentiment analysis
  • Python
  • case study

Uploaded on Dec 21, 2023




datascience-project.eu
The European Commission's support for the production of this publication does not constitute an endorsement of the contents, which reflect the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

  2. Index 1 3 2 Unit 1: Introduction 1. What is text mining? 2. Text mining challenges 3. Text mining processing flow Unit 2: Text mining techniques 1. Typical text mining techniques 2. Techniques for data preparation and transformation 3. Text representation 4. Text classification 5. Introduction in topic models and BERT 6. Sentiment analysis and opinion mining Unit 3: Case Study with Python 1. Common Python libraries for Text Mining Tasks 2. Using NTLK library for text mining 3. Sentiment analysis exemplified using bag of words method and NLTK library 4. Text classification using Na ve Base The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  3. Unit 1: Introduction What is Text Mining? Text mining is a confluence of natural language processing, data mining, machine learning, and statistics used to mine knowledge from unstructured text. Generally speaking, text mining can be classified into two types: The user s questions are very clear and specific, but they do not know the answer to the questions. The user only knows the general aim but does not have specific and definite questions. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  4. Unit 1: Introduction Text Mining Challenges Natural language text is unstructured. Most data mining methods handle structured or semi-structured data=> the analysis and modeling of unstructured natural language text is challenging. Text data mining is de facto an integrated technology of natural language processing, pattern classification, and machine learning. The theoretical system of natural language processing has not yet been fully established. The main difficulties confronted in text mining are generated by: The occurrence of noise or ill-formed expressions, Ambiguous expressions in the text, Difficult collection and annotation of samples to nurture machine learning methods, Hard to express the purpose and requirements of text mining The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  5. Unit 1: Introduction Text Mining Processing Flow Text mining performs some general tasks to effectively mine texts, documents, books, comments: Data preparation and transformation Gather data Index data Text Mining Analysis The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  6. Unit 2: Text mining techniques Typical text mining techniques Text mining is a research field crossing multiple technologies and techniques: Text classification methods divide a given text into predefined text types. Text clustering techniques divide a given text set into different categories. Topic models = statistical models used to mine the topics and concepts hidden behind words in text. Text sentiment analysis (text opinion mining) reveals the subjective information expressed by a text s author, that is, the author s viewpoint and attitude. The text is classified based on attitudes expressed in the text or judgments of its positive or negative polarity. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  7. Unit 2: Text mining techniques Typical text mining techniques (2) Topic detection refers to the mining and screening of text topics (hot topics) reliable for public opinion analysis, social media computing, and personalized information services. Information Extraction refers to the extraction of factual information such as entities, entity attributes, relationships between entities, and events from unstructured and semistructured natural language text which it forms into structured data output. Automatic text summarization automatically generates summaries using natural language processing methods. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  8. Unit 2: Text mining techniques Techniques for Data Preparation and Transformation Tokenization refers to a process of segmenting a given text into lexical units. Removing stop words: Stop words mainly refer to functional words, including auxiliary words, prepositions, conjunctions, modal words, and other high frequency words. Word form normalization to improve the efficiency of text processing. Word form normalization includes two basic concepts: Lemmatization - the restoration of deformed words into original forms, to express complete semantics, Stemming - the process of removing affixes to obtain roots. Data annotation represent an essential stage of supervised machine learning methods. If the scale of annotated data is larger, the quality is higher, and if the coverage is broader, the performance of the trained model will be better. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  9. Unit 2: Text mining techniques Basics of Text Representation Vector Space Model is the simplest text representation method. Related basic concepts: Text is a sequence of characters with certain granularities, such as phrases, sentences, paragraphs, or a whole document. Term is the smallest inseparable language unit that can denote characters, words, phrases, etc. Term weight is the weight assigned to a term according to certain principles, indicating that term s importance and relevance in the text. The vector space model assumes that a text conforms to the following two requirements: (1) each term ti is unique, (2) the terms have no order. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  10. Unit 2: Text mining techniques Text Representation The goal of deep learning for text representation is to learn low-dimensional dense vectors of text at different granularities through machine learning. The bag-of-words model is the most popular text representation method in text data mining tasks such as text classification and sentiment analysis. The goal of text representation is to construct a good representation suitable for specific natural language processing tasks: For the sentiment analysis task, it is necessary to embody more emotional attributes, For topic detection and tracking tasks, more event description information must be embedded, The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  11. Unit 2: Text mining techniques Text Classification In text classification, a document must be correctly and efficiently represented for classification algorithms. The selection of a text representation method depends on the choice of classification algorithm. The main components of text classification based on traditional machine learning The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  12. Unit 2: Text mining techniques Basic Machine Learning Algorithms for Text Classification Text classification algorithms: Naive Bayes is a collection of classifiers which works on the principles of the Bayes theorem. Na ve Bayes models the joint distribution p(x, y) of the observation x and its class y. Maximum entropy (ME) assigns the joint probability to observation and label pairs (x, y) based on a log-linear model : where: is a vector of weights, f is a function that maps pairs (x, y) to a binary-value feature vector Support vector machines (SVM) is a supervised discriminative learning algorithm for binary classification. Ensemble methods combine multiple learning algorithms to obtain better predictive performance than any of the base learning algorithms alone. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  13. Unit 2: Text mining techniques Introduction in Topic Models Topic models provide a concept representation method that transforms the high-dimensional sparse vectors in the traditional vector space model into low-dimensional dense vectors to alleviate the curse of dimensionality. can better capture polysemy and synonymy and mine implicit topics (also called concepts) in texts. Basic topic models: Latent Semantic Analysis (LSA) represents a piece of text by a set of implicit semantic concepts rather than the explicit terms in the vector space model. LSA reduces the dimension of text representation by selecting k latent topics instead of m explicit terms as the basis for text representation using the following decomposition matrix: Probabilistic latent semantic analysis (PLSA) extends latent semantic analysis s algebra framework to include probability. Latent Dirichlet allocation (LDA) introduces a Dirichlet distribution to the document-conditional topic distribution and the topic-conditional term distribution. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  14. Unit 2: Text mining techniques BERT: Bidirectional Encoder Representations from Transformer BERT is a pretraining and fine-tuning model that employs the bidirectional encoder of Transformer. The representation of each input token hj is learned by attending to both the left-side context, x1, , xj 1 and the right-side context xj+1, , xn. Bidirectional contexts are crucial in tasks like sequential labeling and question answering. The contributions of BERT: BERT employs a much deeper model than GPT, and the bidirectional encoder consists of up to 24 layers with 340 million network parameters. BERT designs two unsupervised objective functions, including the masked language model and next sentence prediction. BERT is pretrained on even larger text datasets. The architecture of BERT The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  15. Unit 2: Text mining techniques Sentiment Analysis and Opinion Mining The main tasks of sentiment analysis and opinion mining include the extraction, classification, and inference of subjective information in texts, such as sentiment, opinion, attitude, emotion, stance. Sentiment analysis techniques are naturally divided into two categories: rules-based methods - perform sentiment analysis at different granularities of text based on the sentiment orientation of the words provided by a sentiment lexicon, machine learning-based methods focus on effective feature engineering for text representation and machine learning. The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  16. Unit 3: Case Study with Python Common Python libraries for Text Mining NLTK (Natural Language Toolkit) includes powerful libraries for symbolic and statistical natural language processing that can work on different ML techniques. SpaCy - open-source library for NLP in Python designed for information extraction or general-purpose natural language processing. TextBlob library provides a simple API for NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. Stanford NLP contains tools useful in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, and to give a syntactic structure dependency parse, which is designed to be parallel among more than 70 languages The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  17. Unit 3: Case Study with Python 1. Using NTLK Libraries for Text Mining # install library !pip install nltk import nltk nltk.download() # install all models # Count word frequency text = "This text is process using ntlk libraries!" tokens = [t for t in text.split()] freq = nltk.FreqDist(tokens) freq FreqDist({'This': 1, 'text': 1, 'is': 1, 'process': 1, 'using': 1, 'ntlk': 1, 'libraries!': 1}) # Import stop words nltk.download('stopwords') from nltk.corpus import stopwords english_stopwords = stopwords.words('english') # Remove english stop words tokens = [t for t in text.split() if t not in english_stopwords] tokens [nltk_data] Downloading package stopwords to /root/nltk_data...! ['This', 'text', 'process', 'using', 'ntlk', 'libraries!'] The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein."

  18. Unit 3: Case Study with Python
      1. Using NLTK Libraries for Text Mining (2)

          # Tokenization into words
          nltk.word_tokenize("This is a text processed with text mining methods. How is it")

      Output:
          ['This', 'is', 'a', 'text', 'processed', 'with', 'text', 'mining', 'methods', '.', 'How', 'is', 'it']

          # Tokenization into sentences
          nltk.sent_tokenize("This is a text processed with text mining methods. How is it")

      Output:
          ['This is a text processed with text mining methods.', 'How is it']

          # Stemming and Lemmatization
          porter_stemmer = nltk.PorterStemmer()
          porter_stemmer.stem("machine learning")

      Output:
          'machine learn'

          nltk.download('omw-1.4')
          wl = nltk.WordNetLemmatizer()
          # lemmatize() expects a single word; a whole sentence is returned unchanged
          wl.lemmatize("This is a text processed with text mining methods. How is it?")

      Output:
          [nltk_data] Downloading package omw-1.4 to /root/nltk_data...
          [nltk_data] Package omw-1.4 is already up-to-date!
          'This is a text processed with text mining methods. How is it?'
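
To make the two tokenization outputs above easier to interpret, here is a rough regular-expression approximation written with the standard library only. It is a sketch, not how NLTK works internally: `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helper names, and NLTK's tokenizers rely on trained models and many more rules (abbreviations such as "e.g." would be split incorrectly here).

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "This is a text processed with text mining methods. How is it"
sentences = simple_sent_tokenize(text)
words = simple_word_tokenize(text)

print(sentences)
print(words)
```

For this particular input the sketch yields the same sentences and tokens as the slide, which shows the key idea: punctuation becomes its own token, and sentence boundaries are placed after terminal punctuation.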

  19. Unit 3: Case Study with Python
      2. Sentiment analysis exemplified using the bag-of-words method and the NLTK library

          # VADER sentiment scoring
          # This uses a "bag of words" approach:
          # stop words are removed, each word is scored, and the scores are combined into a total score.
          from nltk.sentiment import SentimentIntensityAnalyzer
          from tqdm.notebook import tqdm
          nltk.download('vader_lexicon')
          sia = SentimentIntensityAnalyzer()

          sia.polarity_scores('This is the best text mining sentiment analysis. I am very happy!')

      Output:
          {'neg': 0.0, 'neu': 0.515, 'pos': 0.485, 'compound': 0.8585}

          sia.polarity_scores('This is the worst result.')

      Output:
          {'neg': 0.506, 'neu': 0.494, 'pos': 0.0, 'compound': -0.6249}
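
The "each word is scored and combined" idea can be illustrated with a toy lexicon lookup. The sketch below is an assumption-laden simplification and not VADER itself: the `LEXICON` values are made up for illustration and are not taken from the VADER lexicon, and `lexicon_score` and `label` are hypothetical helpers. VADER additionally handles negation, intensifiers such as "very", and punctuation emphasis, and normalizes its compound score to the range [-1, 1].

```python
# Hypothetical mini-lexicon: word -> valence. The numbers are invented for this
# sketch and do not come from the VADER lexicon.
LEXICON = {"best": 3.0, "happy": 2.5, "good": 2.0, "bad": -2.5, "worst": -3.0}


def lexicon_score(text):
    """Sum the valence of every word; words missing from the lexicon count as 0."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(LEXICON.get(w, 0.0) for w in words)


def label(score):
    """Map a raw score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in [
    "This is the best text mining sentiment analysis. I am very happy!",
    "This is the worst result.",
]:
    score = lexicon_score(sentence)
    print(label(score), score)
```

Because word order is ignored, a pure bag-of-words scorer like this one cannot tell "not good" from "good"; that limitation is one reason rule-based additions or learned models are used in practice.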

  20. Unit 3: Case Study with Python
      2. Sentiment analysis exemplified using the bag-of-words method and the NLTK library

          # Import libraries
          import pandas as pd
          import numpy as np
          import nltk
          import matplotlib.pyplot as plt
          import seaborn as sns

          # Read data from the CSV file
          df = pd.read_csv('Reviews.csv')
          df = df.head(500)

          # Run the polarity score on a dataset with comments gathered from Twitter and read from a CSV file
          # (sia and tqdm are defined on the previous slide)
          res = {}
          for i, row in tqdm(df.iterrows(), total=len(df)):
              text = row['Text']
              myid = row['Id']
              res[myid] = sia.polarity_scores(text)

          # Collect the scores in a DataFrame and join them with the original data on the shared 'Id' column
          vaders = pd.DataFrame(res).T
          vaders = vaders.reset_index().rename(columns={'index': 'Id'})
          vaders = vaders.merge(df, how='left')

          # Tokenize, tag parts of speech, and chunk named entities for one example review
          nltk.download('punkt')
          nltk.download('averaged_perceptron_tagger')
          nltk.download('maxent_ne_chunker')
          nltk.download('words')
          example = df['Text'][50]
          tokens = nltk.word_tokenize(example)
          tagged = nltk.pos_tag(tokens)
          entities = nltk.chunk.ne_chunk(tagged)

          # Plot aggregated sentiment analysis for the entire dataset
          fig, axs = plt.subplots(1, 2, figsize=(10, 2))
          sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
          sns.barplot(data=vaders, x='Score', y='neg', ax=axs[1])
          axs[0].set_title('Positive sentiment')
          axs[1].set_title('Negative sentiment')
          plt.show()

  21. Unit 3: Case Study with Python
      3. Text Classification using Naïve Bayes
      Predict the sentiment of a given review using a Naïve Bayes machine learning model.

          # Import libraries
          import pandas as pd
          import numpy as np
          from sklearn.model_selection import train_test_split
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.naive_bayes import MultinomialNB

          # Read the CSV file
          data = pd.read_csv("sentiment.csv")

          # Split the data into training and testing sets
          train, test = train_test_split(data)

          # Initialize the TF-IDF vectorizer
          vectorizer = TfidfVectorizer()

          # Vectorize the training data and prepare x_train and y_train
          x_train = vectorizer.fit_transform(train["review"])
          y_train = train["sentiment"]

          # Initialize the Naive Bayes machine learning model
          model = MultinomialNB()

          # Train the model
          model.fit(x_train, y_train)

          # Evaluate the trained model on the test data
          # (transform only, so the vocabulary learned from the training data is reused)
          x_test = vectorizer.transform(test["review"])
          y_test = test["sentiment"]
          tc = ['']  # text of the review to classify
          print("accuracy:", model.score(x_test, y_test))

          # Use the trained model
          x_tc = vectorizer.transform(tc)
          print("predicted:", model.predict(x_tc))

      Output:
          accuracy: 0.8312
          predicted: [1]
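
To show what a multinomial Naïve Bayes classifier computes, the following standard-library sketch trains on four made-up reviews. It is a simplified illustration under stated assumptions, not the scikit-learn implementation: it uses raw word counts instead of TF-IDF weights, Laplace smoothing with `alpha=1.0`, and hypothetical helper names (`train_nb`, `predict_nb`). The reviews and labels are invented, with 1 meaning positive and 0 meaning negative.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, lab in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[lab].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def predict_nb(model, doc, alpha=1.0):
    """Return the class with the highest log prior plus summed log likelihoods."""
    n_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_logp = None, -math.inf
    for lab, n_lab in model["class_counts"].items():
        logp = math.log(n_lab / n_docs)  # log prior P(class)
        total = sum(model["word_counts"][lab].values())
        for token in doc.lower().split():
            if token not in model["vocab"]:
                continue  # words never seen in training are ignored
            count = model["word_counts"][lab][token]
            # Laplace-smoothed estimate of P(word | class)
            logp += math.log((count + alpha) / (total + alpha * vocab_size))
        if logp > best_logp:
            best_label, best_logp = lab, logp
    return best_label


# Made-up training data for illustration only
docs = [
    "great product loved it",
    "excellent quality great value",
    "terrible product hated it",
    "poor quality waste of money",
]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, "great quality"))
print(predict_nb(model, "terrible waste"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.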

  22. Summing up
      Forecasting
      Preprocess/clean the text data
      Feature extraction
      Improve model's performance
      Text mining
      Text representation

  23. Self-assessment test
      1. Typical text mining tasks include__________:
          A Text categorization
          B Text clustering
          C Modeling entity relation
          D All of the above
      2. Is stemming the process of separating the prefixes and suffixes from words to derive the root word form and meaning?
          A TRUE
          B FALSE
          C TRUE or FALSE
      3. Which one of the following word embeddings can be custom trained for a specific subject in NLP?
          A Word2Vec
          B BERT
          C GloVe
          D All of the above

  24. Self-assessment test: Answers
      1. Typical text mining tasks include__________:
          A Text categorization
          B Text clustering
          C Modeling entity relation
          D All of the above
      2. Is stemming the process of separating the prefixes and suffixes from words to derive the root word form and meaning?
          A TRUE
          B FALSE
          C TRUE or FALSE
      3. Which one of the following word embeddings can be custom trained for a specific subject in NLP?
          A Word2Vec
          B BERT
          C GloVe
          D All of the above

  25. datascience-project.eu Thank you!
