Understanding Text Representation and Mining in Business Intelligence and Analytics
Text representation and mining play a crucial role in Business Intelligence and Analytics. Dealing with text data, understanding why text is difficult, and the importance of text preprocessing are key aspects covered in this session. Learn about the goals of text representation, the concept of Bag of Words, and the pre-processing steps involved. Gain insights into transforming unstructured text into valuable data for analysis.
Presentation Transcript
Business Intelligence and Analytics: Representing and Mining Text Session 11
Agenda: Text Representation; Text Preprocessing; Term Frequency and IDF; Case Study
Dealing with Text: Data are represented in ways natural to the problems from which they were derived. There is a vast amount of text. If we want to apply the many data mining tools that we have at our disposal, we must either engineer the data representation to match the tools (representation engineering) or build new tools to match the data.
Why Text is Difficult: Text is unstructured. Linguistic structure is intended for human communication and not for computers. Word order matters sometimes. Text can be dirty: people write ungrammatically, misspell words, abbreviate unpredictably, and punctuate randomly. Text contains synonyms, homographs, abbreviations, etc. Context matters.
Text Representation: The goal is to take a set of documents, each of which is a relatively free-form sequence of words, and turn it into our familiar feature-vector form. A collection of documents is called a corpus. A document is composed of individual tokens or terms. Each document is one instance, but we don't know in advance what the features will be.
Bag of Words: Treat every document as just a collection of individual words. Ignore grammar, word order, sentence structure, and (usually) punctuation. Treat every word in a document as a potentially important keyword of the document. What will be the feature's value in a given document? Each token is represented by a one (if the token is present in the document) or a zero (if the token is not present in the document). This is a straightforward representation that is inexpensive to generate and tends to work well for many tasks.
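The binary bag-of-words representation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the helper names `tokenize` and `binary_bag_of_words`, the regular-expression tokenizer, and the two-document corpus are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def binary_bag_of_words(documents):
    """Return (vocabulary, vectors); each vector holds 1 if the term is present, else 0."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_sets))
    vectors = [[1 if term in tokens else 0 for term in vocabulary]
               for tokens in token_sets]
    return vocabulary, vectors

# Hypothetical two-document corpus, used only for illustration.
corpus = ["Jazz is great", "Great jazz, great players"]
vocabulary, vectors = binary_bag_of_words(corpus)

print(vocabulary)  # ['great', 'is', 'jazz', 'players']
print(vectors)     # [[1, 1, 1, 0], [1, 0, 1, 1]]
```

Note that the second document mentions "great" twice but still gets a 1 for that term: the binary representation records presence only, not counts.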
Pre-processing of Text: The following steps should be performed. The case should be normalized: every term is in lowercase. Words should be stemmed: suffixes are removed, e.g., noun plurals are transformed to singular forms. Stop-words should be removed: a stop-word is a very common word in English (or whatever language is being parsed), and typical words such as "the", "and", "of", and "on" are removed.
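The three pre-processing steps (case normalization, stemming, stop-word removal) might look like the sketch below. It is only an illustration under stated assumptions: `naive_stem` is a hypothetical, deliberately crude plural-stripper standing in for a real stemmer such as Porter's, and `STOP_WORDS` is a tiny hand-picked list rather than a standard one.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "of", "on"}

def naive_stem(word):
    """Crudely turn noun plurals into singular forms; not a real stemming algorithm."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # cities -> city
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # stages -> stage
    return word

def preprocess(text):
    """Normalize case, stem each token, then drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())   # case normalization + tokenizing
    stemmed = [naive_stem(token) for token in tokens]
    return [token for token in stemmed if token not in STOP_WORDS]

terms = preprocess("The Musicians played on the stages of the cities")
print(terms)  # ['musician', 'played', 'stage', 'city']
```

A rule this simple will mangle some words (for example, it would shorten "this" to "thi"), which is one reason established stemmers are normally used in practice.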
Term Frequency: Use the word count (frequency) in the document instead of just a zero or one. This differentiates between how many times a word is used.
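As a small sketch of the idea, raw term frequency is just a per-document word count. The function name `term_frequency` and the sample token list are assumptions for illustration; the tokens are presumed to be already pre-processed.

```python
from collections import Counter

def term_frequency(tokens):
    """Count how many times each term occurs in one document's token list."""
    return Counter(tokens)

# Hypothetical, already pre-processed document.
tokens = "jazz sax jazz bebop jazz sax".split()
tf = term_frequency(tokens)

print(tf["jazz"], tf["sax"], tf["bebop"])  # 3 2 1
```

Unlike the binary representation, "jazz" now scores higher than "bebop" because it occurs more often in the document.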
Normalized Term Frequency: Documents are of various lengths, and words are of different frequencies. Words should not be too common or too rare, so both an upper and a lower limit are placed on the number (or fraction) of documents in which a word may occur. Feature selection is often employed. The raw term frequencies are normalized in some way, such as by dividing each by the total number of words in the document or by the frequency of the specific term in the corpus.
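Two of the ideas on this slide can be sketched together: normalizing counts by document length, and keeping only terms whose document frequency falls between a lower and an upper limit. The function names, the thresholds `min_docs=2` and `max_fraction=0.9`, and the toy corpus are illustrative assumptions, not values from the session.

```python
from collections import Counter

def normalized_tf(tokens):
    """Divide each raw count by the total number of words in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def filter_by_document_frequency(tokenized_docs, min_docs=2, max_fraction=0.9):
    """Keep terms that appear in at least min_docs documents
    and in at most max_fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term for term, df in doc_freq.items()
            if df >= min_docs and df / n_docs <= max_fraction}

# Hypothetical pre-processed corpus of three short documents.
docs = [["jazz", "sax", "jazz", "bebop"],
        ["jazz", "piano"],
        ["jazz", "sax", "swing"]]

print(normalized_tf(docs[0]))              # {'jazz': 0.5, 'sax': 0.25, 'bebop': 0.25}
print(filter_by_document_frequency(docs))  # {'sax'}
```

Here "jazz" is dropped as too common (it occurs in every document) and "bebop", "piano", and "swing" are dropped as too rare (one document each).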
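TF-IDF: The TF-IDF value of a term $t$ in a document $d$ is

$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

where the Inverse Document Frequency (IDF) of a term $t$ is

$$\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$$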
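The TF-IDF definition on this slide (term frequency multiplied by one plus the log of total documents over documents containing the term) can be sketched as follows. Several choices here are assumptions: TF is taken as the raw count, the logarithm is the natural log (the slide does not state a base), and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf(term, tokenized_docs):
    """IDF(t) = 1 + log(total documents / documents containing t), natural log assumed."""
    n_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(tokenized_docs) / n_containing)

def tfidf(tokens, tokenized_docs):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF taken as the raw count in d."""
    counts = Counter(tokens)
    return {term: count * idf(term, tokenized_docs) for term, count in counts.items()}

# Hypothetical pre-processed corpus of three short documents.
docs = [["jazz", "sax", "jazz", "bebop"],
        ["jazz", "piano"],
        ["jazz", "sax", "swing"]]

scores = tfidf(docs[0], docs)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In this toy corpus "jazz" appears in every document, so its IDF is exactly 1 and its score is just its count, while the rarer "bebop" receives a higher IDF weight per occurrence.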
TFIDF. Source: Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking.
Example: Jazz Musicians. 15 prominent jazz musicians and excerpts of their biographies from Wikipedia. Nearly 2,000 features after stemming and stop-word removal! Consider the sample phrase "Famous jazz saxophonist born in Kansas who played bebop and latin".
Example: Jazz Musicians. Source: Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking.
Beyond Bag of Words: N-gram Sequences; Named Entity Extraction; Topic Models
N-gram Sequences: In some cases, word order is important and you want to preserve some information about it in the representation. A next step up in complexity is to include sequences of adjacent words as terms. Adjacent pairs are commonly called bi-grams. Example: "The quick brown fox jumps" would be transformed into {quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps}. N-grams greatly increase the size of the feature set.
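The bi-gram example on this slide can be reproduced with a short sketch. The function name `unigrams_and_bigrams` and the one-word stop list are assumptions for illustration; the slide's output omits "the", so the sketch removes it before pairing adjacent words.

```python
# Minimal stop-word list, just enough for the slide's example (an assumption).
STOP_WORDS = {"the"}

def unigrams_and_bigrams(text):
    """Return single words followed by adjacent-pair terms joined with an underscore."""
    tokens = [word for word in text.lower().split() if word not in STOP_WORDS]
    bigrams = [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams("The quick brown fox jumps")
print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']
```

Four words already yield seven features here, which hints at how quickly the feature set grows once adjacent pairs are included.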
Topic Models. Source: Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking.
Text Mining Example: The task is to predict the stock market based on the stories that appear on the news wires.
Mining News Stories to Predict Stock Price Movement. Source: Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking.
Thank You
References
Provost, F., Fawcett, T. (2013). Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking. O'Reilly, CA 95472.
Frank, E., Hall, M. A., Witten, I. H. (2016). The Weka Workbench. Morgan Kaufmann, Elsevier.
Brownlee, J. (2017). Machine Learning Mastery With Weka. E-Book.
Sharda, R., Delen, D., Turban, E. (2018). Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition. Pearson.