Understanding Text Representation and Mining in Business Intelligence and Analytics


Text representation and mining play a crucial role in Business Intelligence and Analytics. Dealing with text data, understanding why text is difficult, and the importance of text preprocessing are key aspects covered in this session. Learn about the goals of text representation, the concept of Bag of Words, and the pre-processing steps involved. Gain insights into transforming unstructured text into valuable data for analysis.


Uploaded on Sep 20, 2024



Presentation Transcript


  1. Business Intelligence and Analytics: Representing and Mining Text
     - Session 11

  2. Agenda
     - Text Representation
     - Text Preprocessing
     - Term Frequency and IDF
     - Case Study

  3. Dealing with Text
     - Data are represented in ways natural to problems from which they were derived
     - Vast amount of text
     - If we want to apply the many data mining tools that we have at our disposal, we must either
       - engineer the data representation to match the tools (representation engineering), or
       - build new tools to match the data

  4. Why Text is Difficult
     - Text is unstructured
       - Linguistic structure is intended for human communication and not computers
     - Word order matters sometimes
     - Text can be dirty
       - People write ungrammatically, misspell words, abbreviate unpredictably, and punctuate randomly
     - Synonyms, homographs, abbreviations, etc.
     - Context matters

  5. Text Representation
     - Goal: Take a set of documents, each of which is a relatively free-form sequence of words, and turn it into our familiar feature-vector form
     - A collection of documents is called a corpus
     - A document is composed of individual tokens or terms
     - Each document is one instance, but we don't know in advance what the features will be

  6. Bag of Words
     - Treat every document as just a collection of individual words
       - Ignore grammar, word order, sentence structure, and (usually) punctuation
       - Treat every word in a document as a potentially important keyword of the document
     - What will be the feature's value in a given document?
       - Each document is represented by a one (if the token is present in the document) or a zero (the token is not present in the document)
     - Straightforward representation
     - Inexpensive to generate
     - Tends to work well for many tasks
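
The one/zero representation described above can be sketched in a few lines of Python. This is only an illustration: the `tokenize` and `binary_bag_of_words` helpers and the two-document corpus are hypothetical, and the tokenizer assumes that lowercasing and dropping punctuation is acceptable.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def binary_bag_of_words(documents):
    """Return (vocabulary, vectors): one 1/0 entry per vocabulary term per document."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    # The features are not known in advance; they are discovered from the corpus.
    vocabulary = sorted(set().union(*token_sets))
    vectors = [[1 if term in tokens else 0 for term in vocabulary]
               for tokens in token_sets]
    return vocabulary, vectors

corpus = ["Jazz is music.", "Music is art!"]  # made-up example documents
vocabulary, vectors = binary_bag_of_words(corpus)
print(vocabulary)  # ['art', 'is', 'jazz', 'music']
print(vectors)     # [[0, 1, 1, 1], [1, 1, 0, 1]]
```

Word order and grammar are discarded: each document keeps only which vocabulary terms it contains.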

  7. Pre-processing of Text
     - The following steps should be performed:
       - The case should be normalized
         - Every term is in lowercase
       - Words should be stemmed
         - Suffixes are removed
         - E.g., noun plurals are transformed to singular forms
       - Stop-words should be removed
         - A stop-word is a very common word in English (or whatever language is being parsed)
         - Typical words such as "the", "and", "of", and "on" are removed
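
The three steps above might look roughly like this in Python. The stop-word list is deliberately tiny, and `crude_stem` is a hypothetical stand-in that only handles simple plurals; it is not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "and", "of", "on"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"   # biographies -> biography
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # musicians -> musician
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())       # 1. normalize case
    stems = [crude_stem(t) for t in tokens]            # 2. stem
    return [s for s in stems if s not in STOP_WORDS]   # 3. remove stop-words

result = preprocess("The Musicians and the Biographies of Jazz")
print(result)  # ['musician', 'biography', 'jazz']
```

The steps run in the order the slide lists them; many pipelines remove stop-words before stemming instead, which gives the same result on this input.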

  8. Term Frequency
     - Use the word count (frequency) in the document instead of just a zero or one
     - Differentiates between how many times a word is used
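
As a minimal sketch, raw term frequency is just a per-document word count. The `term_frequencies` helper and the sample sentence are made up for illustration, and the tokenizer assumes lowercase alphanumeric tokens.

```python
import re
from collections import Counter

def term_frequencies(document):
    """Count how many times each lowercase token occurs in one document."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return Counter(tokens)

tf = term_frequencies("Jazz, jazz, and more jazz: bebop and swing.")
print(tf["jazz"], tf["and"], tf["bebop"], tf["blues"])  # 3 2 1 0
```

A binary representation would record a one for both "jazz" and "bebop"; the counts distinguish the word used three times from the word used once.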

  9. Normalized Term Frequency
     - Documents of various lengths
     - Words of different frequencies
       - Words should not be too common or too rare
       - Both upper and lower limit on the number (or fraction) of documents in which a word may occur
       - Feature selection is often employed
     - The raw term frequencies are normalized in some way, such as by dividing each by the total number of words in the document or the frequency of the specific term in the corpus
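
Two of the ideas above, length normalization and document-frequency limits, can be sketched as follows. The function names, the thresholds (`min_docs=2`, `max_fraction=0.9`), and the three token lists are assumptions chosen for illustration, not values from the slides.

```python
from collections import Counter

def normalized_tf(tokens):
    """Divide each raw count by the total number of words in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def filter_by_document_frequency(tokenized_docs, min_docs=2, max_fraction=0.9):
    """Keep only terms that are neither too rare nor too common across the corpus."""
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    keep = {t for t, df in doc_freq.items()
            if df >= min_docs and df / n_docs <= max_fraction}
    return [[t for t in tokens if t in keep] for tokens in tokenized_docs]

docs = [["jazz", "sax", "jazz", "bebop"],
        ["jazz", "piano", "bebop"],
        ["jazz", "trumpet"]]

print(normalized_tf(docs[0]))              # {'jazz': 0.5, 'sax': 0.25, 'bebop': 0.25}
print(filter_by_document_frequency(docs))  # [['bebop'], ['bebop'], []]
```

Here "jazz" is dropped as too common (it occurs in every document) and "sax", "piano", and "trumpet" as too rare (one document each).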

  10. TF-IDF
      - $\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$
      - Inverse Document Frequency (IDF) of a term:
        $\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$
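
A small sketch of the TF-IDF computation, assuming the logarithm is the natural log (the slide does not state the base) and that TF is the raw count. The `idf` and `tfidf` helpers and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf(term, tokenized_docs):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    n_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(tokenized_docs) / n_containing)

def tfidf(term, tokens, tokenized_docs):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF taken as the raw count."""
    return Counter(tokens)[term] * idf(term, tokenized_docs)

docs = [["jazz", "sax", "jazz", "bebop"],
        ["jazz", "piano", "bebop"],
        ["jazz", "trumpet"]]

print(idf("jazz", docs))             # 1.0, since "jazz" occurs in every document
print(idf("sax", docs))              # 1 + log(3), since "sax" occurs in only one
print(tfidf("jazz", docs[0], docs))  # 2.0, two occurrences times an IDF of 1.0
```

A rare term gets a higher IDF than a common one, so it is weighted more heavily wherever it does occur.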

  11. TFIDF
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  12. Example: Jazz Musicians
      - 15 prominent jazz musicians and excerpts of their biographies from Wikipedia
      - Nearly 2,000 features after stemming and stop-word removal!
      - Consider the sample phrase "Famous jazz saxophonist born in Kansas who played bebop and latin"

  13. Example: Jazz Musicians
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  14. Example: Jazz Musicians
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  15. Example: Jazz Musicians
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  16. Example: Jazz Musicians

  17. Beyond Bag of Words
      - N-gram Sequences
      - Named Entity Extraction
      - Topic Models

  18. N-gram Sequences
      - In some cases, word order is important and you want to preserve some information about it in the representation
      - A next step up in complexity is to include sequences of adjacent words as terms
      - Adjacent pairs are commonly called bi-grams
      - Example: "The quick brown fox jumps"
        - It would be transformed into {quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps}
      - N-grams greatly increase the size of the feature set
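
The bi-gram example above can be reproduced with a short sketch. It assumes the text has already been lowercased and that the stop-word "The" has been removed, which is why the input starts at "quick"; the helper name is made up.

```python
def unigrams_and_bigrams(tokens):
    """Return the single tokens followed by all adjacent pairs joined with '_'."""
    bigrams = [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams

tokens = ["quick", "brown", "fox", "jumps"]  # "The" assumed removed as a stop-word
features = unigrams_and_bigrams(tokens)
print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']
```

Four tokens yield three extra bi-gram features here; across a full corpus the number of distinct adjacent pairs grows much faster than the number of distinct words.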

  19. Topic Models
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  20. Text Mining Example
      - Task: predict the stock market based on the stories that appear on the news wires

  21. Mining News Stories to Predict Stock Price Movement
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  22. Mining News Stories to Predict Stock Price Movement
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  23. Mining News Stories to Predict Stock Price Movement
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  24. Mining News Stories to Predict Stock Price Movement
      - Source: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking.

  25. Mining News Stories to Predict Stock Price Movement

  26. Thank You

  27. References
      - Provost, F.; Fawcett, T.: Data Science for Business; Fundamental Principles of Data Mining and Data-Analytic Thinking. O'Reilly, CA 95472, 2013.
      - Frank, E.; Hall, M. A.; Witten, I. H.: The Weka Workbench. Morgan Kaufmann, Elsevier, 2016.
      - Brownlee, J.: Machine Learning Mastery With Weka. E-Book, 2017.
      - Sharda, R.; Delen, D.; Turban, E.: Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition. Pearson, 2018.
