Exploring Sources, Tools, and Datasets in Text Mining


A collection of sources, tools, and datasets for text mining, compiled by Bettina Berendt from her lectures and related publications. It covers tools specific to the Digital Humanities (DH) as well as NLP tools such as LingPipe, OpenNLP, the Stanford Parser, and NLTK for text analysis and processing.


Uploaded on Sep 09, 2024 | 0 Views



Presentation Transcript


  1. Sources, tools, datasets (thanks in advance for helping me extend this by sending tools you've found!). Bettina Berendt, Vienna Summer School 2015 (full slidesets to follow!)

  2. Lecture 1

  3. References
     A good textbook on Text Mining:
        Feldman, R. & Sanger, J. (2007). The Text Mining Handbook. Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
     An introduction similar to this one, but also covering unsupervised learning in some detail, and with lots of pointers to books, materials, etc.:
        Shaw, R. (2012). Text-mining as a Research Tool in the Humanities and Social Sciences. Presentation at the Duke Libraries, September 20, 2012. https://aeshin.org/textmining/
     An overview of news and (micro-)blogs mining:
        Berendt, B. (in press). Text mining for news and blogs analysis. To appear in C. Sammut & G.I. Webb (Eds.), Encyclopedia of Machine Learning and Data Mining. Berlin etc.: Springer. http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_encyclopedia_2015_with_publication_info.pdf
        See http://wiki.esi.ac.uk/Current_Approaches_to_Data_Mining_Blogs for more articles on the subject.
     Individual sources cited on the slides:
        Mei, Q. & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. KDD 2005: 198-207.
        Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness. In Proc. AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.6759
        Kirschenbaum, M. "The Remaking of Reading: Data Mining and the Digital Humanities." In NGDM '07: National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation. http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf
        Mueller, M. (2007). Notes towards a user manual of MONK. https://apps.lis.uiuc.edu/wiki/display/MONK/Notes+towards+a+user+manual+of+Monk
        Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L., & Ducceschi, L. (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems, 3(1). http://csee.essex.ac.uk/poesio/publications/poesio_et_al_ACM_TIIS_13.pdf
        von Ahn, L. (2005). Human Computation. PhD Dissertation. Computer Science Department, Carnegie Mellon University. http://reports-archive.adm.cs.cmu.edu/anon/usr0/ftp/usr/ftp/2005/abstracts/05-193.html
        von Ahn, L. (2006). Games with a Purpose. IEEE Computer 39(6): 92-94.

  4. More DH-specific tools
     Overviews of 71 tools for Digital Humanists:
        Simpson, J., Rockwell, G., Chartier, R., Sinclair, S., Brown, S., Dyrbye, A., & Uszkalo, K. (2013). Text Mining Tools in the Humanities: An Analysis Framework. Journal of Digital Humanities, 2(3). http://journalofdigitalhumanities.org/2-3/text-mining-tools-in-the-humanities-an-analysis-framework/
     See also the link collection on the Voyant documentation Web page.

  5. Tools (powerful, but require some computing experience)
     LingPipe: linguistic processing of text, including entity extraction, clustering and classification, etc. http://alias-i.com/lingpipe/
     OpenNLP: the most common NLP tasks, such as POS tagging, named entity extraction, chunking and coreference resolution. http://opennlp.apache.org/
     Stanford Parser and Part-of-Speech (POS) Tagger: http://nlp.stanford.edu/software/tagger.shtml
     NLTK: toolkit for teaching and researching classification, clustering and parsing. http://www.nltk.org/
     OpinionFinder: identifies subjective sentences, the source (holder) of the subjectivity, and words that are included in phrases expressing positive or negative sentiments. http://code.google.com/p/opinionfinder/
     Basic sentiment tokenizer plus some tools, by Christopher Potts: http://sentiment.christopherpotts.net
     Twitter NLP and Part-of-speech tagging: http://www.ark.cs.cmu.edu/TweetNLP/

  6. Further tools (thanks for your suggestions!)
     ATLAS.ti: qualitative data analysis. http://atlasti.com/ (commercial product, has a free trial version)

  7. Gamergate sources
        Budac, A., Chartier, R., Suomela, T., Gouglas, S., & Rockwell, G. (2015). #GamerGate: Distant Reading Games Discourse. Paper presented at the CGSA 2015 conference at the HSSFC Congress at University of Ottawa, Ottawa, Ontario, June 2015.
        Rockwell, G. (2015). Appendix 1: Ethics of Twitter Gamergate Research.
        Rockwell, G. & Suomela, T. (2015). "Gamergate Reactions", http://dx.doi.org/10.7939/DVN/10253 V5 [Version].

  8. Lecture 2

  9. Tools directly for sentiment analysis
     SentiStrength (sentistrength.wlv.ac.uk)
     TheySay (apidemo.theysay.io)
     Sentic (sentic.net/demo)
     Sentdex (sentdex.com)
     Lexalytics (lexalytics.com)
     Sentilo (wit.istc.cnr.it/stlab-tools/sentilo)
     Stanford NLP sentiment analysis (nlp.stanford.edu/sentiment)

  10. Lexicons
     Bing Liu's opinion lexicon: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
     MPQA subjectivity lexicon: http://www.cs.pitt.edu/mpqa/
     SentiWordNet. Project homepage: http://sentiwordnet.isti.cnr.it Python/NLTK interface: http://compprag.christopherpotts.net/wordnet.html
     Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/
     SenticNet: http://sentic.net
     Note: the lexicons disagree on some to many words (see Potts, 2013).
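
     As a rough illustration of how a sentiment lexicon of the kind listed on this slide is typically used, the sketch below sums word polarities over a text. It is a minimal sketch, not the method of any of the listed resources: the six-word `TOY_LEXICON` and the function names are hypothetical, and real lexicons are far larger and come in their own file formats.

```python
import re

# Hypothetical toy lexicon, for illustration only. Real lexicons (such as
# those listed on the slide) have thousands of entries and their own formats.
TOY_LEXICON = {
    "good": 1, "great": 1, "happy": 1,
    "bad": -1, "terrible": -1, "sad": -1,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every token; words not in the lexicon count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def polarity(text, lexicon=TOY_LEXICON):
    """Map the summed score to a coarse positive/negative/neutral label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["A great, happy day", "Terrible and sad news", "The table"]:
        print(sentence, "->", polarity(sentence))
```

     Because lexicons can assign different polarities to the same word, swapping the `lexicon` argument may change the label for the same text. Simple summation also ignores negation and context.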

  11. (Some) datasets, from Potts (2013), p. 5. More on Twitter datasets, including critical appraisal: Saif et al. (2013)

  12. More datasets, from Tsytsarau & Palpanas (2012)

  13. More datasets
     SNAP review datasets: http://snap.stanford.edu/data/
     Yelp dataset: http://www.yelp.com/dataset_challenge/
     User intentions in image capturing: a dataset going beyond text. Contributed by Desara Xhura, thanks! http://www.itec.uni-klu.ac.at/~mlux/wiki/doku.php?id=research:photointentionsdata
     Papers on this project: http://www.itec.uni-klu.ac.at/~mlux/wiki/doku.php?id=start

  14. Surveys used for this presentation
        Feldman, R. (2013). Techniques and applications for sentiment analysis. Commun. ACM 56(4): 82-89.
        Liu, B. & Zhang, L. (2012). A Survey of Opinion Mining and Sentiment Analysis. Mining Text Data 2012: 415-463.
        Pang, B. & Lee, L. (2007). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval 2(1-2): 1-135.
        Potts (2013). Introduction to Sentiment Analysis. http://www.stanford.edu/class/cs224u/slides/2013/cs224u-slides-02-26.pdf
        Tsytsarau, M. & Palpanas, T. (2012). Survey on mining subjective data on the web. Data Min. Knowl. Discov. 24(3): 478-514.
     My summary of these (an earlier and longer version of the present slides):
        Berendt, B. (2014). Opinion mining, sentiment analysis, and beyond. Lecture at the Summer School Foundations and Applications of Social Network Analysis & Mining, June 2-6, 2014, Athens, Greece. http://people.cs.kuleuven.be/~bettina.berendt/Talks/berendt_opinion_mining_summerschool_2014.pptx

  15. Other references
        Carenini, G., R. Ng, and E. Zwart. Extracting knowledge from evaluative text. In Proceedings of Third Intl. Conf. on Knowledge Capture (K-CAP-05), 2005.
        Ding, X. and B. Liu. Resolving object and attribute coreference in opinion mining. In Proceedings of International Conference on Computational Linguistics (COLING-2010), 2010.
        Reforgiato Recupero, D., Presutti, V., Consoli, S., Gangemi, A., & Nuzzolese, A.G. (2014). Sentilo: Frame-based Sentiment Analysis. Cognitive Computation, 7(2):211-225.
        Gangemi, A., Presutti, V., & Reforgiato Recupero, D. (2014). Frame-Based Detection of Opinion Holders and Topics: A Model and a Tool. IEEE Comp. Int. Mag. 9(1): 20-30.
        Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08). ACM, New York, NY, USA, 219-230.
        R. Mihalcea, C. Banea, and J. Wiebe, Learning multilingual subjective language via cross-lingual projections, in Proceedings of the Association for Computational Linguistics (ACL), pp. 976-983, Prague, Czech Republic, June 2007.
        Mihalcea, R. & Liu, H. (2006). A Corpus-based Approach to Finding Happiness. In Proc. AAAI Spring Symposium CAAW. http://www.cse.unt.edu/~rada/papers/mihalcea.aaaiss06.pdf
        Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 309-319.
        Popescu, A. and O. Etzioni. Extracting product features and opinions from reviews. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2005), 2005.
        Qiu, G., B. Liu, J. Bu, and C. Chen. Expanding domain sentiment lexicon through double propagation. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI-2009), 2009.
        Qiu, G., B. Liu, J. Bu, and C. Chen. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 2011.
        E. Riloff and J. Wiebe, Learning extraction patterns for subjective expressions, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.
        Saif, H., Fernandez, M., He, Y. and Alani, H. (2013). Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold. Workshop: Emotion and Sentiment in Social and Expressive Media: approaches and perspectives from AI (ESSEM) at AI*IA Conference, Turin, Italy.
        Saif, H., Fernandez, M., He, Y. and Alani, H. (2014). SentiCircles for Contextual and Conceptual Semantic Sentiment Analysis of Twitter. 11th Extended Semantic Web Conference, Crete, Greece.
        Tan, C., Lee, L., Tang, J., Jiang, L., Zhou, M., & Li, P. (2011). User-level sentiment analysis incorporating social networks. In Proc. 17th SIGKDD Conference (1397-1405). San Diego, CA: ACM Digital Library.
        Thelwall, M. (2013). Heart and Soul: Sentiment Strength Detection in the Social Web with Sentistrength. In J. Holyst (Ed.), Cyberemotions (pp. 1-14). http://sentistrength.wlv.ac.uk/documentation/SentiStrengthChapter.pdf
        J. M. Wiebe, T. Wilson, R. Bruce, M. Bell, and M. Martin, Learning subjective language, Computational Linguistics, vol. 30, pp. 277-308, September 2004.
        H. Yu and V. Hatzivassiloglou, Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.

  16. Lecture 3

  17. References
        Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired 16.07. Available at http://edge.org/3rd_culture/anderson08/anderson08_index.html
        Berendt, B. (2015). Big Capta, Bad Science? On two recent books on Big Data and its revolutionary potential. http://people.cs.kuleuven.be/~bettina.berendt/Reviews/BigData.pdf
        Berendt, B., Büchler, M., & Rockwell, G. (2015). Is it research or is it spying? Thinking-through ethics in Big Data AI and other knowledge sciences. Künstliche Intelligenz, 29(2), 223-232.
        boyd, d. & Crawford, K. (2012). Critical questions for Big Data. Information, Communication & Society, 15:5, 662-679, DOI: 10.1080/1369118X.2012.678878.
        De Wolf, R., Vanderhoven, E., Berendt, B., Pierson, J. & Schellens, T. (submitted). Self-reflection in privacy research on social network sites.
        Kitchin, R. (2014a). The Data Revolution. Big Data, Open Data, Data Infrastructures & Their Consequences. London: Sage.
        Kitchin, R. (2014b). Big Data, new epistemologies and paradigm shifts. Big Data & Society, April-June 2014, 1-12.
        Kramer, A., Guillory, J., & Hancock, J. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111, 8788-8790. http://www.pnas.org/content/111/24/8788.full.pdf+html
        Moretti, F. (2005). Graphs, Maps, Trees. Abstract Models for Literary History. London: Verso, p. 30 (cited from the paperback published in 2007).
        Pauen, M. & Welzer, H. (2015). Autonomie: Eine Verteidigung [Autonomy: A Defence]. Frankfurt am Main: S. Fischer Verlag.
        Tufekci, Z. (2014). What Happens to #Ferguson Affects Ferguson: Net Neutrality, Algorithmic Filtering and Ferguson. https://medium.com/message/ferguson-is-also-a-net-neutrality-issue-6d2f3db51eb0
