Exploring Text Mining Methods and Applications


Text mining shows a big disjunction in methods, and in what is easy and hard to do, between pre-LLM and LLM approaches. Pre-LLM methods remain useful for many tasks and are the topic of this lecture. Analyses with these methods are often conducted at the level of whether individual words are present, for example with Latent Semantic Analysis (LSA), or using ordered word pairs (bigrams) and triplets (trigrams). The lecture also surveys tools for semantic tagging, coherence, lexical sophistication, syntactic complexity, and sentiment analysis.


Uploaded on Oct 01, 2024 | 0 Views



Presentation Transcript


  1. Week 7, Video 1: Text Mining

  2. Text Mining: A big disjunction in methods, and in what is easy and hard to do, between pre-LLM and LLM approaches. Pre-LLM methods are still useful for a lot of things, and that's the topic of this lecture.

  3. Pre-LLM Text Mining: Very different from the types of interaction data and course data I've discussed throughout the rest of the class. This lecture only skims the very surface of this huge topic.

  4. Different Stuff Works: Stuff that works poorly in interaction data works better in text mining (Support Vector Machines). Stuff that works great in interaction data is less relevant in text mining (Bayesian Knowledge Tracing, IRT).

  5. Interesting Attributes of Textual Data: Really high dimensionality, with many, many words in a corpus of data. LLMs address differences in words through vector embeddings (but that's not today). Multiple levels of analysis that look very different from each other, from individual phonemes and graphemes to entire books.

  6. Analyses (with these methods) are often conducted at the level of whether individual words are seen. A popular algorithm for this is Latent Semantic Analysis (LSA). It represents utterances or paragraphs as rows, and each column is a word that can be present (1) or absent (0). It conducts singular value decomposition (a matrix factorization algorithm conceptually similar to factor analysis and, for that matter, NNMF) to find structure. It does not look at the syntax of sentences, just what words are present (Landauer, Foltz, & Laham, 1998). It does consider co-occurrence of words across large corpora.
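
The LSA steps on slide 6 can be sketched roughly as follows. This is a minimal illustration, not the procedure from Landauer, Foltz, & Laham (1998): the four toy documents, the whitespace tokenization, and the choice of `k = 2` latent dimensions are all assumptions made for the example, and NumPy is assumed to be installed.

```python
import numpy as np

# Toy corpus (hypothetical): two "math" utterances, two "writing" utterances.
docs = [
    "students solve algebra problems",
    "students practice algebra equations",
    "teachers grade essay writing",
    "teachers review essay drafts",
]

# One column per distinct word; naive whitespace tokenization is an assumption.
vocab = sorted({word for doc in docs for word in doc.split()})

# Rows are documents; each cell is 1 if the word is present, 0 if absent.
X = np.array([[1.0 if word in doc.split() else 0.0 for word in vocab]
              for doc in docs])

# Singular value decomposition of the document-by-word matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k strongest latent dimensions (k = 2 is an arbitrary choice here).
k = 2
doc_vecs = U[:, :k] * S[:k]


def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print("doc 0 vs doc 1:", round(cosine(doc_vecs[0], doc_vecs[1]), 3))
print("doc 0 vs doc 2:", round(cosine(doc_vecs[0], doc_vecs[2]), 3))
```

In this toy setup, documents that share words end up close together in the reduced space, while documents with no words in common do not. Word order plays no role, which matches the slide's point that LSA ignores syntax.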

  7. Alternatively, analysis is conducted using pairs of words, in order, called bigrams, or triplets of words, in order, called trigrams. Example: "Colorless green ideas sleep furiously". Bigrams: "Colorless green", "green ideas", "ideas sleep", "sleep furiously". Trigrams: "Colorless green ideas", "green ideas sleep", "ideas sleep furiously".
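
The bigram and trigram extraction on slide 7 can be sketched in a few lines. The helper name `ngrams` and the whitespace tokenization are assumptions for illustration; real pipelines usually handle punctuation and casing more carefully.

```python
def ngrams(text, n):
    """Return the ordered word n-grams of `text` as a list of tuples."""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Colorless green ideas sleep furiously"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

A sentence of five words yields four bigrams and three trigrams, matching the lists on the slide.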

  8. Semantic Tagging: Another approach is to reduce specific words to semantic categories, such as sports, business, or time, prior to analysis. This allows easier categorization of types of utterances that is less dependent on the presence of specific words. Tools: LIWC, WMatrix.

  9. Coherence: Another type of tool can provide coherence metrics, an updated version of reading level metrics such as Flesch-Kincaid (how hard is a text to read?). Tools: Coh-Metrix (Graesser, McNamara, & Kulikowich, 2011); TAACO (Crossley, Kyle, & McNamara, 2016).
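
For context on the older reading level metrics that slide 9 contrasts with coherence tools, here is a rough sketch of the Flesch-Kincaid grade level formula. It is not how Coh-Metrix or TAACO work. The syllable counter is a crude vowel-group heuristic and the sentence splitter is naive; both are assumptions for illustration, so scores will differ from those of established tools.

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of vowels (a rough heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


easy = flesch_kincaid_grade("The cat sat. The dog ran.")
hard = flesch_kincaid_grade(
    "Singular value decomposition uncovers latent semantic structure "
    "across heterogeneous corpora."
)
print(round(easy, 2), round(hard, 2))
```

The formula depends only on sentence length and word length in syllables, which is why the slide describes coherence metrics as an update: they look at how the parts of a text connect, not just at surface counts.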

  10. Lexical Sophistication: How complex/advanced a text and the words in it are. TAALES (Kyle & Crossley, 2015). Includes many other measures as well, including measures of cohesion.

  11. Syntactic Complexity: How complex/advanced the grammatical features in a text are. L2 Syntactic Complexity Analyzer (Lu, 2010).

  12. Sentiment Analysis: Assessing emotion or attitude, typically in terms of positive/negative. Many, many tools for this, including the aforementioned LIWC. In education, see SEANCE (Crossley, Kyle, & McNamara, 2016).

  13. Next lecture: On to LLMs!
