Text Mining Methods and Applications

 
Text Mining
 
Week 7 Video 1
 
Text Mining
 
A big disjunction in methods and what is easy and
hard to do
 
Pre-LLM
LLM
 
Pre-LLM methods still useful for a lot of things, and
that’s the topic of this lecture
 
Pre-LLM Text Mining
 
Very different from the types of interaction data
and course data I’ve discussed throughout the rest
of the class
 
This lecture only skims the very surface of this huge
topic
Different Stuff Works
 
Stuff that works poorly in interaction data works
better in text mining
Support Vector Machines
 
Stuff that works great in interaction data is less
relevant in text mining
Bayesian Knowledge Tracing, IRT
Interesting Attributes of Textual Data
 
Really high dimensionality
Many many words in a corpus of data
LLMs address differences in words through vector
embeddings (but that’s not today)
 
Multiple levels of analysis that look very different
from each other
From individual phonemes and graphemes to entire
books
Analyses (with these methods) often conducted
 
At level of whether individual words are seen
 
A popular algorithm for this is Latent Semantic Analysis
(LSA)
Represents utterances or paragraphs as rows
And each column is a word that can be present (1) or
absent (0)
Conducts singular value decomposition (a matrix
factorization algorithm conceptually similar to factor
analysis and for that matter NNMF) to find structure
Does not look at syntax of sentences, just what words are
present (Landauer, Foltz, & Laham, 1998)
Does consider co-occurrence of words across large corpuses
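 
The LSA steps above can be sketched roughly as follows. This is a minimal illustration, assuming NumPy is available; the toy documents and the names binary_term_matrix, lsa_embed, and cosine are made up for this example and are not from the lecture. It uses the present/absent (1/0) matrix described on the slide, although many LSA implementations weight word counts instead.

```python
import numpy as np

def binary_term_matrix(docs):
    """Rows are documents, columns are vocabulary words; 1 = present, 0 = absent."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = np.zeros((len(docs), len(vocab)))
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            matrix[i, index[word]] = 1.0
    return matrix, vocab

def lsa_embed(matrix, k):
    """Truncated SVD: keep the k largest singular values, return document coordinates."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus: three school-related utterances and one unrelated one.
docs = [
    "students solve algebra problems",
    "students solve geometry problems",
    "teachers grade algebra homework",
    "the cat chased the mouse",
]
matrix, vocab = binary_term_matrix(docs)
doc_vectors = lsa_embed(matrix, k=2)

# Documents sharing words should end up closer than documents sharing none.
similar = cosine(doc_vectors[0], doc_vectors[1])
unrelated = cosine(doc_vectors[0], doc_vectors[3])
```

Word order is never used here: each row records only which words occur, which is the sense in which LSA ignores syntax.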
Alternatively, analysis is conducted using
 
Pairs of words, in order, called bigrams
Triplets of words, in order, called trigrams
 
“Colorless green ideas sleep furiously”
Bigrams: “Colorless green”, “green ideas”, “ideas
sleep”, “sleep furiously”
Trigrams: “Colorless green ideas”, “green ideas
sleep”, “ideas sleep furiously”
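 
A short sketch of how the bigrams and trigrams above can be produced. The function name ngrams and the whitespace tokenization are assumptions for illustration; real pipelines usually also handle punctuation and casing.

```python
def ngrams(text, n):
    """Return every run of n consecutive words, in order, as a string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Colorless green ideas sleep furiously"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```

A sentence of N words yields N - 1 bigrams and N - 2 trigrams, so the feature space grows quickly as n increases.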
Semantic Tagging
 
Another approach is to reduce specific words to
semantic categories, such as sports, business, time,
prior to analysis
 
Allows easier categorization of types of utterances
that is less dependent on presence of specific words
 
LIWC
WMatrix
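 
The idea of reducing words to categories can be sketched as below. The lexicon is a hypothetical toy dictionary written for this example; it is not the category set used by LIWC or WMatrix, and tag_categories is an illustrative name.

```python
from collections import Counter

# Hypothetical toy lexicon for illustration only; not the LIWC or WMatrix dictionaries.
TOY_LEXICON = {
    "soccer": "sports", "goal": "sports", "team": "sports",
    "profit": "business", "market": "business", "sales": "business",
    "yesterday": "time", "tomorrow": "time", "week": "time",
}

def tag_categories(text, lexicon=TOY_LEXICON):
    """Count how many words in the text fall into each semantic category."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    return Counter(lexicon[t] for t in tokens if t in lexicon)

sports_counts = tag_categories("The team scored a goal yesterday.")
business_counts = tag_categories("Sales fell last week, but the market recovered.")
```

Two utterances with no words in common can still receive similar category counts, which is why this representation depends less on specific words.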
Coherence
 
Another type of tool can provide coherence metrics
 
Updated version of reading level metrics such as
Flesch-Kincaid
 
How hard is a text to read?
 
Coh-Metrix (Graesser, McNamara, &
Kulikowich, 2011)
TAACO (Crossley, Kyle, & McNamara, 2016)
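 
For comparison with the older reading level metrics mentioned above, here is a rough sketch of the Flesch-Kincaid grade level, which combines words per sentence and syllables per word. The syllable counter is a crude vowel-run heuristic assumed for this example, so scores will differ from those of established tools. This is not how Coh-Metrix or TAACO compute their measures.

```python
import re

def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Grade level = 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

easy = flesch_kincaid_grade("The cat sat. The dog ran.")
hard = flesch_kincaid_grade(
    "Interdisciplinary educational researchers systematically "
    "investigate computational methodologies."
)
```

The formula sees only sentence length and word length, which is the gap that coherence tools aim to fill.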
Lexical Sophistication
 
How complex/advanced a text and the words in it
are
 
TAALES (Kyle & Crossley, 2015)
Includes many other measures as well, including
measures of cohesion
Syntactic Complexity
 
How complex/advanced the grammatical features
in a text are
 
L2 Syntactic Complexity Analyzer (Lu, 2010)
Sentiment Analysis
 
Assessing emotion or attitude, typically in terms of
positive/negative
 
Many, many tools for this
The aforementioned LIWC
In education, see SEANCE (Crossley, Kyle, &
McNamara, 2016)
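 
A minimal sketch of lexicon-based positive/negative scoring. The word lists and the polarity function are hypothetical and written only for this example; LIWC and SEANCE rely on much larger, validated dictionaries and are not implemented this way here.

```python
# Hypothetical toy polarity lexicon; real tools use far larger, validated word lists.
POSITIVE = {"good", "great", "love", "helpful", "clear"}
NEGATIVE = {"bad", "confusing", "hate", "boring", "frustrating"}

def polarity(text):
    """Return a score in [-1, 1]: (positive - negative) / (positive + negative)."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    # Text with no lexicon words is treated as neutral.
    return 0.0 if total == 0 else (pos - neg) / total

praise = polarity("The lecture was clear and helpful!")
complaint = polarity("This homework is confusing and frustrating.")
mixed = polarity("Great examples, but boring slides")
```

Word counting like this misses negation ("not helpful") and sarcasm, one reason dedicated sentiment tools go beyond simple word lists.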
 
Next lecture
 
On to LLMs!

Text mining methods split sharply between pre-LLM and LLM approaches, which differ in what is easy and hard to do. Pre-LLM methods remain useful for many tasks and are the focus of this lecture. Analyses with these methods are often conducted at the level of which individual words are present, for example with Latent Semantic Analysis, which draws on word co-occurrence across large corpora. Analysis can also use ordered word pairs (bigrams) and triplets (trigrams), as well as tools for semantic tagging, coherence, lexical sophistication, syntactic complexity, and sentiment analysis.


Uploaded on Oct 01, 2024


