Text Mining Methods and Applications

Text Mining

Week 7 Video 1

Text Mining



A big disjunction in methods and in what is easy and hard to do

Pre-LLM

LLM

Pre-LLM methods are still useful for a lot of things, and that’s the topic of this lecture

Pre-LLM Text Mining



Very different from the types of interaction data and course data I’ve discussed throughout the rest of the class

This lecture only skims the very surface of this huge topic

Different Stuff Works



Stuff that works poorly in interaction data works better in text mining

Support Vector Machines

Stuff that works great in interaction data is less relevant in text mining

Bayesian Knowledge Tracing, Item Response Theory (IRT)

Interesting Attributes of Textual Data



Really high dimensionality



Many, many words in a corpus of data

LLMs address differences in words through vector embeddings (but that’s not today)

Multiple levels of analysis that look very different from each other

From individual phonemes and graphemes to entire books

Analyses (with these methods) often conducted

At the level of whether individual words are seen

A popular algorithm for this is Latent Semantic Analysis (LSA)

Represents utterances or paragraphs as rows

And each column is a word that can be present (1) or absent (0)

Conducts singular value decomposition (a matrix factorization algorithm conceptually similar to factor analysis and, for that matter, non-negative matrix factorization, NNMF) to find structure

Does not look at syntax of sentences, just what words are present (Landauer, Foltz, & Laham, 1998)

Does consider co-occurrence of words across large corpora
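
A minimal sketch of the LSA steps above, assuming NumPy is available. The four-utterance corpus, the choice of k = 2, and the cosine helper are all hypothetical and only for illustration. The matrix uses the present (1) / absent (0) coding described on this slide; practical LSA pipelines often weight the entries instead.

```python
import numpy as np

# Hypothetical toy corpus: each string is one utterance (one row of the matrix).
utterances = [
    "the student solved the equation",
    "the student solved the problem",
    "the teacher graded the essay",
    "the teacher graded the exam",
]

# Vocabulary: one column per distinct word.
vocab = sorted({word for u in utterances for word in u.split()})

# Binary presence matrix: 1 if the word occurs in the utterance, else 0.
X = np.array([[1 if w in u.split() else 0 for w in vocab] for u in utterances])

# Singular value decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values for a low-dimensional representation.
k = 2  # assumed value, chosen only for this toy example
utterance_vectors = U[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Utterances that share words end up closer in the reduced space.
print(cosine(utterance_vectors[0], utterance_vectors[1]))  # two "student" utterances
print(cosine(utterance_vectors[0], utterance_vectors[2]))  # "student" vs. "teacher"
```

Word order plays no role here: shuffling the words inside any utterance leaves the matrix, and so the result, unchanged.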

Alternatively, analysis is conducted using

Pairs of words, in order, called bigrams

Triplets of words, in order, called trigrams

“Colorless green ideas sleep furiously”

Bigrams: “Colorless green”, “green ideas”, “ideas sleep”, “sleep furiously”

Trigrams: “Colorless green ideas”, “green ideas sleep”, “ideas sleep furiously”
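
The bigram and trigram lists above can be reproduced with a few lines of Python. This is a sketch that assumes simple whitespace tokenization; the function name ngrams is made up for this example.

```python
def ngrams(text, n):
    """Return the ordered n-word sequences in text, splitting on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Colorless green ideas sleep furiously"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of five words yields four bigrams and three trigrams, matching the lists on the slide.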

Semantic Tagging



Another approach is to reduce specific words to semantic categories, such as sports, business, or time, prior to analysis

Allows easier categorization of types of utterances that is less dependent on presence of specific words



LIWC



WMatrix
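
To make the idea concrete, here is a toy dictionary-based tagger. The category lexicon is hypothetical and written just for this sketch; it is not taken from LIWC or WMatrix, which use their own, much larger resources.

```python
from collections import Counter

# Hypothetical category lexicon, for illustration only.
CATEGORY_LEXICON = {
    "sports": {"game", "team", "score", "coach"},
    "business": {"market", "profit", "sales", "company"},
    "time": {"yesterday", "today", "tomorrow", "week"},
}

def tag_categories(text):
    """Count how many words in text fall into each semantic category."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'")  # drop surrounding punctuation
        for category, words in CATEGORY_LEXICON.items():
            if word in words:
                counts[category] += 1
    return counts

counts = tag_categories("The team won the game yesterday, and sales rose today.")
print(dict(counts))
```

Two utterances with no words in common can still get the same category profile, which is the point of tagging before analysis.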

Coherence



Another type of tool can provide coherence metrics



Updated version of reading level metrics such as Flesch-Kincaid

How hard is a text to read?

Coh-Metrix (Graesser, McNamara, & Kulikowich, 2011)



TAACO (Crossley, Kyle, & McNamara, 2016)
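
For reference, the older style of metric these tools update can be sketched as follows. The formula is the standard Flesch-Kincaid grade level: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sentence splitting and the vowel-group syllable counter are crude assumptions made for this example, and nothing here reproduces what Coh-Metrix or TAACO compute.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level with naive sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat. The dog ran."
dense = "Comprehensive institutional evaluations necessitate considerable deliberation."
print(flesch_kincaid_grade(simple))
print(flesch_kincaid_grade(dense))
```

The score depends only on sentence length and word length, so it says nothing about how well the sentences connect to each other, which is the gap coherence metrics aim to fill.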

Lexical Sophistication



How complex/advanced a text and the words in it are

TAALES (Kyle & Crossley, 2015)

Includes many other measures as well, including measures of cohesion

Syntactic Complexity



How complex/advanced the grammatical features in a text are



L2 Syntactic Complexity Analyzer (Lu, 2010)

Sentiment Analysis



Assessing emotion or attitude, typically in terms of positive/negative



Many, many tools for this



The aforementioned LIWC



In education, see SEANCE (Crossley, Kyle, & McNamara, 2016)
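
A minimal sketch of positive/negative scoring with word lists. The two lists below are hypothetical and invented for this example; they are not drawn from LIWC or SEANCE, which rely on validated dictionaries and many more indices.

```python
# Hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "helpful", "clear", "enjoyed"}
NEGATIVE = {"bad", "confusing", "boring", "hard", "frustrating"}

def sentiment_score(text):
    """Return (positive count - negative count) divided by total words."""
    words = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

print(sentiment_score("The lecture was clear and helpful."))
print(sentiment_score("The homework was confusing and frustrating."))
```

A score above zero leans positive and below zero leans negative. Because this only counts words, it misses negation ("not helpful") and sarcasm.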

Next lecture



On to LLMs!


Text mining shows a sharp split between pre-LLM and LLM methods in what is easy and hard to do. Pre-LLM methods remain useful for many applications and are the topic of this lecture. With these methods, analyses are often conducted at the level of whether individual words are present, using algorithms such as Latent Semantic Analysis, or on ordered word pairs (bigrams) and triplets (trigrams).

Uploaded by czuk on Oct 01, 2024

