Statistical Text Analysis Techniques Overview

 
The content covers key concepts in statistical text analysis, including maximum likelihood estimation, N-gram language model smoothing, and perplexity. It then turns to the vector space model in practice and to Latent Semantic Analysis, highlighting issues such as synonymy and polysemy, and discusses how to build a concept space for text mining through methods such as automatic term expansion, word sense disambiguation, and low-rank matrix approximation.

  • Statistical analysis
  • Text mining
  • N-gram model
  • Latent semantic analysis
  • Vector space model

Presentation Transcript


1. Recap: maximum likelihood estimation
Data: a collection of words $w_1, w_2, \ldots, w_N$
Model: multinomial distribution $p(W)$ with parameters $\theta_i = p(w_i)$, i.e., a unigram language model
Maximum likelihood estimator: $\hat{\theta} = \arg\max_\theta p(W|\theta)$
$p(W|\theta) = \binom{N}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)}$, hence $\log p(W|\theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \text{const}$
Using the Lagrange multiplier approach, we'll tune $\theta_i$ to maximize $L(W, \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$
Set partial derivatives to zero: $\frac{\partial L}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \Rightarrow \theta_i = -\frac{c(w_i)}{\lambda}$
Requirement from probability: since $\sum_{i=1}^{N} \theta_i = 1$, we get $\lambda = -\sum_{i=1}^{N} c(w_i)$
ML estimate: $\hat{\theta}_i = \frac{c(w_i)}{\sum_{j=1}^{N} c(w_j)}$
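
A minimal Python sketch of the ML estimate above, assuming whitespace-tokenized text; the unigram probabilities are just relative frequencies (the example corpus and the function name are illustrative, not from the slides):

```python
from collections import Counter

def mle_unigram(tokens):
    """ML estimate of a unigram language model: theta_i = c(w_i) / sum_j c(w_j)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# "text" occurs 2 times out of 6 tokens, so its ML estimate is 1/3.
theta = mle_unigram("text mining is fun text mining".split())
print(theta["text"])  # 0.333...
```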

2. Recap: illustration of N-gram language model smoothing
Max. likelihood estimate: $p_{ML}(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{c(w_i, w_{i-1}, \ldots, w_{i-n+1})}{c(w_{i-1}, \ldots, w_{i-n+1})}$
Smoothed LM: assign nonzero probabilities to the unseen words by discounting from the seen words
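
The slide does not fix a particular smoothing method, so as one simple illustration here is add-one (Laplace) smoothing for a bigram model; the toy sentence and function name are hypothetical:

```python
from collections import Counter

def add_one_bigram_prob(history, word, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed bigram probability: moves probability mass from seen to unseen events."""
    return (bigram_counts[(history, word)] + 1) / (unigram_counts[history] + vocab_size)

tokens = "the cat sat on the mat".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)
print(add_one_bigram_prob("the", "cat", bigram_counts, unigram_counts, V))  # seen: (1+1)/(2+5)
print(add_one_bigram_prob("the", "dog", bigram_counts, unigram_counts, V))  # unseen, but nonzero: (0+1)/(2+5)
```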

3. Recap: perplexity
The inverse of the likelihood of the test set as assigned by the language model, normalized by the number of words:
$PP(w_1, \ldots, w_N) = \left( \prod_{i=1}^{N} \frac{1}{p(w_i \mid w_{i-1}, \ldots, w_{i-n+1})} \right)^{1/N}$ (N-gram language model)
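
A small Python sketch of this formula, computed in log space for numerical stability; the uniform 0.25-probability model is a made-up check (its perplexity is exactly 4):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(-(1/N) * sum_i log p(w_i | history)): inverse probability of the test set, normalized by its length N."""
    return math.exp(-sum(log_probs) / len(log_probs))

print(perplexity([math.log(0.25)] * 100))  # 4.0
```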

4. Latent Semantic Analysis
Hongning Wang
CS@UVa

5. VS model in practice
Documents and queries are represented by term vectors
Terms are not necessarily orthogonal to each other
Synonymy: car vs. automobile
Polysemy: fly (action vs. insect)

6. Choosing basis for VS model
A concept space is preferred, so that the semantic gap can be bridged
[Figure: documents D1–D5 and a query placed along the concept axes Finance, Sports, and Education]

7. How to build such a space
Automatic term expansion
  Construction of a thesaurus: WordNet
  Clustering of words
Word sense disambiguation
  Dictionary-based: the relation between a pair of words should be similar in the text and in the dictionary's description
  Explore word usage context

8. How to build such a space
Latent Semantic Analysis
Assumption: there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to text generation
It means: the observed term-document association data is contaminated by random noise

9. How to build such a space
Solution: low-rank matrix approximation
[Figure: the observed term-document matrix viewed as the *true* concept-document matrix plus random noise over the word selection in each document]

10. Latent Semantic Analysis (LSA)
Low-rank approximation of the term-document matrix $C \in \mathbb{R}^{M \times N}$
Goal: remove noise in the observed term-document association data
Solution: find the matrix of rank $k$ that is closest to the original matrix in terms of the Frobenius norm:
$\hat{C} = \arg\min_{Z:\, \mathrm{rank}(Z)=k} \|Z - C\|_F^2 = \arg\min_{Z:\, \mathrm{rank}(Z)=k} \sum_{i=1}^{M} \sum_{j=1}^{N} (Z_{ij} - C_{ij})^2$

11. Basic concepts in linear algebra
Symmetric matrix: $A = A^T$
Rank of a matrix: the number of linearly independent rows (columns) in the matrix; for $A \in \mathbb{R}^{M \times N}$, $\mathrm{rank}(A) \le \min(M, N)$

12. Basic concepts in linear algebra
Eigen system: for a square matrix $A \in \mathbb{R}^{N \times N}$, if $A v = \lambda v$, then $v$ is called the right eigenvector of $A$ and $\lambda$ is the corresponding eigenvalue
For a symmetric full-rank matrix $A \in \mathbb{R}^{N \times N}$, we have the eigen-decomposition $A = U \Lambda U^T$, where the columns of $U$ are the orthogonal and normalized eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $A$
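
A quick NumPy check of this property on an arbitrary symmetric matrix (the values are chosen only for illustration):

```python
import numpy as np

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
eigvals, U = np.linalg.eigh(A)                     # eigh handles symmetric matrices
print(np.allclose(U @ np.diag(eigvals) @ U.T, A))  # True: A = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(3)))             # True: eigenvectors are orthonormal
```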

13. Basic concepts in linear algebra
Singular value decomposition (SVD): for a matrix $C \in \mathbb{R}^{M \times N}$ with rank $r$, we have $C = U \Sigma V^T$, where $U \in \mathbb{R}^{M \times r}$ and $V \in \mathbb{R}^{N \times r}$ are orthogonal matrices and $\Sigma$ is an $r \times r$ diagonal matrix with $\sigma_i = \sqrt{\lambda_i}$, the $\lambda_i$ being the eigenvalues of $C^T C$, ordered so that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$
We define the rank-$k$ approximation $C_k = U \Sigma_k V^T$, where $\Sigma_k$ places the $\sigma_i$ in descending order and keeps $\sigma_i$ for $i \le k$ while setting $\sigma_i = 0$ for $i > k$
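
A short NumPy sketch verifying the stated relation between the singular values of $C$ and the eigenvalues of $C^T C$, on a made-up matrix:

```python
import numpy as np

C = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 0., 2.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)   # C = U diag(s) V^T
eigvals = np.linalg.eigvalsh(C.T @ C)[::-1]        # eigenvalues of C^T C, descending
print(np.allclose(s**2, eigvals))                  # True: sigma_i = sqrt(lambda_i)
print(np.allclose(U @ np.diag(s) @ Vt, C))         # True: the decomposition reconstructs C
```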

14. Latent Semantic Analysis (LSA)
Solve LSA by SVD: map to a lower-dimensional space
$\hat{C} = \arg\min_{Z:\, \mathrm{rank}(Z)=k} \|Z - C\|_F^2 = C_k = U \Sigma_k V^T$
Procedure of LSA:
1. Perform SVD on the document-term adjacency matrix
2. Construct $C_k$ by keeping only the $k$ largest singular values in $\Sigma$ (setting the rest to zero)
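
A minimal NumPy sketch of this two-step procedure, assuming a small hypothetical document-term count matrix (the data and function name are illustrative only):

```python
import numpy as np

def lsa(C, k):
    """Rank-k LSA: SVD of the document-term matrix, keeping only the k largest singular values."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    C_k = U_k @ np.diag(s_k) @ Vt_k        # smoothed document-term associations
    return C_k, U_k, s_k, Vt_k

# Hypothetical 4-document x 5-term count matrix.
C = np.array([[1., 1., 0., 0., 0.],
              [1., 0., 1., 0., 0.],
              [0., 1., 1., 1., 0.],
              [0., 0., 0., 1., 1.]])
C_2, U_2, s_2, Vt_2 = lsa(C, k=2)
print(np.round(C_2, 2))  # zero counts may become nonzero where that fits the co-occurrence patterns
```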

15. Latent Semantic Analysis (LSA)
Another interpretation: let $C \in \mathbb{R}^{M \times N}$ be the document-term adjacency matrix (documents as rows)
$(C C^T)_{ij}$: document-document similarity, counting how many terms co-occur in $d_i$ and $d_j$
With $C = U \Sigma V^T$, we have $C C^T = U \Sigma^2 U^T$, i.e., an eigen-decomposition of the document-document similarity matrix
$d_i$'s new representation is then the $i$-th row of $U \Sigma$ in this system (space); in the lower-dimensional space we use only its first $k$ elements to represent $d_i$
The same analysis applies to the term-term similarity matrix $C^T C = V \Sigma^2 V^T$

16. Geometric interpretation of LSA
$C_k(i,j)$ measures the relatedness between $d_i$ and $w_j$ in the $k$-dimensional space
Therefore, as $C_k = U_k \Sigma_k V_k^T$:
$d_i$ is represented as the $i$-th row of $U_k \Sigma_k^{1/2}$
$w_j$ is represented as the $j$-th row of $V_k \Sigma_k^{1/2}$
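
A quick NumPy check of this interpretation on the same kind of toy matrix: pairing the document vectors $U_k \Sigma_k^{1/2}$ with the term vectors $V_k \Sigma_k^{1/2}$ reproduces the smoothed relatedness scores $C_k(i,j)$:

```python
import numpy as np

C = np.array([[1., 1., 0., 0., 0.],
              [1., 0., 1., 0., 0.],
              [0., 1., 1., 1., 0.],
              [0., 0., 0., 1., 1.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
docs  = U[:, :k] @ np.diag(np.sqrt(s[:k]))     # one row per document
terms = Vt[:k, :].T @ np.diag(np.sqrt(s[:k]))  # one row per term
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(docs @ terms.T, C_k))        # True
```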

17. Latent Semantic Analysis (LSA)
Visualization
[Figure: terms and documents plotted in the low-dimensional LSA space, with clusters for HCI and graph theory]

18. What are those dimensions in LSA
Principal component analysis

19. Latent Semantic Analysis (LSA)
What we have achieved via LSA:
Terms/documents that are closely associated are placed near one another in this new space
Terms that do not occur in a document may still be close to it, if that is consistent with the major patterns of association in the data
A good choice of concept space for the VS model!

20. LSA for retrieval
Project queries into the new document space: $\hat{q} = q^T V_k \Sigma_k^{-1}$
Treat the query as a pseudo-document represented by its term vector
Rank by cosine similarity between the query and documents in this lower-dimensional space
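
A hedged NumPy sketch of this retrieval step under the documents-as-rows convention used above; the toy matrix, query, and helper names are hypothetical:

```python
import numpy as np

def fold_in_query(q, Vt_k, s_k):
    """Project a query term vector into the k-dim LSA space: q_hat = q^T V_k Sigma_k^{-1} (query as pseudo-document)."""
    return q @ Vt_k.T @ np.diag(1.0 / s_k)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-document x 5-term matrix; rank-2 LSA space.
C = np.array([[1., 1., 0., 0., 0.],
              [1., 0., 1., 0., 0.],
              [0., 1., 1., 1., 0.],
              [0., 0., 0., 1., 1.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)
U_2, s_2, Vt_2 = U[:, :2], s[:2], Vt[:2, :]

q = np.array([1., 0., 1., 0., 0.])                     # query mentioning terms 1 and 3
q_hat = fold_in_query(q, Vt_2, s_2)
scores = [cosine(q_hat, U_2[i, :]) for i in range(4)]  # rows of U_2 are the documents in LSA space
print(np.argsort(scores)[::-1])                        # documents ranked by LSA similarity
```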

21. LSA for retrieval
[Figure: the query q = "human computer interaction" projected into the LSA space together with the HCI and graph theory documents]

22. Discussions
Computationally expensive: time complexity $O(MN^2)$ for the SVD
Optimal choice of $k$ is difficult
Difficult to handle a dynamic corpus
Difficult to interpret the decomposition results (we will come back to this later!)

23. LSA beyond text
Collaborative filtering
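
The same low-rank idea carries over to a user-item rating matrix; a simplified NumPy sketch with invented ratings (treating unobserved ratings as zeros, which real collaborative filtering would handle more carefully):

```python
import numpy as np

R = np.array([[5., 4., 0., 1.],    # hypothetical user-item ratings, 0 = unobserved
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # users and items get k-dim latent representations
print(np.round(R_hat, 1))                      # smoothed scores fill in the unobserved cells
```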

24. LSA beyond text
Eigenfaces

25. LSA beyond text
Cat from a deep neural network: one of the neurons in an artificial neural network, trained on still frames from unlabeled YouTube videos, learned to detect cats.

26. What you should know
Assumptions in LSA
Interpretation of LSA
Low-rank matrix approximation
Eigen-decomposition of the co-occurrence matrix for documents and terms

27. Today's reading
Introduction to Information Retrieval, Chapter 18: Matrix decompositions and latent semantic indexing
Deerwester, Scott C., et al. "Indexing by latent semantic analysis." JASIS 41.6 (1990): 391-407.

28. Happy Lunar New Year!
