STATISTICAL TOPIC MODELING

An introduction to statistical topic modeling by Andrea Tagarelli (University of Calabria, Italy): the generative view of text as a mixture of topics, probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and how these models are trained.

  • Statistical Topic Modeling
  • Research
  • Andrea Tagarelli
  • Data Analysis
  • Machine Learning


Presentation Transcript


  1. STATISTICAL TOPIC MODELING (Part 1)
     Andrea Tagarelli, Univ. of Calabria, Italy

  2. Statistical topic modeling (1/3)
     • Key assumption: text data is represented as a mixture of topics, i.e., probability distributions over terms
     • Generative model for documents: document features are generated by latent variables (see the sketch below)
     • Topic modeling vs. vector-space text modeling:
       • (Latent) semantic aspects underlying correlations between words
       • Document topical structure
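
To make the "mixture of topics" assumption concrete, here is a minimal sketch of the generative view; the two topics, the four-word vocabulary, and all probabilities are invented for illustration and are not from the slides.

    # Toy generative process: pick a topic per word, then a word per topic.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["gene", "dna", "model", "data"]

    # Per-topic word distributions (each row sums to 1).
    topics = np.array([
        [0.50, 0.40, 0.05, 0.05],   # a hypothetical "biology" topic
        [0.05, 0.05, 0.50, 0.40],   # a hypothetical "statistics" topic
    ])

    # Per-document topic distribution: this document mixes both topics.
    theta = np.array([0.7, 0.3])

    doc = []
    for _ in range(10):
        z = rng.choice(2, p=theta)       # latent topic assignment
        w = rng.choice(4, p=topics[z])   # observed word drawn from that topic
        doc.append(vocab[w])
    print(doc)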

  3. Statistical topic modeling (2/3)
     • Training on a (large) corpus to learn (illustrated below):
       • Per-topic word distributions
       • Per-document topic distributions
     [Blei, CACM, 2012]
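
As a concrete illustration of what training yields, the sketch below fits both kinds of distributions on a toy three-document corpus with scikit-learn's LatentDirichletAllocation; the library choice and the corpus are assumptions made for the example, not something the slides prescribe.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = [
        "gene dna genome sequencing",
        "model inference data likelihood",
        "dna model data gene",
    ]
    X = CountVectorizer().fit_transform(corpus)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)   # per-document topic distributions, one row per doc
    topic_word = lda.components_       # unnormalized per-topic word weights
    topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)  # normalize to distributions
    print(doc_topic.shape, topic_word.shape)   # (3, 2) and (2, vocabulary size)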

  4. Statistical topic modeling (3/3) [Hofmann, SIGIR, 1999]
     • Graphical plate notation: the standard representation for generative models
     • Rectangles (plates) represent repeated areas of the model, annotated with the number of times the variable(s) inside is repeated

  5. Observed and latent variables
     • Observed variable: we know its current value
     • Latent variable: a variable whose state cannot be observed
     • Estimation problem: estimate values for a set of distribution parameters that can best explain a set of observations
       • Most likely values of the parameters: maximum likelihood of a model
       • The likelihood is impossible to calculate in full, so it is approximated through:
         • The expectation-maximization (EM) algorithm: an iterative method to estimate the probabilities of unobserved, latent variables, repeated until a local optimum is obtained (sketched below)
         • Gibbs sampling: update parameters sample-wise
         • Variational inference: approximate the model by a simpler one
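
Because the slide describes EM only abstractly, here is a minimal EM sketch on the simplest latent-variable model that fits in a few lines, a two-component one-dimensional Gaussian mixture with unit variances; the data, initial guesses, and iteration count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    # Toy observations drawn from two components with means -2 and +2.
    x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

    mu = np.array([-1.0, 1.0])   # initial guesses for the latent component means
    pi = np.array([0.5, 0.5])    # mixing weights

    for _ in range(50):          # iterate toward a local optimum
        # E-step: posterior responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the expected assignments.
        pi = resp.mean(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

    print(pi.round(2), mu.round(2))   # recovers roughly [0.5 0.5] and [-2.  2.]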

  6. Probabilistic LSA
     • PLSA [Hofmann, 2001]: a probabilistic version of LSA, conceived to better handle problems of term polysemy
     [Plate diagram with variables z, w, d and plate repetition counts M, N]

  7. PLSA training (1/2)
     • Joint probability model: P(d, w) = P(d) Σ_z P(z|d) P(w|z)
     • Likelihood: L = Σ_d Σ_w n(d, w) log P(d, w), where n(d, w) is the number of occurrences of term w in document d

  8. PLSA training (2/2)
     • Training with EM (see the numpy sketch below):
       • Initialization of the per-topic word distributions P(w|z) and per-document topic distributions P(z|d)
       • E-step: P(z|d, w) = P(z|d) P(w|z) / Σ_z' P(z'|d) P(w|z')
       • M-step: P(w|z) ∝ Σ_d n(d, w) P(z|d, w) and P(z|d) ∝ Σ_w n(d, w) P(z|d, w)
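
A compact numpy sketch of exactly these updates, assuming the asymmetric PLSA formulation with P(w|z) and P(z|d); the document-term count matrix n, the corpus size, and the number of topics are all invented for the example.

    import numpy as np

    rng = np.random.default_rng(2)
    n = rng.integers(1, 6, size=(6, 10)).astype(float)   # toy doc-term counts
    D, W = n.shape
    K = 2                                                # number of latent topics

    # Initialization of both sets of distributions (rows sum to 1).
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(100):
        # E-step: P(z|d,w) for every (d, w) pair, shape (D, W, K).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        p_z_dw = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from expected counts.
        expected = n[:, :, None] * p_z_dw
        p_w_z = expected.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    # Log-likelihood of the counts under the trained model (the L of slide 7).
    print(round((n * np.log(p_z_d @ p_w_z)).sum(), 2))

Each EM iteration cannot decrease the likelihood, so the printed value corresponds to a local optimum, matching the convergence behavior described on slide 5.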

  9. Latent Dirichlet Allocation (1/2)
     • LDA [Blei et al., 2003]: adds a Dirichlet prior on the per-document topic distribution
     • 3-level scheme: corpus, documents, and terms
     • Terms are the only observed variables
     [Plate diagram: per-topic word distributions; per-document topic distribution; topic assignment to the word at position i in doc dj; word token at position i in doc dj; the inner plate repeats for each word position in a doc of length M, the outer plate for each doc in a collection of N docs]
     [Moens and Vulic, Tutorial @WSDM 2014]
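
The generative process this plate diagram encodes can be sampled directly; in the sketch below the symbol names phi (per-topic word distributions) and theta (per-document topic distribution) and all hyperparameter values are standard conventions assumed for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    K, V, N, M = 2, 8, 3, 12   # topics, vocabulary size, docs, words per doc

    alpha = np.full(K, 0.5)    # Dirichlet prior on per-document topic mixtures
    beta = np.full(V, 0.1)     # Dirichlet prior on per-topic word distributions

    phi = rng.dirichlet(beta, size=K)      # per-topic word distributions
    for j in range(N):                     # for each doc in a collection of N docs
        theta = rng.dirichlet(alpha)       # per-document topic distribution
        doc = []
        for i in range(M):                 # for each word position in a doc of length M
            z = rng.choice(K, p=theta)     # topic assignment to the word at position i
            w = rng.choice(V, p=phi[z])    # word token at position i
            doc.append(w)
        print(f"doc {j}: {doc}")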

  10. Latent Dirichlet Allocation (2/2)
      • Meaning of the Dirichlet priors: θ ~ Dir(α_1, ..., α_K)
        • Each α_k is a prior observation count for the number of times a topic z_k is sampled in a document, prior to any word observations
        • Analogously for the β_i, with φ ~ Dir(β_1, ..., β_V)
      • Inference for a new document: given α, β, and the observed words, infer θ
      • The exact inference problem is intractable: training through
        • Gibbs sampling (a collapsed-sampler sketch follows)
        • Variational inference
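
Since Gibbs sampling is one of the two training routes named here, below is a hedged sketch of a collapsed Gibbs sampler for LDA on a toy corpus of word ids; the documents, hyperparameters, and number of sweeps are invented, and a real implementation would add burn-in and convergence checks.

    import numpy as np

    rng = np.random.default_rng(4)
    docs = [[0, 1, 2, 1], [3, 4, 3, 2], [0, 1, 3, 4]]   # toy docs as word ids
    K, V = 2, 5
    alpha, beta = 0.5, 0.1

    # Count tables: topic-word, doc-topic, topic totals, plus assignments z.
    n_kw = np.zeros((K, V)); n_dk = np.zeros((len(docs), K)); n_k = np.zeros(K)
    z = [[int(rng.integers(K)) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1

    for _ in range(200):                       # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the current assignment
                n_kw[k, w] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
                # Full conditional P(z = k | everything else), up to a constant.
                p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k                    # resample and restore the counts
                n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1

    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    print(theta.round(2))                      # estimated per-document topic mixtures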
