STATISTICAL TOPIC MODELING

An introduction to statistical topic modeling by Andrea Tagarelli (University of Calabria, Italy): the generative view of text as a mixture of topics, probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and how these models are trained.

  • Statistical Topic Modeling
  • Research
  • Andrea Tagarelli
  • Data Analysis
  • Machine Learning


Presentation Transcript


  1. STATISTICAL TOPIC MODELING (Part 1)
     Andrea Tagarelli, Univ. of Calabria, Italy

  2. Statistical topic modeling (1/3)
     • Key assumption: text data is represented as a mixture of topics, i.e., probability distributions over terms
     • Generative model for documents: document features are generated by latent variables (see the sketch below)
     • Topic modeling vs. vector-space text modeling:
       • (Latent) semantic aspects underlying correlations between words
       • Document topical structure
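
To make the "mixture of topics" assumption concrete, here is a minimal sketch of the generative view; the two topics, the four-word vocabulary, and all probabilities are invented for illustration and are not from the slides.

    # Toy generative process: pick a topic per word, then a word per topic.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["gene", "dna", "model", "data"]

    # Per-topic word distributions (each row sums to 1).
    topics = np.array([
        [0.50, 0.40, 0.05, 0.05],   # a hypothetical "biology" topic
        [0.05, 0.05, 0.50, 0.40],   # a hypothetical "statistics" topic
    ])

    # Per-document topic distribution: this document mixes both topics.
    theta = np.array([0.7, 0.3])

    doc = []
    for _ in range(10):
        z = rng.choice(2, p=theta)       # latent topic assignment
        w = rng.choice(4, p=topics[z])   # observed word drawn from that topic
        doc.append(vocab[w])
    print(doc)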

  3. Statistical topic modeling (2/3)
     • Training on a (large) corpus to learn (illustrated below):
       • Per-topic word distributions
       • Per-document topic distributions
     [Blei, CACM, 2012]
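
As a concrete illustration of what training yields, the sketch below fits both kinds of distributions on a toy three-document corpus with scikit-learn's LatentDirichletAllocation; the library choice and the corpus are assumptions made for the example, not something the slides prescribe.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = [
        "gene dna genome sequencing",
        "model inference data likelihood",
        "dna model data gene",
    ]
    X = CountVectorizer().fit_transform(corpus)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)   # per-document topic distributions, one row per doc
    topic_word = lda.components_       # unnormalized per-topic word weights
    topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)  # normalize to distributions
    print(doc_topic.shape, topic_word.shape)   # (3, 2) and (2, vocabulary size)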

  4. Statistical topic modeling (3/3) [Hofmann, SIGIR, 1999]
     • Graphical plate notation: the standard representation for generative models
     • Rectangles (plates) represent repeated areas of the model, annotated with the number of times the variable(s) inside is repeated

  5. Observed and latent variables
     • Observed variable: we know its current value
     • Latent variable: a variable whose state cannot be observed
     • Estimation problem: estimate values for a set of distribution parameters that can best explain a set of observations
       • Most likely values of the parameters: maximum likelihood of a model
       • The likelihood is impossible to calculate in full, so it is approximated through:
         • The expectation-maximization (EM) algorithm: an iterative method to estimate the probabilities of unobserved, latent variables, repeated until a local optimum is obtained (sketched below)
         • Gibbs sampling: update parameters sample-wise
         • Variational inference: approximate the model by a simpler one
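
Because the slide describes EM only abstractly, here is a minimal EM sketch on the simplest latent-variable model that fits in a few lines, a two-component one-dimensional Gaussian mixture with unit variances; the data, initial guesses, and iteration count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    # Toy observations drawn from two components with means -2 and +2.
    x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

    mu = np.array([-1.0, 1.0])   # initial guesses for the latent component means
    pi = np.array([0.5, 0.5])    # mixing weights

    for _ in range(50):          # iterate toward a local optimum
        # E-step: posterior responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the expected assignments.
        pi = resp.mean(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

    print(pi.round(2), mu.round(2))   # recovers roughly [0.5 0.5] and [-2.  2.]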

  6. Probabilistic LSA
     • PLSA [Hofmann, 2001]: a probabilistic version of LSA, conceived to better handle problems of term polysemy
     [Plate diagram with variables z, w, d and plate repetition counts M, N]

  7. PLSA training (1/2)
     • Joint probability model: P(d, w) = P(d) Σ_z P(z|d) P(w|z)
     • Likelihood: L = Σ_d Σ_w n(d, w) log P(d, w), where n(d, w) is the number of occurrences of term w in document d

  8. PLSA training (2/2)
     • Training with EM (see the numpy sketch below):
       • Initialization of the per-topic word distributions P(w|z) and per-document topic distributions P(z|d)
       • E-step: P(z|d, w) = P(z|d) P(w|z) / Σ_z' P(z'|d) P(w|z')
       • M-step: P(w|z) ∝ Σ_d n(d, w) P(z|d, w) and P(z|d) ∝ Σ_w n(d, w) P(z|d, w)
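
A compact numpy sketch of exactly these updates, assuming the asymmetric PLSA formulation with P(w|z) and P(z|d); the document-term count matrix n, the corpus size, and the number of topics are all invented for the example.

    import numpy as np

    rng = np.random.default_rng(2)
    n = rng.integers(1, 6, size=(6, 10)).astype(float)   # toy doc-term counts
    D, W = n.shape
    K = 2                                                # number of latent topics

    # Initialization of both sets of distributions (rows sum to 1).
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(100):
        # E-step: P(z|d,w) for every (d, w) pair, shape (D, W, K).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        p_z_dw = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from expected counts.
        expected = n[:, :, None] * p_z_dw
        p_w_z = expected.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    # Log-likelihood of the counts under the trained model (the L of slide 7).
    print(round((n * np.log(p_z_d @ p_w_z)).sum(), 2))

Each EM iteration cannot decrease the likelihood, so the printed value corresponds to a local optimum, matching the convergence behavior described on slide 5.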

  9. Latent Dirichlet Allocation (1/2)
     • LDA [Blei et al., 2003]: adds a Dirichlet prior on the per-document topic distribution
     • 3-level scheme: corpus, documents, and terms
     • Terms are the only observed variables
     [Plate diagram: per-topic word distributions; per-document topic distribution; topic assignment to the word at position i in doc dj; word token at position i in doc dj; the inner plate repeats for each word position in a doc of length M, the outer plate for each doc in a collection of N docs]
     [Moens and Vulic, Tutorial @WSDM 2014]
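
The generative process this plate diagram encodes can be sampled directly; in the sketch below the symbol names phi (per-topic word distributions) and theta (per-document topic distribution) and all hyperparameter values are standard conventions assumed for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    K, V, N, M = 2, 8, 3, 12   # topics, vocabulary size, docs, words per doc

    alpha = np.full(K, 0.5)    # Dirichlet prior on per-document topic mixtures
    beta = np.full(V, 0.1)     # Dirichlet prior on per-topic word distributions

    phi = rng.dirichlet(beta, size=K)      # per-topic word distributions
    for j in range(N):                     # for each doc in a collection of N docs
        theta = rng.dirichlet(alpha)       # per-document topic distribution
        doc = []
        for i in range(M):                 # for each word position in a doc of length M
            z = rng.choice(K, p=theta)     # topic assignment to the word at position i
            w = rng.choice(V, p=phi[z])    # word token at position i
            doc.append(w)
        print(f"doc {j}: {doc}")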

  10. Latent Dirichlet Allocation (2/2)
      • Meaning of the Dirichlet priors: θ ~ Dir(α_1, ..., α_K)
        • Each α_k is a prior observation count for the number of times a topic z_k is sampled in a document, prior to any word observations
        • Analogously for the β_i, with φ ~ Dir(β_1, ..., β_V)
      • Inference for a new document: given α, β, and the observed words, infer θ
      • The exact inference problem is intractable: training through
        • Gibbs sampling (a collapsed-sampler sketch follows)
        • Variational inference
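
Since Gibbs sampling is one of the two training routes named here, below is a hedged sketch of a collapsed Gibbs sampler for LDA on a toy corpus of word ids; the documents, hyperparameters, and number of sweeps are invented, and a real implementation would add burn-in and convergence checks.

    import numpy as np

    rng = np.random.default_rng(4)
    docs = [[0, 1, 2, 1], [3, 4, 3, 2], [0, 1, 3, 4]]   # toy docs as word ids
    K, V = 2, 5
    alpha, beta = 0.5, 0.1

    # Count tables: topic-word, doc-topic, topic totals, plus assignments z.
    n_kw = np.zeros((K, V)); n_dk = np.zeros((len(docs), K)); n_k = np.zeros(K)
    z = [[int(rng.integers(K)) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1

    for _ in range(200):                       # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the current assignment
                n_kw[k, w] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
                # Full conditional P(z = k | everything else), up to a constant.
                p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k                    # resample and restore the counts
                n_kw[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1

    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    print(theta.round(2))                      # estimated per-document topic mixtures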
