
PLSA Method for Time-Stamped Text Document Mining
Explore topic mining using the Probabilistic Latent Semantic Analysis (PLSA) method, as presented by Deepali P. Shelke. Learn about the motivation, process, and significance of text mining, which aims to move beyond simple document retrieval towards knowledge discovery in unstructured data. The presentation also covers the objectives and related work, including the Topic Detection and Tracking (TDT) model.
Presentation Transcript
Topic Mining for Time-Stamped Text Documents Using the PLSA Method
Presented by Deepali P. Shelke, M.E. Computer Engineering (2011-12), K.K. Wagh COE Research Center, Nasik, University of Pune.
Guided by Prof. Nitin M. Shahane, K.K. Wagh College of Engineering, Nasik, University of Pune.
Agenda: Introduction & Motivation; Related Work; Proposed System; Experimental Results & Discussion; Contribution; Conclusion; Future Scope; References.
Search versus Discover (Introduction)

                            Search (goal-oriented)    Discover (opportunistic)
Structured Data             Data Retrieval            Data Mining
Unstructured Data (Text)    Information Retrieval     Text Mining
Text Mining Process (Introduction)
Motivation for Text Mining (Introduction): Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation). Information-intensive business processes demand that we move beyond simple document retrieval to knowledge discovery. Structured numerical or coded information: 10%; unstructured or semi-structured information: 90%.
Objectives: To find c(w,d) and the term-document matrix. To calculate p(z), p(z|d), p(z|w), and p(w,d). To calculate the log-likelihood function L = Σ_w Σ_d c(w,d) log p(w,d). To apply the EM algorithm, repeating the normalization step until the change falls below a threshold value. To recalculate p(z), p(z|d), p(z|w), and p(w,d) for each word w in every document d. To find the k common topics. (A sketch of this pipeline appears below.)
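The objectives above chain into one EM loop over a term-document count matrix. Below is a minimal end-to-end sketch in Python/NumPy; it is an illustration, not the presented system. In particular, it assumes preprocessing has already produced the count matrix C, and it uses the standard asymmetric PLSA parameterization p(z|d) and p(w|z) in place of the slides' p(z|d) and p(z|w), which is an assumed but equivalent reformulation.

```python
import numpy as np

def plsa(C, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal PLSA via EM on a term-document count matrix C (docs x words).
    Returns p(z|d) (docs x k) and p(w|z) (k x words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = C.shape
    # Random initialization, rows normalized into probability distributions.
    p_z_d = rng.random((n_docs, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities p(z|d,w), proportional to p(z|d) p(w|z).
        q = p_z_d[:, :, None] * p_w_z[None, :, :]   # (docs, topics, words)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate both distributions from the expected
        # counts c(w,d) * p(z|d,w).
        weighted = C[:, None, :] * q
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        # Log-likelihood sum c(w,d) log p(w,d); stop once the change
        # is below the threshold, as in the objectives.
        ll = np.sum(np.where(C > 0, C * np.log(p_z_d @ p_w_z), 0.0))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return p_z_d, p_w_z
```

The workflow slides later in the deck walk through the individual steps of this loop on a small example.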
Related Work: TDT, Topic Detection and Tracking (Ref 13), by J. Allan in 1998. It consists of two tasks: retrospective detection and online detection. Limitations: both forms of event detection lack the ability to mine new patterns in document content from the stream, and they operate without advance knowledge of novel events.
LDA (Related Work): LDA, Latent Dirichlet Allocation (Ref 15). An example of a topic model, and the first graphical model for topic discovery, by David Blei, Andrew Ng, et al. in 2003.
CTM (Related Work): CTM, Correlated Topic Model (Ref 16), by D.M. Blei in 2005. It finds correlations between topics. Limitations: it requires a lot of calculation, and many general words appear inside the topics.
DTM (Related Work): DTM, Dynamic Topic Models (Ref 2), by D.M. Blei in 2006, which captures the evolution of topics in a sequentially organized corpus of documents.
LSA (Related Work): LSA, Latent Semantic Analysis (Ref 14), by Dumais in 1995. Its two key ideas are dimensionality reduction and construction of a latent space.
Dimensionality Reduction (Related Work): [worked example shown as a figure]
Limitations of related models:
LDA: stop words must be removed manually; LDA cannot represent relationships among topics.
CTM: requires a lot of calculation; many general words appear inside the topics; it is hard to determine the number of topics.
LSA: it may group documents together even if they share no common words; its loading values are hard to interpret with a probabilistic meaning.
Construction of the Latent Space (Related Work): Example: synonymy, such as buy/purchase; polysemy, such as "book" (d1: "I purchase a book"; d2: "I book a movie"). LSA may group documents together even if they do not share common words.
Proposed System: Input: N text documents (1…N); the number of topics K is given by the user. Pre-processing produces the vocabulary V; the PLSA method then alternates the E-step and M-step; Topic K extraction outputs the top K topical words.
Probabilistic Latent Semantic Analysis: an unsupervised technique. It is a two-level generative model: a document is a mixture of topics, and each topic has its own characteristic word distribution. Graphically, d → z → w: from a document d, a topic z is drawn with probability P(z|d), and then a word w with probability P(w|z).
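To make the two-level generative story concrete, here is a small standalone sketch that samples tokens from the model. The toy vocabulary, probabilities, and function name are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(p_z_d, p_w_z, vocab, length):
    """Two-level generative story: for each token, first draw a topic z
    from the document's mixture P(z|d), then a word w from P(w|z)."""
    words = []
    for _ in range(length):
        z = rng.choice(len(p_z_d), p=p_z_d)      # topic ~ P(z|d)
        w = rng.choice(len(vocab), p=p_w_z[z])   # word  ~ P(w|z)
        words.append(vocab[w])
    return words

# Toy example: two topics over a six-word vocabulary.
vocab = ["w0", "w1", "w2", "w3", "w4", "w5"]
p_z_d = np.array([0.7, 0.3])                     # this document leans to topic 0
p_w_z = np.array([[0.4, 0.3, 0.2, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 0.1, 0.2, 0.3, 0.4]])
print(generate_document(p_z_d, p_w_z, vocab, 10))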
Example: pLSA for images. Document = image, topic = object class (e.g., face), word = quantized visual feature; the same d → z → w structure applies.
Symbols & Their Meanings

Symbol   Meaning
z        topic
d        document
w        word
N        number of documents
Algorithm (overview): Input: N documents → Preprocessing + PLSA method → Common topics K in clusters.
Algorithm: Input: the set of N text documents (1…N). Step 1: The preprocessing step produces the vocabulary. Step 2: Using the Probabilistic Latent Semantic Analysis technique, extract correlated words into one cluster according to their meanings. Step 3:
The probability of a word w being inside a document d is p(w,d) = Σ_z p(z) p(z|d) p(z|w). Step 3.1: Calculate the log-likelihood function over all sequences: L = Σ_w Σ_d c(w,d) log p(w,d). Step 3.2: Perform the expectation-maximization (EM) algorithm, repeating the EM steps until convergence. Step 4: Produce the positions of words. Step 5: Display the common topics (K).
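Collected in one place, the quantities from Steps 3 through 3.2 read as follows. The E- and M-step updates are the standard PLSA (Hofmann) updates, written with P(w|z) in place of the slides' p(z|w); this reparameterization is an assumption, made because it yields the usual, equivalent form of the model.

```latex
% PLSA quantities from Steps 3--3.2; P(w|z) replaces the slides' p(z|w)
% (assumed, equivalent reparameterization).
\begin{align*}
  P(w, d) &= \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
           = P(d) \sum_{z} P(z \mid d)\, P(w \mid z) \\
  \mathcal{L} &= \sum_{w} \sum_{d} c(w, d)\, \log P(w, d) \\
  \text{E-step:}\quad
  P(z \mid d, w) &= \frac{P(z \mid d)\, P(w \mid z)}
                         {\sum_{z'} P(z' \mid d)\, P(w \mid z')} \\
  \text{M-step:}\quad
  P(w \mid z) &\propto \sum_{d} c(w, d)\, P(z \mid d, w), \qquad
  P(z \mid d) \propto \sum_{w} c(w, d)\, P(z \mid d, w)
\end{align*}
```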
Workflow: Topic Extraction. Find the number of occurrences of each word in every document and compute the term-by-document matrix. Example:

      w1  w2  w3  w4  w5  w6
d0     9   2   1   0   0   0
d1     8   3   2   1   0   0
d2     0   0   3   3   4   8
d3     0   2   0   2   4   7
d4     2   0   1   1   0   3

Probability of each topic: p(z0) = p(z1) = 0.5.
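A direct NumPy transcription of this example, used by the workflow sketches that follow (variable names C, n_docs, n_words, k, and p_z are illustrative assumptions):

```python
import numpy as np

# Term-by-document counts c(w,d) from the example above:
# rows are documents d0..d4, columns are words w1..w6.
C = np.array([[9, 2, 1, 0, 0, 0],
              [8, 3, 2, 1, 0, 0],
              [0, 0, 3, 3, 4, 8],
              [0, 2, 0, 2, 4, 7],
              [2, 0, 1, 1, 0, 3]], dtype=float)

n_docs, n_words = C.shape
k = 2
p_z = np.full(k, 0.5)   # uniform topic prior p(z0) = p(z1) = 0.5
```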
Calculate p(z|d) (Workflow): Calculate the probability that topic z is present in a given document d, p(z|d), starting from randomly selected values.

Randomly selected values:
      z0     z1
d1   0.46   0.26
d2   0.16   0.72
d3   0.41   0.99
d4   0.32   0.85
d5   0.95   0.46

Resulting p(z|d) values:
      z0      z1
d1   0.280   0.080
d2   0.0695  0.222
d3   0.178   0.305
d4   0.1391  0.25
d5   0.4130  0.1419
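Continuing the sketch from the term-document matrix above, the random initialization and per-document normalization could look like this (the seed is an arbitrary assumption, so the values will not match the table):

```python
rng = np.random.default_rng(0)

# Random positive starting values, one per (document, topic) pair,
# then normalize each row so that sum_z p(z|d) = 1 for every document.
raw = rng.random((n_docs, k))
p_z_d = raw / raw.sum(axis=1, keepdims=True)   # p(z|d)
```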
Calculate p(z|w) (Workflow): the probability of topic z associated with word w.

      z0      z1
w0   0.4121  0.204
w1   0.945   0.1239
w2   0.033   0.227
w3   0.1418  0.061
w4   0.054   0.008
w5   0.263   0.3760
Calculate log-likelihood (Workflow): L = Σ_w Σ_d c(w,d) log p(w,d). Per-cell values c(w,d) log p(w,d) (cells with c(w,d) = 0 contribute 0):

       w0        w1        w2         w3        w4        w5
d1   -10.8756  -14.437   -14.4378    0         0         0
d2   -26.5849  -31.361   -34.4557   -34.455    0         0
d3    0         0        -40.63184  -45.287   -51.365   -60.267
d4    0        -63.490    0         -66.7372  -73.0399  -81.4480
d5   -84.3623   0        -86.4703   -88.3814   0        -93.1578
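In the same sketch, the per-cell terms and their sum can be computed as below; masking the zero-count cells reproduces the zeros in the table. The randomly initialized word distribution (kept as p(w|z), per the assumed reparameterization) stands in for the slides' p(z|w) table:

```python
# Random topic-word distribution to start from (each row sums to 1).
p_w_z = rng.random((k, n_words))
p_w_z /= p_w_z.sum(axis=1, keepdims=True)

# p(w,d) up to the constant factor p(d), then the log-likelihood
# L = sum_{w,d} c(w,d) log p(w,d); cells with c(w,d) = 0 contribute 0.
p_wd = p_z_d @ p_w_z                             # shape (n_docs, n_words)
cell = np.where(C > 0, C * np.log(p_wd), 0.0)    # per-cell table values
log_lik = cell.sum()
```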
E-Step (Workflow): p(z|d,w) ∝ p(z) p(z|d) p(z|w), normalized over topics, for every document d.

For p(z, d0):
      w0       w1       w2       w3  w4  w5
z0   0.86132  0.64817  0.33093   0   0   0
z1   0.13867  0.35158  0.6690    0   0   0

For p(z, d1):
      w0       w1      w2      w3      w4  w5
z0   0.4589   0.2011  0.0632  0.3959   0   0
z1   0.54106  0.7988  0.9367  0.6040   0   0
E-Step (Workflow, continued):

For p(z, d2):
      w0  w1   w2       w3       w4       w5
z0    0   0   0.08359  0.25722  0.43907  0.1521
z1    0   0   0.9165   0.74278  0.56092  0.84783

For p(z, d3):
      w0   w1       w2   w3       w4       w5
z0    0   0.40246    0  0.38361  0.58188  0.24388
z1    0   0.59753    0  0.61638  0.41811  0.7561
E-Step (Workflow, continued):

For p(z, d4):
      w0       w1   w2       w3       w4   w5
z0   0.80470    0  0.17496  0.44601    0  0.29441
z1   0.19529    0  0.82503  0.55398    0  0.05581
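A vectorized version of this E-step in the running sketch computes the responsibilities for all documents at once. Under the assumed parameterization (p(z|d) and p(w|z)), the separate p(z) factor is absorbed into p(z|d):

```python
# E-step: p(z|d,w) proportional to p(z|d) * p(w|z), normalized over
# topics, giving one distribution over z per (document, word) cell.
q = p_z_d[:, :, None] * p_w_z[None, :, :]   # shape (n_docs, k, n_words)
q /= q.sum(axis=1, keepdims=True)
```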
M-Step (Workflow): p(z|w) ∝ Σ_d c(w,d) p(z|d,w).

      w0       w1       w2        w3       w4       w5
z0   0.18808  0.08618  0.104901  0.05580  0.20939  0.355631
z1   0.39274  0.12694  0.10727   0.15887  0.02854  0.185608
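The corresponding M-step in the sketch accumulates the expected counts c(w,d) · p(z|d,w) over documents and renormalizes each topic's word distribution:

```python
# M-step: expected counts per (document, topic, word) cell, then
# re-estimate the topic-word distribution p(w|z).
weighted = C[:, None, :] * q                # (n_docs, k, n_words)
p_w_z = weighted.sum(axis=0)                # sum over documents
p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # each topic row sums to 1
```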
Normalization step: p(z|d) ∝ Σ_w c(w,d) p(z|d,w), for every document d.

For p(z|d0):
      w0       w1       w2       w3  w4  w5
z0   0.43760  0.51716  0.74245   0   0   0

For p(z|d1):
      w0       w1       w2       w3       w4  w5
z0   0.19461  0.24960  0.47237  0.17882   0   0
Normalization step (continued):

For p(z|d2):
      w0  w1   w2      w3      w4      w5
z0    0   0   0.3739  0.1268  0.8089  0.46790

For p(z|d3):
      w0   w1      w2   w3       w4       w5
z0    0   0.5657    0  0.96133  0.46028  0.83775
Normalization step (continued):

For p(z|d4):
      w0       w1   w2       w3      w4   w5
z0   0.43261    0  0.73855  0.4072    0  0.8062
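The matching update for p(z|d) in the sketch sums the same expected counts over words instead of documents; in practice the E-step, M-step, and this normalization repeat until the log-likelihood change drops below the threshold:

```python
# Normalization step: re-estimate each document's topic mixture and
# rescale so that sum_z p(z|d) = 1 again.
p_z_d = weighted.sum(axis=2)                # (n_docs, k)
p_z_d /= p_z_d.sum(axis=1, keepdims=True)
```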
Final values of p(z|w) (Workflow):

      z0      z1
w0   1.0     2.169
w1   0.766   0.2336
w2   0.5559  0.444
w3   0.2009  0.7990
w4   5.7397  1.0
w5   6.3471  1.0

Final result: topic z0 → w0, w1, w2, w3; topic z1 → w1, w2, w3, w4, w5.
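Once EM has converged, the cluster assignment shown above amounts to reading off the strongest words per topic. A minimal sketch, continuing with the variables above (the cutoff of four words per topic is an assumption chosen to mirror the example):

```python
# Read off each topic's strongest words from p(w|z).
vocab = ["w0", "w1", "w2", "w3", "w4", "w5"]
for z in range(k):
    top = np.argsort(p_w_z[z])[::-1][:4]    # indices of the 4 largest entries
    print(f"z{z}:", [vocab[j] for j in top])
```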
Contribution: This system will be implemented at the application level, for example in a college portal, where it checks for sensitive data: if a document contains sensitive information, that document is excluded from analysis. Further contributions are personalized search and accuracy optimization.
Problem Definition: Let user U intend to extract k common topics with the help of system S. System S performs preprocessing on the set of N documents, i.e., removing stop words, where N = d1, d2, …, dn. System S then calculates p(z), p(z|d), p(z|w), and p(w,d) from randomly selected values. Next, it computes the log-likelihood function and iterates the expectation-maximization step until the change is less than a threshold value, recalculating p(z), p(z|d), p(z|w), and p(w,d) for each word w in every document d. Finally, it shows the result as the k common topics for the N documents given by user U.
Conclusion: Topic mining is a field of text mining. We addressed the problem of mining common topics from different text sequences using a novel method; multiple text sequences are often related to each other, as they share common topics. We verified the effectiveness of our method on real-world data sets, and the experimental results suggest that our method is able to find meaningful and discriminative topics from different text sequences.
Future Work: Contextual Probabilistic Latent Semantic Analysis.
References
[1] Xiang Wang, Xiaoming Jin, Meng-En Chen, Kai Zhang, and Dou Shen, "Topic Mining over Asynchronous Text Sequences," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 156-169, January 2012.
[2] J. Allan, R. Papka, and V. Lavrenko, "On-Line New Event Detection and Tracking," Proc. Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 37-45, 1998.
[3] D.M. Blei and J.D. Lafferty, "Dynamic Topic Models," Proc. Int'l Conf. Machine Learning (ICML), pp. 113-120, 2006.
[4] G.P.C. Fung, J.X. Yu, P.S. Yu, and H. Lu, "Parameter Free Bursty Events Detection in Text Streams," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 181-192, 2005.
[5] J.M. Kleinberg, "Bursty and Hierarchical Structure in Streams," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 91-101, 2002.
[6] A. Krause, J. Leskovec, and C. Guestrin, "Data Association for Topic Intensity Tracking," Proc. Int'l Conf. Machine Learning (ICML), pp. 497-504, 2006.
[7] Z. Li, B. Wang, M. Li, and W.-Y. Ma, "A Probabilistic Model for Retrospective News Event Detection," Proc. Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 106-113, 2005.
References (cont.)
[8] Q. Mei, C. Liu, H. Su, and C. Zhai, "A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs," Proc. Int'l Conf. World Wide Web (WWW), pp. 533-542, 2006.
[9] Q. Mei and C. Zhai, "Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 198-207, 2005.
[10] X. Wang and A. McCallum, "Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 424-433, 2006.
[11] T.L. Griffiths and M. Steyvers, "Finding Scientific Topics," Proc. Nat'l Academy of Sciences USA, vol. 101, suppl. 1, pp. 5228-5235, 2004.
[12] X. Wang, C. Zhai, X. Hu, and R. Sproat, "Mining Correlated Bursty Topic Patterns from Coordinated Text Streams," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 784-793, 2007.
[13] T. Hofmann, "Probabilistic Latent Semantic Indexing," Proc. Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 50-57, 1999.
[14] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," Proc. Neural Information Processing Systems (NIPS), pp. 601-608, 2001.
Thank You.
Any Questions?