Introduction to NLP Text Clustering


In this content, you will explore the concept of text clustering in Natural Language Processing (NLP). The material covers different clustering techniques such as exclusive and overlapping clusters, hierarchical versus flat clusters, and the cluster hypothesis. It elaborates on practical applications and provides examples of clustering methods like k-means. The content also discusses evaluation metrics for clustering effectiveness, including purity and the Rand Index. Additionally, links to demos and resources for further learning are provided.

  • NLP
  • Text Clustering
  • Clustering Techniques
  • k-means
  • Evaluation Metrics

Uploaded on Feb 18, 2025



Presentation Transcript


  1. NLP

  2. Introduction to NLP Text Clustering

  3. Clustering: exclusive vs. overlapping clusters; hierarchical vs. flat clusters. The cluster hypothesis: documents in the same cluster are relevant to the same query. How do we use it in practice?

  4. Example

  5. k-means: iteratively determine which cluster each point belongs to, then adjust the cluster centroids, then repeat. Needed: a small number k of desired clusters. Assignments are hard decisions: each point belongs to exactly one cluster.

  6. k-means
     initialize cluster centroids to arbitrary vectors
     while further improvement is possible do
         for each document d do
             find the cluster c whose centroid is closest to d
             assign d to cluster c
         end for
         for each cluster c do
             recompute the centroid of cluster c based on its documents
         end for
     end while

  7. Example: cluster the following vectors into two groups: A = <1,6>, B = <2,2>, C = <4,0>, D = <3,3>, E = <2,5>, F = <2,1>
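The k-means procedure above can be sketched in Python on the slide's example vectors. Note the initial centroids below (A and C) are an assumption for illustration; the slide does not specify them, and different initializations can yield different clusterings.

```python
import math

def kmeans(points, k, centroids, iters=100):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # hard decision: each point goes to exactly one cluster
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # recompute each centroid from its documents (keep old if empty)
        new_centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no further improvement possible
            break
        centroids = new_centroids
    return clusters, centroids

points = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]  # A..F
clusters, centroids = kmeans(points, k=2, centroids=[(1, 6), (4, 0)])
print(clusters)  # {A, E} and {B, C, D, F}
```

With this initialization the algorithm converges after two passes, separating the "high" points A and E from the rest.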

  8. Demos
     http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
     http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
     http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
     http://www.cc.gatech.edu/~dellaert/FrankDellaert/Software.html
     http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
     http://web.archive.org/web/20110223234358/http://www.ece.neu.edu/groups/rpl/projects/kmeans/

  9. Evaluation of Clustering: purity (considering the majority class in each cluster) and the Rand index (see next slide).

  10. Purity. Three clusters: {X X X O O}, {O O O X %}, {% % % % X X}. The majority classes contribute 3, 3, and 4 members respectively, so Purity = (3+3+4)/16 = 62.5%.
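Purity is straightforward to compute: each cluster is scored by its majority class. A minimal sketch on the slide's three clusters:

```python
from collections import Counter

def purity(clusters):
    """Fraction of items that belong to their cluster's majority class."""
    total = sum(len(c) for c in clusters)
    # each cluster contributes the count of its most common class label
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# the three clusters from the slide, as lists of class labels
clusters = [list("XXXOO"), list("OOOX%"), list("%%%%XX")]
print(purity(clusters))  # (3 + 3 + 4) / 16 = 0.625
```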

  11. Rand Index: accuracy at preserving pairwise object-object relationships. RI = (TP+TN)/(TP+FP+FN+TN). In the example, the pairs placed in the same cluster number TP+FP = C(5,2)+C(5,2)+C(6,2) = 10+10+15 = 35. Of these, the same-class (true positive) pairs are TP = C(3,2)+C(2,2)+C(3,2)+C(4,2)+C(2,2) = 3+1+3+6+1 = 14, so FP = 35-14 = 21.

  12. Rand Index. Same cluster, same class: TP = 14; same cluster, different class: FP = 21; different cluster, same class: FN = 21; different cluster, different class: TN = 64. RI = (TP+TN)/(TP+TN+FP+FN) = (14+64)/(14+64+21+21) = 78/120 = 0.65.
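The four pair counts can be verified by brute force: enumerate every pair of items and check whether cluster membership and class label agree. A sketch on the same 16-item example:

```python
from itertools import combinations

def rand_index(assignments):
    """assignments: one (cluster_id, class_label) pair per item.
    Returns the pair counts and RI = (TP+TN)/(TP+FP+FN+TN)."""
    tp = fp = fn = tn = 0
    for (c1, g1), (c2, g2) in combinations(assignments, 2):
        same_cluster, same_class = c1 == c2, g1 == g2
        if same_cluster and same_class:
            tp += 1          # relationship preserved: together, correctly
        elif same_cluster:
            fp += 1          # together, but different classes
        elif same_class:
            fn += 1          # same class, but split across clusters
        else:
            tn += 1          # apart, correctly
    return tp, fp, fn, tn, (tp + tn) / (tp + fp + fn + tn)

# the three clusters from the purity slide, with cluster ids 0, 1, 2
items = ([(0, x) for x in "XXXOO"] +
         [(1, x) for x in "OOOX%"] +
         [(2, x) for x in "%%%%XX"])
tp, fp, fn, tn, ri = rand_index(items)
print(tp, fp, fn, tn, ri)
```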

  13. Hierarchical clustering methods. Single-linkage: one close pair is sufficient to merge; disadvantage: long chains. Complete-linkage: all pairs have to be close; disadvantage: too conservative. Average-linkage: a compromise between the two.
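The three linkage criteria differ only in how inter-cluster distance is defined. A sketch, assuming Euclidean distance between 2-D points (the example clusters are illustrative, not from the slides):

```python
import math

def single_link(a, b):
    # distance of the closest cross-cluster pair: one close pair suffices
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    # distance of the farthest cross-cluster pair: all pairs must be close
    return max(math.dist(p, q) for p in a for q in b)

def average_link(a, b):
    # mean over all cross-cluster pairs: a compromise between the two
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

a, b = [(0, 0), (0, 1)], [(0, 3), (0, 6)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))
# single-link sees only the nearest points (2.0), complete-link only the
# farthest (6.0), average-link everything in between (4.0)
```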

  14. Hierarchical clustering [figure: dendrogram over eight points, labeled 1-8]

  15. Hierarchical agglomerative clustering: dendrograms. E.g., language similarity: http://odur.let.rug.nl/~kleiweg/clustering/clustering.html

  16. Clustering using dendrograms. Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A.
     REPEAT
         compute pairwise similarities
         identify closest pair
         merge pair into single node
     UNTIL only one node left
     Q: what is the equivalent Venn diagram representation?
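The REPEAT/UNTIL loop is the core of hierarchical agglomerative clustering. A minimal sketch using single-linkage over a few 2-D points (the points are illustrative stand-ins for the slide's sentences; any similarity function could replace the distance here):

```python
import math

def hac(points):
    """Agglomerative clustering: record each merge, dendrogram-style."""
    clusters = [(p,) for p in points]       # start: one cluster per point
    merges = []
    while len(clusters) > 1:                # UNTIL only one node left
        # compute pairwise similarities and identify the closest pair
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(math.dist(p, q)
                               for p in clusters[ij[0]]
                               for q in clusters[ij[1]]),  # single-linkage
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]  # merge pair into single node
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return merges

merges = hac([(0, 0), (0, 1), (5, 5), (5, 6)])
print(merges)  # the two tight pairs merge first, then the two groups
```

Reading the merge list bottom-up gives the dendrogram; the Venn-diagram equivalent is a set of nested regions, one per merge.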

  17. NLP
