Understanding Clustering in Information Retrieval

Explore the concept of clustering in information retrieval: the process of grouping documents into clusters of similar documents. Learn about the benefits of clustering for search applications, user interfaces, and visualization of document collections; the differences between classification and clustering; and the importance of clustering for improving search recall. Dive into algorithms for finding cluster structure and how clustering can speed up search and improve results.

  • Clustering
  • Information Retrieval
  • Search Applications
  • Document Clustering
  • Search Recall

Presentation Transcript


  1. Introduction to Information Retrieval, Lecture 4: Clustering. wyang@ntu.edu.tw. Reference: Introduction to Information Retrieval, Ch. 16 & 17.

  2. Clustering: Introduction

  3. Clustering: definition. (Document) clustering is the process of grouping a set of documents into clusters of similar documents: documents within a cluster should be similar, and documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning, where unsupervised means there is no labeled or annotated data.

  4. A data set with a clear cluster structure. [Figure] Exercise: propose an algorithm for finding the cluster structure in this example.

  5. Classification vs. clustering. Classification: supervised learning; classes are human-defined and part of the input to the learning algorithm. Clustering: unsupervised learning; clusters are inferred from the data without human input.

  6. Why cluster documents? For whole-corpus analysis and navigation (a better user interface); for improving recall in search applications (better search results); for better navigation of search results (effective recall will be higher); and for speeding up vector space retrieval (faster search).

  7. For visualizing a document collection: Wise et al., "Visualizing the non-visual"; PNNL ThemeScapes; Cartia. [Figure: themed landscape where mountain height = cluster size]

  8. For improving search recall. Cluster hypothesis: closely associated documents tend to be relevant to the same requests. Therefore, to improve search recall: cluster the docs in the corpus; when a query matches a doc D, also return the other docs in the cluster containing D. The hope: the query "car" will also return docs containing "automobile", because clustering grouped docs containing "car" together with those containing "automobile". Why might this happen? Docs about the same topic tend to share much of their vocabulary, so they end up in the same cluster even when an individual doc lacks the query term.
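
To make the expansion step concrete, here is a minimal sketch in Python; the names expand_results, doc_cluster, and cluster_docs are illustrative, not from the lecture. It assumes the corpus has already been clustered.

```python
# Hypothetical sketch of cluster-based recall expansion. doc_cluster maps a
# doc id to its cluster id; cluster_docs maps a cluster id to the set of
# doc ids it contains.
def expand_results(matching_docs, doc_cluster, cluster_docs):
    expanded = set(matching_docs)
    for d in matching_docs:
        expanded |= cluster_docs[doc_cluster[d]]  # add d's cluster-mates
    return expanded
```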

  9. For better navigation of search results: grouping search results thematically, e.g., clusty.com / Vivisimo (Velocity enterprise search).

  11. Issues for clustering (1). General goal: put related docs in the same cluster and unrelated docs in different clusters. Representation for clustering: a document representation, plus a notion of similarity/distance.

  12. Issues for clustering (2). How to decide the number of clusters: fixed a priori (assume the number of clusters K is given), or data-driven (semiautomatic methods for determining K). Avoid very small and very large clusters, and define clusters that are easy to explain to the user.

  13. Clustering algorithms. Flat (partitional) algorithms usually start with a random (partial) partitioning and refine it iteratively, e.g., K-means clustering. Hierarchical algorithms create a hierarchy, either bottom-up (agglomerative) or top-down (divisive).

  14. Flat (partitioning) algorithms. Partitioning method: construct a partition of n documents into a set of K clusters. Given a set of documents and the number K, find the partition into K clusters that optimizes the chosen partitioning criterion. A globally optimal solution would require exhaustively enumerating all partitions; effective heuristic methods are the K-means and K-medoids algorithms.

  15. Hard vs. soft clustering. Hard clustering: each document belongs to exactly one cluster; this is more common and easier to do. Soft clustering: a document can belong to more than one cluster; useful for applications like creating browsable hierarchies. For example, to put sneakers in two clusters (sports apparel and shoes), you can only do that with a soft clustering approach. Only hard clustering is discussed in this class.

  16. K-means algorithm

  17. K-means. Perhaps the best-known clustering algorithm; simple, and works well in many cases. Use it as the default/baseline for clustering documents.

  18. K-means. In the vector space model, K-means assumes documents are real-valued vectors. It clusters based on the centroids (a.k.a. center of gravity or mean) of the points in a cluster $c$: $\vec{\mu}(c) = \frac{1}{|c|}\sum_{\vec{x} \in c} \vec{x}$. Reassignment of instances to clusters is based on distance to the current cluster centroids.

  19. K-means algorithm. 1. Select K random docs $\{s_1, s_2, \ldots, s_K\}$ as seeds. 2. Until the clustering converges or another stopping criterion is met: 2.1 for each doc $d_i$, assign $d_i$ to the cluster $c_j$ such that $\text{dist}(x_i, s_j)$ is minimal; 2.2 for each cluster $c_j$, set $s_j = \mu(c_j)$ (update each seed to the centroid of its cluster).
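
A minimal NumPy sketch of this pseudocode; the function name, the choice of Euclidean distance, and the convergence test are illustrative, not prescribed by the slides.

```python
# K-means sketch: X is an (n, m) array of document vectors.
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # Step 1: select K random docs as seeds.
    seeds = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2.1: assign each doc to the cluster with the nearest seed.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2.2: update each seed to the centroid of its cluster.
        new_seeds = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                              else seeds[j] for j in range(K)])
        if np.allclose(new_seeds, seeds):  # converged: centroids unchanged
            break
        seeds = new_seeds
    return assign, seeds
```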

  20. K-means algorithm. [Figure]

  21. K-means example (K = 2). Pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged! [Figure: the sequence of steps on a small point set]

  22. Termination conditions. Several possibilities, e.g.: a fixed number of iterations; the doc partition is unchanged; the centroid positions don't change.

  23. Convergence of K-means. Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change? K-means is a special case of a general procedure known as the Expectation-Maximization (EM) algorithm, and EM is known to converge; however, the number of iterations could be large.

  24. Convergence of K-means: define the goodness measure of cluster $k$ as the sum of squared distances from the cluster centroid, $G_k = \sum_i (d_i - c_k)^2$ (summed over all $d_i$ in cluster $k$), and $G = \sum_k G_k$. Reassignment monotonically decreases $G$, since each vector is assigned to the closest centroid; recomputing centroids also decreases it, since the mean minimizes the sum of squared distances.
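
The goodness measure $G$ can be computed directly; a sketch reusing the arrays returned by the kmeans sketch above (names are again illustrative):

```python
import numpy as np

def goodness(X, assign, seeds):
    # G = sum over all docs of the squared distance to their cluster centroid
    return float(((X - seeds[assign]) ** 2).sum())
```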

  25. Time complexity. Computing the distance between two docs is O(m), where m is the dimensionality of the vectors. Reassigning clusters: O(Kn) distance computations, i.e., O(Knm). Computing centroids: each doc gets added once to some centroid, so O(nm). Assuming these two steps are each done once per iteration, for I iterations the total is O(IKnm). Since I, K, and m are typically small, K-means scales linearly with the number of documents n.

  26. Issue (1): seed choice. Results can vary based on the random seed selection; some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings. Remedies: select good seeds using a heuristic (e.g., a doc least similar to any existing mean), or try out multiple starting points. Example showing sensitivity to seeds: if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
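
Trying multiple starting points is easy to express with the sketches above: run K-means from several random seeds and keep the run with the lowest $G$. This is a common heuristic, not one mandated by the slides; X is as in the sketches above.

```python
import numpy as np

# Ten restarts with different seeds; keep the lowest-G clustering.
runs = [kmeans(X, K=2, rng=np.random.default_rng(s)) for s in range(10)]
assign, seeds = min(runs, key=lambda r: goodness(X, r[0], r[1]))
```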

  27. Issue (2): how many clusters? Sometimes the number of clusters K is given: partition n docs into a predetermined number of clusters. Sometimes finding the right number of clusters is part of the problem: given docs, partition them into an appropriate number of subsets. E.g., for query results the ideal value of K is not known up front, though the UI may impose limits.

  28. If K is not specified in advance: suggest K automatically, using heuristics based on N or a K vs. cluster-size diagram. There is a tradeoff between having too few clusters (less focus within each cluster) and having too many clusters (harder for the user to work with).
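
One simple data-driven sketch along these lines: compute $G$ for a range of K values and look for the "knee" where adding clusters stops paying off. This assumes the kmeans and goodness sketches above.

```python
# G always decreases as K grows; the "knee" in G vs. K suggests a good K.
for K in range(2, 10):
    assign, seeds = kmeans(X, K)
    print(K, round(goodness(X, assign, seeds), 2))
```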

  29. K-means variations. Recomputing the centroid after every assignment (rather than after all points are reassigned) can improve the speed of convergence of K-means.

  30. Evaluation of clustering

  31. What is a good clustering? Internal criterion: a good clustering produces high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the document representation and the similarity measure used.

  32. External criteria for clustering quality. Based on a gold-standard data set (ground truth), e.g., the Reuters collection we also used for the evaluation of classification. Goal: the clustering should reproduce the classes in the gold standard. Quality is measured by the clustering's ability to discover some or all of the hidden patterns.

  33. External criterion: purity. $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters and $C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes. For each cluster $\omega_k$, find the class $c_j$ with the most members $n_{kj}$ in $\omega_k$; sum all the $n_{kj}$ and divide by the total number of points: $\text{purity}(\Omega, C) = \frac{1}{N}\sum_k \max_j |\omega_k \cap c_j|$.

  34. Example for computing purity. [Figure: 17 points in three clusters] To compute purity: $\max_j |\omega_1 \cap c_j| = 5$ (class x, cluster 1); $\max_j |\omega_2 \cap c_j| = 4$ (class o, cluster 2); $\max_j |\omega_3 \cap c_j| = 3$ (class ⋄, cluster 3). Purity is $(1/17)(5 + 4 + 3) \approx 0.71$.
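
A small sketch of the purity computation. The input below is one distribution of 17 points consistent with the slide's majority counts (5, 4, 3); the exact per-cluster breakdown is illustrative.

```python
# Purity sketch: clusters and classes are parallel lists giving, for each
# point, its cluster id and its gold-standard class.
from collections import Counter

def purity(clusters, classes):
    total = 0
    for k in set(clusters):
        members = [c for g, c in zip(clusters, classes) if g == k]
        total += Counter(members).most_common(1)[0][1]  # majority class count
    return total / len(clusters)

# A distribution consistent with the slide's counts (majorities 5, 4, 3 of 17):
clusters = [1]*6 + [2]*6 + [3]*5
classes = ["x"]*5 + ["o"] + ["o"]*4 + ["x", "d"] + ["d"]*3 + ["x"]*2
print(round(purity(clusters, classes), 2))  # 0.71
```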

  35. Rand index. Classify every pair of points by both labelings: A = same class in the ground truth and same cluster in the clustering; B = different classes but same cluster; C = same class but different clusters; D = different classes and different clusters.

  36. Rand index (symmetric version): $RI = \frac{A + D}{A + B + C + D}$. Compare with standard precision and recall: $P = \frac{A}{A + B}$ and $R = \frac{A}{A + C}$.

  37. Rand index example: with A = 20 (same class, same cluster), C = 24 (same class, different clusters), B = 20 (different classes, same cluster), and D = 72 (different classes, different clusters), $RI = (20 + 72)/136 \approx 0.68$.
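
The Rand index can also be computed by enumerating all point pairs; a sketch using the same list format as the purity example (pairs where the two labelings agree contribute to A or D):

```python
from itertools import combinations

def rand_index(clusters, classes):
    pairs = list(combinations(zip(clusters, classes), 2))
    # A pair agrees if both labelings put it together, or both split it.
    agree = sum((g1 == g2) == (c1 == c2) for (g1, c1), (g2, c2) in pairs)
    return agree / len(pairs)
```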

  38. DBSCAN algorithm

  39. DBSCAN. Density-based clustering: clusters are dense regions in the data space, separated by regions of lower object density. A cluster is defined as a maximal set of density-connected points. DBSCAN can discover clusters of arbitrary shape. The name stands for Density-Based Spatial Clustering of Applications with Noise.

  40. DBSCAN definitions. The Eps-neighborhood of a point p is the set of points within radius Eps of p. "High density" means the Eps-neighborhood of a point contains at least MinPts points. [Figure: for the radius shown and MinPts = 4, the density of p is "high" and the density of q is "low"]

  41. DBSCAN: core points, border points, and noise points. A point is a core point if it has at least a specified number of points (MinPts) within Eps; these are points in the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is neither a core point nor a border point.
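
These three definitions translate directly into a small NumPy sketch; X, eps, and min_pts are illustrative names, and distances are Euclidean.

```python
import numpy as np

def point_types(X, eps, min_pts):
    # Pairwise distances; neighbors[i, j] is True if j lies in the
    # Eps-neighborhood of i (a point counts as its own neighbor here).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= min_pts           # at least MinPts within Eps
    border = ~core & (neighbors & core[None, :]).any(axis=1)  # near a core point
    noise = ~core & ~border
    return core, border, noise
```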

  42. Example. See http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf

  43. DBSCAN: directly density-reachable. A point q is directly density-reachable from a point p if p is a core object and q is in the Eps-neighborhood of p. [Figure: with MinPts = 4, q is directly density-reachable from p, but p is not directly density-reachable from q] Density-reachability is therefore asymmetric.
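
In practice one would typically call a library implementation rather than hand-code the reachability machinery; a minimal scikit-learn sketch, with eps and min_samples chosen arbitrarily for illustration:

```python
from sklearn.cluster import DBSCAN

# X: (n, m) array of points; eps and min_samples mirror Eps and MinPts.
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)
print((labels == -1).sum(), "noise points")  # label -1 marks noise
```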

  49. Visualize the algorithm: http://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

  50. After clustering. [Figure]
