
Text Classification and Clustering Methods Overview
Explore text classification and clustering methods in this comprehensive lecture covering topics such as supervised vs. unsupervised learning, clustering techniques, similarity measurement, and more. Understand the difference between soft and hard clustering, hierarchical vs. non-hierarchical clustering, and the importance of features in document classification. Dive into the world of natural language processing with practical insights on categorizing text data efficiently.
Presentation Transcript
Corpora and Statistical Methods Lecture 13 Albert Gatt
In this lecture
- Text categorisation
- Overview of clustering methods
- Machine learning methods for text classification
Text classification
Given: a set of documents and a set of categories.
Task: sort documents by category.
Examples:
- sort news text by topic (POLITICS, SPORT, etc.)
- sort email into SPAM/NON-SPAM
- classify documents by author
Setup
Typical setup:
- identify relevant features of the documents: individual words, n-grams (e.g. bigrams)
- learn a model to classify a document: Naïve Bayes method, maximum entropy, language models
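As a rough illustration of this setup (not part of the original slides), the sketch below extracts word and bigram features and trains a Naïve Bayes classifier with scikit-learn; the documents and labels are invented toy data.

```python
# Minimal sketch: word + bigram features, Naïve Bayes classifier.
# scikit-learn assumed available; documents and labels are toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["election results announced", "striker scores late goal",
        "parliament debates new budget", "team wins championship final"]
labels = ["POLITICS", "SPORT", "POLITICS", "SPORT"]

# Features: individual words and bigrams
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["budget vote in parliament"]))  # -> ['POLITICS']
```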
Supervised vs unsupervised
(cf. the supervised/unsupervised distinction for Word Sense Disambiguation; lecture 6)
Supervised learning:
- training data is labeled
- several methods available (Naïve Bayes, etc.)
Unsupervised learning:
- training data is unlabeled
- document classes have to be discovered
- possible method: clustering
Part 1 Clustering documents
Clustering
Flat/non-hierarchical:
- just sets of related documents; no relationship between clusters
- very efficient algorithms exist, e.g. k-means clustering
Hierarchical:
- related documents grouped in a tree (dendrogram); tree branches indicate similarity (resp. distance)
- less efficient than non-hierarchical clustering: n documents need n × n similarity computations
- but more informative
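A minimal sketch of flat clustering with k-means over tf-idf document vectors, assuming scikit-learn and using invented toy documents:

```python
# Flat (non-hierarchical) clustering with k-means over tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stocks fall sharply", "market rallies on earnings",
        "new vaccine trial begins", "hospital reports trial results"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster index assigned to each document
```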
Soft vs hard clusters
Hard clustering: each document belongs to exactly one class; hierarchical methods are usually hard.
Soft clustering: allows degrees of membership, e.g. p(c1|d1) > p(c2|d1), i.e. d1 belongs to c1 to a greater degree than to c2.
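One way to illustrate soft membership (not from the lecture) is a Gaussian mixture over low-dimensional document vectors, which yields a probability p(c|d) for every cluster rather than a single label; the pipeline below is an assumed, illustrative choice.

```python
# Soft clustering sketch: each document receives a membership probability
# p(c|d) for each cluster, rather than a single hard assignment.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

docs = ["goal scored in final minute", "late winner decides the match",
        "prime minister calls election", "voters head to the polls"]

X = TfidfVectorizer().fit_transform(docs)
X_dense = TruncatedSVD(n_components=2).fit_transform(X)  # GMM needs dense, low-dim input

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_dense)
print(np.round(gmm.predict_proba(X_dense), 3))  # row i gives p(c|d_i) for each cluster c
```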
Similarity & monotonicity
All hierarchical methods require a similarity metric:
- similarity is computed between individual documents and between clusters
- a vector-space representation of documents with cosine similarity is a common technique
The similarity metric needs to be monotonic:
for all clusters c, c', c'': min(sim(c, c'), sim(c, c'')) ≥ sim(c, c' ∪ c'')
i.e. we expect merging not to increase similarity; otherwise, when we merge two clusters, their similarity to a third cluster might change.
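A small sketch of the document-level similarity computation, assuming scikit-learn, tf-idf vectors and cosine similarity over toy documents:

```python
# Cosine similarity between document vectors: the basic similarity metric
# used by the hierarchical methods below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a cat on a mat", "stock prices fell today"]
X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))  # pairwise document-document similarities
```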
Agglomerative clustering algorithm
Given:
- D = {d1, …, dn} (the documents)
- a similarity metric
Initialise clusters C = {c1, …, cn}, one for each of {d1, …, dn}
1. j := n + 1
2. do until |C| = 1:
   a. find the most similar pair (c, c') in C
   b. create a new cluster cj = c ∪ c'
   c. remove c, c' from C
   d. add cj to C
   e. j := j + 1
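A direct, unoptimised Python rendering of the pseudocode above, assuming a precomputed document-document similarity matrix and using single link as the cluster similarity; function and variable names are illustrative.

```python
# Agglomerative clustering over a precomputed similarity matrix.
# Cluster similarity here is single link (most similar pair of members).
from itertools import combinations

def agglomerative(sim_matrix):
    """sim_matrix[i][j] = similarity between documents i and j."""
    n = len(sim_matrix)
    clusters = {i: frozenset([i]) for i in range(n)}   # C = {c1, ..., cn}
    merges = []
    j = n  # id of the next cluster to be created

    def cluster_sim(a, b):
        return max(sim_matrix[x][y] for x in a for y in b)

    while len(clusters) > 1:
        # find the most similar pair (c, c') in C
        ka, kb = max(combinations(clusters, 2),
                     key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(ka) | clusters.pop(kb)   # cj = c ∪ c'
        clusters[j] = merged
        merges.append((ka, kb, j))
        j += 1
    return merges  # the sequence of merges defines the dendrogram
```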
Agglomerative clustering - walkthrough
(The original slides illustrate each step with a diagram over documents D1-D5.)
1. Start with a separate cluster for each document: D1, D2, D3, D4, D5.
2. D1 and D2 are most similar: merge them into {D1, D2}.
3. D4 and D5 are most similar: merge them into {D4, D5}.
4. D3 and {D4, D5} are most similar: merge them into {D3, D4, D5}.
5. Final step: merge the last two clusters, {D1, D2} and {D3, D4, D5}.
Merging: single link strategy
Similarity of two clusters = similarity of the two most similar members.
Pro: good local coherence (high pairwise similarity).
Con: elongated clusters (bad global coherence).
Merging: complete link strategy
Similarity of two clusters = similarity of the two most dissimilar members.
Pro: better global coherence.
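In practice the merge strategy is usually a single parameter of a library routine; a sketch with SciPy's hierarchical clustering (scipy assumed available), where the toy 2-D points stand in for the document vectors D1-D5 of the walkthrough:

```python
# Hierarchical clustering with SciPy; the merge strategy is the `method` argument.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy 2-D points standing in for document vectors D1-D5
points = np.array([[0.0, 0.0], [0.1, 0.0],    # D1, D2 (very close)
                   [1.0, 1.4],                # D3
                   [1.0, 1.0], [1.1, 1.0]])   # D4, D5 (very close)

Z = linkage(points, method="single")  # also: "complete", "average"
print(Z)  # each row records one merge: cluster ids, distance, new cluster size
# dendrogram(Z)  # with matplotlib available, this draws the resulting tree
```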
Group average strategy
Similarity of two clusters = average pairwise similarity.
A compromise between local and global coherence.
When using a vector-space representation with cosine similarity, the average similarity of a cluster C = {C1, C2} can be computed from the average similarity of its children C1 and C2: much more efficient than computing the average pairwise similarity over all document pairs in C1 × C2!
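A worked check of this shortcut (assumptions: unit-length document vectors and cosine similarity; the data is random toy data): the average pairwise similarity of the merged cluster can be recovered from the sum vectors of the two children alone.

```python
# Group-average shortcut: for length-normalised vectors with cosine similarity,
# the average pairwise similarity of a merged cluster follows from the
# children's sum vectors, without enumerating all document pairs.
import numpy as np

def avg_pairwise_sim(sum_vec, n):
    """Average cosine similarity over all pairs of distinct documents,
    given the sum of their unit vectors."""
    return (sum_vec @ sum_vec - n) / (n * (n - 1))

rng = np.random.default_rng(0)
docs = rng.random((6, 4))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length vectors

c1, c2 = docs[:3], docs[3:]
s1, s2 = c1.sum(axis=0), c2.sum(axis=0)

# Merged cluster: only the children's sum vectors are needed
fast = avg_pairwise_sim(s1 + s2, 6)

# Brute force over all pairs, for comparison
sims = docs @ docs.T
slow = (sims.sum() - 6) / (6 * 5)

print(np.isclose(fast, slow))  # True
```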
Divisive clustering
A kind of top-down hierarchical clustering; also a greedy algorithm.
1. Start with a single cluster representing all documents.
2. Iteratively divide clusters: split the cluster which is least coherent (the cluster whose elements are least similar to each other).
To split a cluster C into {C1, C2}, one can run agglomerative clustering over the elements of C; the method is therefore computationally more expensive than the pure agglomerative method.
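A sketch of the top-down idea: repeatedly split the least coherent cluster. Instead of running agglomerative clustering inside each split, this version uses 2-means as the splitting routine (a common variant, often called bisecting k-means); the data and names are illustrative.

```python
# Divisive (top-down) clustering sketch: repeatedly split the least coherent
# cluster in two, here using 2-means as the splitting routine.
import numpy as np
from sklearn.cluster import KMeans

def coherence(cluster):
    """Average similarity of members to the cluster centroid (higher = tighter)."""
    centroid = cluster.mean(axis=0)
    return float(np.mean(cluster @ centroid))

def divisive(docs, n_clusters):
    clusters = [docs]                      # start with one cluster of everything
    while len(clusters) < n_clusters:
        # pick the least coherent cluster that can still be split
        worst = min((i for i in range(len(clusters)) if len(clusters[i]) > 1),
                    key=lambda i: coherence(clusters[i]))
        to_split = clusters.pop(worst)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(to_split)
        clusters += [to_split[labels == 0], to_split[labels == 1]]
    return clusters

rng = np.random.default_rng(0)
data = rng.random((8, 3))
print([len(c) for c in divisive(data, 3)])  # sizes of the 3 resulting clusters
```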