Text Classification and Naïve Bayes: The Power of Categorizing Documents
Text classification, also known as text categorization, involves assigning predefined categories to free-text documents. It plays a crucial role in organizing and extracting insights from vast amounts of unstructured data present in enterprise environments. With the exponential growth of unstructured data, the need for effective text analysis tools becomes increasingly vital for organizations to uncover valuable information hidden within these documents.
Presentation Transcript
Text Classification and Naïve Bayes: The Task of Text Classification
Text Classification. Text classification (text categorization) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important real-world applications [1]. Text classification assigns one or more classes to a document according to its contents [2]. In short, text classification maps documents onto one or more predefined classes (the slide shows a diagram of documents being routed to class-1, class-2, ..., class-n).
[1] www.scholarpedia.org/article/Text_categorization
[2] https://www.meaningcloud.com/.../text-classification/doc/1.1/what-is-text-classification
Need. Text (unstructured data) makes up 80% or more of enterprise and real-world data, and is growing at a rate of 55% to 65% per year. Without tools to analyze this mass of data, organizations leave vast amounts of valuable information untapped. (Source: https://www.datamation.com/big-data/structured-vs-unstructured-data.html)
Authorship Attribution. Given the known writings of a small number of authors A1, ..., An, determine which of them wrote a new document X. A well-known example: who wrote which Federalist Papers? In 1787-8, anonymous essays tried to convince New York to ratify the U.S. Constitution; the authors were Jay, Madison, and Hamilton, and the authorship of 12 of the essays was in dispute. In 1963 the question was settled by Mosteller and Wallace using Bayesian methods. (Portraits on the slide: James Madison, Alexander Hamilton.)
Gender identification: male or female author? Example passages: "By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam..." "Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets..." S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. Gender, Genre, and Writing Style in Formal Written Texts, Text, volume 23, number 3, pp. 321-346.
Sentiment Analysis. The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral. Positive or negative movie review? "unbelievably disappointing" / "Full of zany characters and richly applied satire, and some great plot twists; this is the greatest screwball comedy ever filmed" / "It was pathetic. The worst part about it was the boxing scenes."
What is the subject of this article? (The slide shows a MEDLINE article next to a subject category hierarchy, asking which category the article belongs to: Antagonists, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, ...)
Text Classification covers many tasks: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis.
Text Classification: definition. Input: a document d and a fixed set of classes C = {c1, c2, ..., cJ}. Output: a predicted class c ∈ C.
Classification Methods: Hand-coded rules. Rules based on combinations of words or other features, e.g. spam: black-list-address OR ("dollars" AND "have been selected"). Accuracy can be high if a rule has been carefully refined over time by a subject expert, but building and maintaining these rules is expensive.
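As a rough illustration (not part of the original slides), the snippet below encodes the spam rule mentioned above in Python; the blacklist addresses and the message texts are hypothetical placeholders.

```python
# Hedged sketch: a hand-coded spam rule like the one on the slide.
# The blacklist addresses and example messages are made up.
BLACKLIST = {"offers@spam.example", "win@lottery.example"}

def is_spam(sender: str, body: str) -> bool:
    """Apply the rule: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    return (sender.lower() in BLACKLIST
            or ("dollars" in text and "have been selected" in text))

# Example use
print(is_spam("offers@spam.example", "Hello"))                          # True (blacklisted sender)
print(is_spam("friend@mail.example",
              "You have been selected to receive 1,000,000 dollars"))   # True (phrase rule)
print(is_spam("friend@mail.example", "Lunch tomorrow?"))                # False
```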
Classification Methods: Supervised Machine Learning. Input: a document d, a fixed set of classes C = {c1, c2, ..., cJ}, and a training set of m hand-labeled documents (d1, c1), ..., (dm, cm). Output: a learning method or algorithm that will enable us to learn a classifier γ: d → c. For a test document d, we assign it the class γ(d) = c.
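To make the input/output concrete, here is a minimal, hypothetical sketch of this setup: a training set of (document, class) pairs and a learn() function that returns a classifier gamma. The documents, labels, and the trivial majority-class baseline are invented stand-ins for a real learning algorithm.

```python
# Hedged sketch of the supervised learning setup described on the slide.
# Training data: hand-labeled (document, class) pairs; the examples are made up.
from collections import Counter

training_set = [
    ("win money now, you have been selected", "spam"),
    ("cheap dollars, claim your prize",       "spam"),
    ("limited offer, dollars for you",        "spam"),
    ("meeting moved to 3pm tomorrow",         "ham"),
    ("lunch with the project team",           "ham"),
]

def learn(pairs):
    """Return a classifier gamma: document -> class.
    Here gamma is just a majority-class baseline, standing in for any real
    learning method (Naive Bayes, logistic regression, SVM, ...)."""
    majority = Counter(c for _, c in pairs).most_common(1)[0][0]
    return lambda document: majority

gamma = learn(training_set)
print(gamma("free dollars"))  # 'spam' (the majority class of this toy training set)
```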
Classification Methods: Supervised Machine Learning. Any kind of classifier can be used: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors.
Naïve Bayes Intuition. A simple ("naïve") classification method based on Bayes rule. It relies on a very simple representation of the document: the bag of words.
The bag of words representation. The slide shows a movie review being handed to the classifier, γ(d) = c: "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet."
The bag of words representation: using a subset of words. The slide repeats the review with all but a few words masked out; only the selected words remain visible: love, sweet, satirical, great, fun, whimsical, romantic, laughing, recommend, several, happy, again. Again γ(d) = c.
The bag of words representation. The review reduces to a vector of word counts, γ(d) = c:
great 2
love 2
recommend 1
laugh 1
happy 1
... ...
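The count vector above can be computed directly from the text. The short sketch below (not from the slides) builds a bag of words with collections.Counter, restricted to an illustrative subset of words; the exact counts depend on the full review text.

```python
# Hedged sketch: building bag-of-words counts like those shown on the slide.
from collections import Counter
import re

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun... "
          "I would recommend it to just about anyone. I've seen it several times, "
          "and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.")

subset = {"love", "great", "recommend", "several", "happy", "again"}  # illustrative word subset
tokens = re.findall(r"[a-z']+", review.lower())
bag = Counter(t for t in tokens if t in subset)
print(bag)  # Counter({'love': 1, 'great': 1, 'recommend': 1, 'several': 1, 'happy': 1, 'again': 1})
```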
Bag of words for document classification. The slide shows a test document containing words such as parser, language, label, translation, ... and asks which class it belongs to. Each candidate class has its own characteristic word distribution: Machine Learning (learning, training, algorithm, shrinkage, network, ...), Garbage Collection (garbage, collection, memory, optimization, region, ...), NLP (parser, tag, training, translation, language, ...), Planning (planning, temporal, reasoning, plan, language, ...), GUI (...).
Formalizing the Naïve Bayes Classifier
Bayes Rule Applied to Documents and Classes. For a document d and a class c: P(c|d) = P(d|c) P(c) / P(d).
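As a quick worked illustration with invented numbers (not from the slides): suppose the classes are spam and ham with priors 0.4 and 0.6, and the document likelihoods are P(d|spam) = 0.001 and P(d|ham) = 0.0001. The sketch below applies Bayes rule to obtain the posteriors.

```python
# Hedged sketch: Bayes rule for one document d and classes {spam, ham}.
# All probabilities here are invented for illustration.
prior = {"spam": 0.4, "ham": 0.6}                 # P(c)
likelihood = {"spam": 0.001, "ham": 0.0001}       # P(d | c)

evidence = sum(likelihood[c] * prior[c] for c in prior)               # P(d)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}   # P(c | d)
print(posterior)  # {'spam': 0.869..., 'ham': 0.130...}
```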
Naïve Bayes Classifier (I). c_MAP = argmax_{c ∈ C} P(c|d), where MAP is "maximum a posteriori", i.e. the most likely class. By Bayes rule this equals argmax_{c ∈ C} P(d|c) P(c) / P(d), and dropping the denominator (which is the same for every class) gives c_MAP = argmax_{c ∈ C} P(d|c) P(c).
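A quick numerical check, reusing the invented numbers from the sketch above, that dropping the constant denominator P(d) does not change which class maximizes the score:

```python
# Hedged sketch: the argmax is unchanged when P(d) is dropped.
joint = {"spam": 0.0004, "ham": 0.00006}          # P(d|c) P(c), invented numbers
p_d = sum(joint.values())                          # P(d)
posterior = {c: v / p_d for c, v in joint.items()}
assert max(joint, key=joint.get) == max(posterior, key=posterior.get)  # same winner
print(max(joint, key=joint.get))  # 'spam'
```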
Naïve Bayes Classifier (II). c_MAP = argmax_{c ∈ C} P(d|c) P(c). Representing the document d as features x1, ..., xn gives c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c).
Naïve Bayes Classifier (IV). c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c). The likelihood term P(x1, ..., xn | c) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples were available. The prior P(c) asks how often each class occurs, and for it we can just count the relative frequencies in a corpus.
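Estimating the prior P(c) by relative frequency is simple counting; the sketch below uses a made-up list of training labels.

```python
# Hedged sketch: estimating the prior P(c) by relative frequency in a labeled corpus.
from collections import Counter

labels = ["spam", "ham", "ham", "spam", "ham"]    # invented training labels
counts = Counter(labels)
prior = {c: n / len(labels) for c, n in counts.items()}
print(prior)  # {'spam': 0.4, 'ham': 0.6}
```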
Multinomial Naïve Bayes Independence Assumptions. For P(x1, x2, ..., xn | c) we make two assumptions. Bag of Words assumption: assume position doesn't matter. Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c. Then P(x1, ..., xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · ... · P(xn|c).
Multinomial Naïve Bayes Classifier. c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c), which under the independence assumptions becomes c_NB = argmax_{c ∈ C} P(cj) ∏_{x ∈ X} P(x|c).
Applying Multinomial Naive Bayes Classifiers to Text Classification. Let positions be the set of all word positions in the test document. Then c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi|cj).
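To tie the formulas together, here is a self-contained sketch (not from the original slides) that trains a multinomial Naive Bayes classifier on a tiny toy corpus and scores a test document in log space to avoid underflow. Add-one (Laplace) smoothing is used to avoid zero probabilities; smoothing is a standard choice but is not covered in this part of the slides.

```python
# Hedged sketch of a multinomial Naive Bayes text classifier on a toy corpus.
import math
from collections import Counter, defaultdict

train = [
    ("chinese beijing chinese", "china"),
    ("chinese chinese shanghai", "china"),
    ("chinese macao", "china"),
    ("tokyo japan chinese", "japan"),
]

# Prior P(c): relative frequency of each class in the training set.
class_counts = Counter(c for _, c in train)
prior = {c: n / len(train) for c, n in class_counts.items()}

# Per-class word counts for the likelihoods P(w | c).
word_counts = defaultdict(Counter)
for doc, c in train:
    word_counts[c].update(doc.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_likelihood(word, c):
    """log P(word | c) with add-one smoothing."""
    return math.log((word_counts[c][word] + 1) /
                    (sum(word_counts[c].values()) + len(vocab)))

def classify(doc):
    """c_NB = argmax_c [ log P(c) + sum over word positions i of log P(x_i | c) ]."""
    scores = {c: math.log(prior[c]) +
                 sum(log_likelihood(w, c) for w in doc.split() if w in vocab)
              for c in prior}
    return max(scores, key=scores.get)

print(classify("chinese chinese chinese tokyo japan"))  # 'china' on this toy corpus
```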