Understanding Naive Bayes Classifiers and Bayes' Theorem

Naive Bayes classifiers, based on Bayes' rule, are simple classification methods that make the "naive" assumption of attribute independence. Despite this assumption, Bayesian methods can still be effective. Bayes' theorem is used for classification by combining prior knowledge with observed data, yielding probabilistic hypotheses and probability distributions over classes. The presentation covers properties of the Bayes classifier, the naive Bayes independence assumption, and parameter estimation.



Presentation Transcript


  1. Naïve Bayes Classifiers. Simple (naïve) classification methods based on Bayes' rule. Yuzhen Ye (Fall 2021)

  2. Bayes' Theorem

    P(Y | X) = P(X | Y) P(Y) / P(X)

  3. Example of Bayes' Theorem. Given: A doctor knows that meningitis causes a stiff neck 50% of the time. The prior probability of any patient having meningitis is 1/50,000. The prior probability of any patient having a stiff neck is 1/20. If a patient has a stiff neck, what is the probability that he/she has meningitis?

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
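As a quick sanity check of this arithmetic, here is a minimal Python sketch (the variable names are illustrative, not from the slides):

    # Bayes' theorem for the meningitis example
    p_s_given_m = 0.5        # P(stiff neck | meningitis)
    p_m = 1 / 50000          # prior P(meningitis)
    p_s = 1 / 20             # prior P(stiff neck)
    p_m_given_s = p_s_given_m * p_m / p_s
    print(p_m_given_s)       # 0.0002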

  4. Using Bayes' Theorem for Classification. Naïve Bayes classification: "naïve" refers to the (naïve) assumption that data attributes are independent. The Bayesian method can still be optimal even when this attribute-independence assumption is violated (Domingos, P., and M. Pazzani. 1997).

  5. Properties of the Bayes Classifier. Combines prior knowledge and observed data: the prior probability of a hypothesis is multiplied by the probability of the observed training data given that hypothesis. Probabilistic hypothesis: outputs not only a classification, but a probability distribution over all classes.

  6. Bayes classifiers. Assumption: the training set consists of instances of different classes c_j ∈ C, each described as a conjunction of attribute values. Task: classify a new instance d, described by a tuple of attribute values (x_1, x_2, ..., x_n), into one of the classes c_j ∈ C. Key idea: assign the most probable class using Bayes' theorem:

    c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, ..., x_n)
          = argmax_{c_j ∈ C} P(x_1, x_2, ..., x_n | c_j) P(c_j) / P(x_1, x_2, ..., x_n)
          = argmax_{c_j ∈ C} P(x_1, x_2, ..., x_n | c_j) P(c_j)

  7. Parameter estimation. Use the frequencies in the data (maximum likelihood estimation, MLE). P(c_j) can be estimated from the frequency of each class in the training examples. P(x_1, x_2, ..., x_n | c_j) has O(|X|^n · |C|) parameters and would require a very large number of training examples to estimate directly. Independence assumption: attribute values are conditionally independent given the target value; this is the "naïve" in naïve Bayes:

    P(x_1, x_2, ..., x_n | c_j) = Π_i P(x_i | c_j)

    c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(x_i | c_j)

The assumption greatly reduces the number of parameters and mitigates data sparseness.
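To make the frequency-based estimation concrete, here is a minimal Python sketch of MLE parameter estimation under the independence assumption (the data layout and function names are assumptions for illustration, not from the slides):

    from collections import Counter, defaultdict

    def estimate_parameters(examples):
        """Estimate P(c) and P(x_i | c) by frequency counts (MLE).
        `examples` is a list of (attribute_tuple, class_label) pairs."""
        class_counts = Counter(label for _, label in examples)
        total = len(examples)
        priors = {c: n / total for c, n in class_counts.items()}

        # cond_counts[c][i][v] = number of class-c examples whose i-th attribute is v
        cond_counts = defaultdict(lambda: defaultdict(Counter))
        for attrs, label in examples:
            for i, v in enumerate(attrs):
                cond_counts[label][i][v] += 1

        def likelihood(i, v, c):
            """Estimated P(x_i = v | c)."""
            return cond_counts[c][i][v] / class_counts[c]

        return priors, likelihood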

  8. Bayes classification. An unseen instance is classified by computing the class that maximizes the posterior:

    c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, ..., x_n)

When the conditional-independence assumption is satisfied, naïve Bayes corresponds to MAP classification:

    c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(x_i | c_j)

  9. Example: play tennis data

    Day    Outlook   Temperature  Humidity  Wind    PlayTennis
    Day1   Sunny     Hot          High      Weak    No
    Day2   Sunny     Hot          High      Strong  No
    Day3   Overcast  Hot          High      Weak    Yes
    Day4   Rain      Mild         High      Weak    Yes
    Day5   Rain      Cool         Normal    Weak    Yes
    Day6   Rain      Cool         Normal    Strong  No
    Day7   Overcast  Cool         Normal    Strong  Yes
    Day8   Sunny     Mild         High      Weak    No
    Day9   Sunny     Cool         Normal    Weak    Yes
    Day10  Rain      Mild         Normal    Weak    Yes
    Day11  Sunny     Mild         Normal    Strong  Yes
    Day12  Overcast  Mild         High      Strong  Yes
    Day13  Overcast  Hot          Normal    Weak    Yes
    Day14  Rain      Mild         High      Strong  No

Question: For the day <Sunny, Cool, High, Strong>, what is the play prediction?

  10. Naive Bayes solution. Classify any new instance x = (x_1, ..., x_n) as:

    c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(x_i | c_j)

To do this based on training examples, we need to estimate the parameters P(c_j) and P(x_i | c_j) from the training examples.

  11. Based on the examples in the table, classify the following instance x: x = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong). That is: play tennis or not? Working:

    P(PlayTennis = yes) = 9/14 ≈ 0.64
    P(PlayTennis = no)  = 5/14 ≈ 0.36
    P(Wind = Strong | PlayTennis = yes) = 3/9 ≈ 0.33
    P(Wind = Strong | PlayTennis = no)  = 3/5 = 0.60
    etc.

    P(yes) P(Sunny|yes) P(Cool|yes) P(High|yes) P(Strong|yes) ≈ 0.0053
    P(no)  P(Sunny|no)  P(Cool|no)  P(High|no)  P(Strong|no)  ≈ 0.0206

Answer: PlayTennis(x) = no.
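The numbers above can be reproduced with a short Python sketch of the naïve Bayes computation on the play-tennis table (variable names are illustrative):

    from collections import Counter

    # Each row: (Outlook, Temperature, Humidity, Wind, PlayTennis)
    data = [
        ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
    ]

    class_counts = Counter(row[-1] for row in data)
    x = ("Sunny", "Cool", "High", "Strong")

    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                    # prior P(c)
        for i, v in enumerate(x):                  # likelihoods P(x_i | c)
            n_iv = sum(1 for row in data if row[-1] == c and row[i] == v)
            score *= n_iv / n_c
        scores[c] = score

    print(scores)   # {'No': ~0.0206, 'Yes': ~0.0053} -> predict No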

  12. Underflow prevention. Multiplying many probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. The class with the highest final unnormalized log-probability score is still the most probable:

    c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_i log P(x_i | c_j) ]
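A minimal sketch of log-space scoring, reusing the "yes" probabilities from the play-tennis example (the helper name is an assumption):

    import math

    def log_score(prior, likelihoods):
        """Sum log-probabilities instead of multiplying probabilities."""
        return math.log(prior) + sum(math.log(p) for p in likelihoods)

    # e.g. the 'Yes' class from the play-tennis example:
    print(log_score(9/14, [2/9, 3/9, 3/9, 3/9]))   # ~ -5.24, i.e. log(0.0053)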

  13. Naïve Bayes for interaction site prediction. Figure 7: Naïve Bayes classifier with model parameters in the form of CPTs (conditional probability tables). Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR (2007) A Primer on Learning in Bayesian Networks for Computational Biology. PLoS Comput Biol 3(8): e129. doi:10.1371/journal.pcbi.0030129 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0030129

  14. Naïve Bayes for document classification. Relies on a very simple representation of the document: a bag of words. Example document d: "I like this movie. It is sweet, and makes me laugh. I will definitely recommend it. I'd like to watch it one more time!" The task is to map the document to a class c ∈ C, where C = {like-it, dislike-it}. Use all the words, or a subset of the words.

  15. Naïve Bayes for document classification. Relies on a very simple representation of the document: a bag of words, i.e., a table of word counts:

    word        count
    like        2
    laugh       1
    recommend   1
    ...         ...
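A minimal Python sketch of building such a bag-of-words count table from the example review (the whitespace tokenization and punctuation stripping here are simplistic assumptions):

    from collections import Counter

    doc = ("I like this movie. It is sweet, and makes me laugh. "
           "I will definitely recommend it. I'd like to watch it one more time!")
    # Lowercase each whitespace-separated token and strip trailing punctuation
    bag = Counter(word.strip(".,!").lower() for word in doc.split())
    print(bag["like"], bag["laugh"], bag["recommend"])   # 2 1 1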

  16. Naïve Bayes (summary). Robust to isolated noise points. Handles missing values by ignoring the instance during probability-estimate calculations. Robust to irrelevant attributes. The independence assumption may not hold for some attributes; in such cases, use other techniques such as Bayesian belief networks (BBN).
