Naive Bayes Classifier in Data Science

 
NAÏVE BAYES
 
CSC 576: Data Science
Today…
 
Probability Primer
Naïve Bayes
Bayes’ Rule
Conditional Probabilities
Probabilistic Models
Motivation

In many datasets, the relationship between attributes and a class variable is non-deterministic.
Why?
Noisy data
Confounding and interaction of factors
Relevant variables not included in the data
Scenario: Risk of heart disease based on an individual's diet and workout frequency
 
Scenario

Risk of heart disease based on an individual's diet and workout frequency
Most people who "work out" and have a healthy diet don't get heart disease.
Yet, some healthy individuals still do: smoking, alcohol abuse, …
 
What we're trying to do

Model probabilistic relationships:
"What is the probability that this person will get heart disease, given their diet and workout regimen?"
Output is most similar to Logistic Regression.
Will introduce the naïve Bayes model, a type of Bayesian classifier.
More advanced: Bayesian network

Bayes Classifier

A probabilistic framework for solving classification problems.
Used in both naïve Bayes and Bayesian networks.
Based on Bayes' Theorem:

P(Y | X) = P(X | Y) P(Y) / P(X)

Terminology/Notation Primer

X and Y (two different variables)

Joint probability: P(X=x, Y=y)
The probability that variable X takes on the value x and variable Y has the value y.
Example: "What's the probability that it rains today AND that I'm carrying an umbrella?"

Conditional probability: P( Y=y | X=x )
Probability that variable Y has the value y, given that variable X takes on the value x.
Example: "Given that I'm observed with an umbrella, what's the probability that it will rain today?"

Terminology/Notation Primer

Single Probability: P(X = x)
"X has the value x"
Joint Probability: P(X, Y)
"X and Y"
Conditional Probability: P(Y | X)
"Y" given observation of "X"

Relation of Joint and Conditional Probabilities:
P(X, Y) = P(Y | X) P(X)

Terminology/Notation Primer

P(X, Y) = P(Y | X) P(X)
P(Y, X) = P(X | Y) P(Y)
P(X, Y) = P(Y, X)
So: P(Y | X) P(X) = P(X | Y) P(Y)

Bayes' Theorem:
P(Y | X) = P(X | Y) P(Y) / P(X)

Predicted Probability Example

Scenario:
1. A doctor knows that meningitis causes a stiff neck 50% of the time
2. Prior probability of any patient having meningitis is 1/50,000
3. Prior probability of any patient having a stiff neck is 1/20

If a patient has a stiff neck, what's the probability that they have meningitis?

Predicted Probability Example

If a patient has a stiff neck, what's the probability that they have meningitis?
Interested in: P(M | S)

Apply Bayes' Rule:
P(M | S) = P(S | M) P(M) / P(S)
         = (0.5 x 1/50,000) / (1/20)
         = 0.0002

Very low probability.
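
As a quick sanity check, the same arithmetic in a few lines of Python (a minimal sketch; the variable names are my own):

```python
# Bayes' rule check for the meningitis example; numbers from the slide.
p_s_given_m = 0.5          # P(StiffNeck | Meningitis)
p_m = 1 / 50_000           # prior P(Meningitis)
p_s = 1 / 20               # prior P(StiffNeck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # 0.0002
```
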
How to Apply Bayes' Theorem to Data Mining and Datasets?

Target class: Evade
Predictor variables: Refund, Status, Income

Tid  Refund  Status    Income  Evade
1    Yes     Single    125K    No
2    No      Married   100K    No
3    No      Single    70K     No
4    Yes     Married   120K    No
5    No      Divorced  95K     Yes
6    No      Married   60K     No
7    Yes     Divorced  220K    No
8    No      Single    85K     Yes
9    No      Married   75K     No
10   No      Single    90K     Yes

What is the probability of Evade given the values of Refund, Status, and Income?
P(E | R, S, I)
Above .5? Predict YES, else predict NO.

How to Apply Bayes' Theorem to Data Mining and Datasets?

How to compute P(E | R, S, I)?
Need a test instance: what are the values of R, S, I?
Test instance is:
Refund=Yes
Status=Married
Income=60K
Issue: we don't have any training example with these same three attribute values.

Naïve Bayes Classifier

Why called naïve?
Assumes that attributes (predictor variables) are conditionally independent.
No correlation. Big assumption!
What is conditional independence?
Variable X is conditionally independent of Y, given a third variable Z, if the following holds:

P(X | Y, Z) = P(X | Z)

Conditional Independence

Assuming variables X and Y are conditionally independent given Z, we can derive ("given Z, what is the joint probability of X and Y?"):

P(X, Y | Z) = P(X, Y, Z) / P(Z)
            = [ P(X, Y, Z) / P(Y, Z) ] x [ P(Y, Z) / P(Z) ]
            = P(X | Y, Z) x P(Y | Z)
            = P(X | Z) x P(Y | Z)
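
A small numeric illustration of this factorization (a toy distribution I constructed for this purpose; nothing here comes from the slides):

```python
# Build P(X, Y, Z) so that X and Y are conditionally independent given Z,
# then confirm P(X, Y | Z) = P(X | Z) * P(Y | Z) for every outcome.
p_z = {0: 0.4, 1: 0.6}            # P(Z=z), made-up values
p_x1_z = {0: 0.3, 1: 0.8}         # P(X=1 | Z=z)
p_y1_z = {0: 0.5, 1: 0.1}         # P(Y=1 | Z=z)

def bern(p, v):                   # P(V=v) for a Bernoulli with P(V=1)=p
    return p if v == 1 else 1 - p

for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            joint_xyz = p_z[z] * bern(p_x1_z[z], x) * bern(p_y1_z[z], y)
            lhs = joint_xyz / p_z[z]                       # P(X, Y | Z)
            rhs = bern(p_x1_z[z], x) * bern(p_y1_z[z], y)  # P(X|Z) P(Y|Z)
            assert abs(lhs - rhs) < 1e-12
print("factorization holds for all 8 outcomes")
```
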
Naïve Bayes Classifier

Before (simple Bayes' rule), with a single predictor variable X:

P(Y | X) = P(X | Y) P(Y) / P(X)

Now we have a bunch of predictor variables: X1, X2, X3, …, Xn:

P(Y | X1, X2, …, Xn) = P(X1, X2, …, Xn | Y) P(Y) / P(X1, X2, …, Xn)

With the conditional independence assumption, the numerator factors:

P(Y | X1, X2, …, Xn) = P(Y) P(X1 | Y) P(X2 | Y) … P(Xn | Y) / P(X1, X2, …, Xn)

Naïve Bayes Classifier

For binary problems: P(Y | X) > .5? Predict YES, else predict NO.
Example: will compute P(E=Yes | Status, Income, Refund) and P(E=No | Status, Income, Refund), then find which one is greater (greater likelihood).

Can compute from training data: P(Y) and each P(Xi | Y), e.g., P(E=Yes), P(E=No), P(Refund=No | E=Yes).
Cannot compute / hard to compute: P(X1, X2, …, Xn), e.g., P(Refund=yes, Status=married, Income=120K).
Not a problem, since the two denominators will be the same; we only need to see which numerator is greater.
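
To make the procedure concrete, here is a minimal count-based sketch in Python (my own illustration, not the course's code); it drops the shared denominator exactly as described above:

```python
from collections import Counter, defaultdict

def train_nb(records, target):
    """Count-based estimates of P(Y) and P(Xi | Y); no smoothing."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)      # (attribute, class) -> value counts
    for r in records:
        y = r[target]
        for attr, val in r.items():
            if attr != target:
                value_counts[(attr, y)][val] += 1
    return class_counts, value_counts, len(records)

def numerator(model, x, y):
    """P(y) * prod_i P(x_i | y); the shared denominator P(X) is dropped."""
    class_counts, value_counts, n = model
    p = class_counts[y] / n
    for attr, val in x.items():
        p *= value_counts[(attr, y)][val] / class_counts[y]
    return p

def predict(model, x):
    class_counts = model[0]
    return max(class_counts, key=lambda y: numerator(model, x, y))
```

Note that an unseen attribute value makes a whole numerator 0 here; the slides return to that problem under smoothing.
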
Estimating Prior Probabilities for the Target Class

P(Evade=yes) = 3/10
P(Evade=no) = 7/10

Estimating Conditional Probabilities for Categorical Attributes

P(Refund=yes | Evade=no) = 3/7
P(Status=married | Evade=yes) = 0/3
Yikes! Will handle the 0% probability later.
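
These counts can be reproduced directly from the 10-row table above; a throwaway Python sketch (the tuple encoding is mine):

```python
# (Refund, Status, Income in K, Evade) rows transcribed from the table.
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]
n_no = sum(1 for *_, e in records if e == "No")                    # 7
n_yes = len(records) - n_no                                        # 3
print(n_yes / 10, n_no / 10)                                       # priors: 0.3, 0.7

refund_yes_and_no = sum(1 for r, _, _, e in records if r == "Yes" and e == "No")
print(refund_yes_and_no, "/", n_no)                                # 3 / 7

married_and_yes = sum(1 for _, s, _, e in records if s == "Married" and e == "Yes")
print(married_and_yes, "/", n_yes)                                 # 0 / 3  <- the "Yikes!"
```
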
Estimating Conditional Probabilities for Continuous Attributes

For continuous attributes:
1. Discretize into bins
2. Two-way split: (A <= v) or (A > v)
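
For example, a two-way split of Taxable Income at v = 101K (the threshold the following slides use; the slides do not show how it was chosen) turns the continuous attribute into a categorical one:

```python
# Binarize Taxable Income with the split (A <= v) vs (A > v), v = 101K.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # from the table, in K
v = 101
binned = ["above 101K" if a > v else "below 101K" for a in incomes]
print(binned.count("below 101K"), binned.count("above 101K"))   # 7, 3
```
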
Full Example

Given a Test Record: X = (Refund=No, Status=Married, Income=75K)

P(NO) = 7/10
P(YES) = 3/10

P(Refund=YES|NO) = 3/7
P(Refund=NO|NO) = 4/7
P(Refund=YES|YES) = 0/3
P(Refund=NO|YES) = 3/3

P(Status=SINGLE|NO) = 2/7
P(Status=DIVORCED|NO) = 1/7
P(Status=MARRIED|NO) = 4/7
P(Status=SINGLE|YES) = 2/3
P(Status=DIVORCED|YES) = 1/3
P(Status=MARRIED|YES) = 0/3

For taxable income:
P(Income=above 101K|NO) = 3/7
P(Income=below 101K|NO) = 4/7
P(Income=above 101K|YES) = 0/3
P(Income=below 101K|YES) = 3/3
 
 
P(X|Class=No) = P(Refund=No|Class=No)
                x P(Married|Class=No)
                x P(Income=below 101K|Class=No)
              = 4/7 x 4/7 x 4/7 = 0.1866

P(X|Class=Yes) = P(Refund=No|Class=Yes)
                 x P(Married|Class=Yes)
                 x P(Income=below 101K|Class=Yes)
               = 1 x 0 x 1 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes),
therefore P(No|X) > P(Yes|X)
=> Class = No
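
The same comparison in a few lines of Python, using the slide's estimates:

```python
# Unsmoothed comparison for X = (Refund=No, Married, Income below 101K).
p_no, p_yes = 7/10, 3/10
px_no = (4/7) * (4/7) * (4/7)        # P(X | Class=No)  = 0.1866...
px_yes = (3/3) * (0/3) * (3/3)       # P(X | Class=Yes) = 0.0
print(px_no * p_no, px_yes * p_yes)  # 0.1306... vs 0.0
print("No" if px_no * p_no > px_yes * p_yes else "Yes")   # -> Class = No
```
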
Smoothing of Conditional Probabilities

If one of the conditional probabilities is 0, then the entire product will be 0.
Idea: Instead use very small non-zero values, such as 0.00001.

Original: P(xi | yj) = nc / n
n = # of training examples belonging to class yj
nc = # of examples from class yj that take on the value xi

Smoothing of Conditional Probabilities

Laplace: P(xi | yj) = (nc + 1) / (n + C)
n = # of training examples belonging to class yj
nc = # of examples from class yj that take on the value xi
C = # of classes
 
Full Example w/ Laplace Smoothing

Given a Test Record: X = (Refund=No, Status=Married, Income=75K)

P(NO) = 7/10
P(YES) = 3/10

P(Refund=YES|NO) = 4/9
P(Refund=NO|NO) = 5/9
P(Refund=YES|YES) = 1/5
P(Refund=NO|YES) = 4/5

P(Status=SINGLE|NO) = 3/9
P(Status=DIVORCED|NO) = 2/9
P(Status=MARRIED|NO) = 5/9
P(Status=SINGLE|YES) = 3/5
P(Status=DIVORCED|YES) = 2/5
P(Status=MARRIED|YES) = 1/5

For taxable income:
P(Income=above 101K|NO) = 4/9
P(Income=below 101K|NO) = 5/9
P(Income=above 101K|YES) = 1/5
P(Income=below 101K|YES) = 4/5
 
P(X|Class=No) = P(Refund=No|Class=No)
                x P(Married|Class=No)
                x P(Income=below 101K|Class=No)
              = 5/9 x 5/9 x 5/9 = 0.1715

P(X|Class=Yes) = P(Refund=No|Class=Yes)
                 x P(Married|Class=Yes)
                 x P(Income=below 101K|Class=Yes)
               = 4/5 x 1/5 x 4/5 = 0.128

Is P(X|No)P(No) > P(X|Yes)P(Yes)?
0.1715 x 7/10 > 0.128 x 3/10
Therefore P(No|X) > P(Yes|X)
=> Class = No
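
And the smoothed comparison in Python, mirroring the unsmoothed snippet above:

```python
# Laplace-smoothed comparison, (n_c + 1) / (n + C) with C = 2 classes.
p_no, p_yes = 7/10, 3/10
px_no = (5/9) * (5/9) * (5/9)          # = 0.1715...
px_yes = (4/5) * (1/5) * (4/5)         # = 0.128
print(px_no * p_no, px_yes * p_yes)    # 0.120... vs 0.0384
print("No" if px_no * p_no > px_yes * p_yes else "Yes")   # -> still Class = No
```
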
Characteristics of Naïve Bayes Classifiers

Robust to isolated noise:
Noise is averaged out by estimating the conditional probabilities from data.
Handling missing values:
Simply ignore them when estimating the probabilities.
Robust to irrelevant attributes:
If Xi is an irrelevant attribute, then P(Xi|Y) becomes almost uniformly distributed, e.g.:
P(Refund=Yes|YES) = 0.5
P(Refund=Yes|NO) = 0.5

Characteristics of Naïve Bayes Classifiers

Independence assumption may not hold for some attributes.
Correlated attributes can degrade the performance of naïve Bayes.
But … naïve Bayes (for such a simple model) still works surprisingly well even when there is some correlation between attributes.
 
References

Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
Introduction to Data Mining, 1st edition, Tan et al.
Data Mining and Business Analytics with R, 1st edition, Ledolter
