Foundations of Probabilistic Models for Classification in Machine Learning
This content delves into the principles and applications of probabilistic models for binary classification problems, focusing on algorithms and machine learning concepts. It covers topics such as generative models, conditional probabilities, Gaussian distributions, and logistic functions in the context of machine learning. The discussion includes insights on decision-making processes based on conditional probabilities and Bayes' rule. The content also explores the use of generative processes for data generation and the formulation of class boundaries in classification models.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Probabilistic Models for Classification 1 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Binary Classification Problem N iid training samples: {??,??} Class label: ?? {0,1} Feature vector: ? ?? Focus on modeling conditional probabilities ?(?|?) Needs to be followed by a decision step 2 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Generative models for classification Model joint probability ? ?,? = ? ? ?(?|?) Class posterior probabilities via Bayes rule ? ? ? ?(?,?) Prior probability of a class: ?(? = ?) Class conditional probabilities: ?(? = ?|? = ?) 3 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Generative Process for Data Enables generation of new data points Repeat N times Sample class ?? ? ? Sample feature value ?? ?(?|??) 4 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Conditional Probability in a Generative Model ? ? = 1 ? ? ? = 1 ?(?|? = 1) = ? ? = 1 ?(?|? = 1) + ? ? = 0 ?(?|? = 0) 1 1 + exp{ ?} ?(?) = where ? = ln(? ?=1 ?(?|?=1) ? ?=0 ?(?|?=0)) Logistic function ?() Independent of specific form of class conditional probabilities 5 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Case: Binary classification with Gaussians Prior class probability ? ??? ? ? ?;? = ??1 ?1 ? Gaussian class densities ? ? ? = ? = ? ??, 2??/2 1/2exp{ ? ?? 1 ? 1(? ??)} = Parameters = ?,?0,?1, Note: Covariance parameter is shared 6 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Case: Binary classification with Gaussians ? ? = 1 ? = ? ??? + ?0 Where ? = 1?1 ?0 ? 1?1+1 ?0= 1 ? ? 1?0+ log 2?11 2?0 1 ? Quadratic term cancels out Linear classification model Class boundary ??? + ?0= 0 7 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Special Cases = ?;? = 1 ? = 0.5 Class boundary: ? =1 2(?0+ ?1) = ?;? 1 ? Class boundary shifts by log ? 1 ? Arbitrary Decision boundary still linear but not orthogonal to the hyper-plane joining the two means Image from Michael Jordan s book 8 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
MLE for Binary Gaussian Formulate loglikelihood in terms of parameters ? = log? ???(??|??) ? = ??log? + 1 ?? log(1 ?) ? +??log?(??|?1, ) + (1 ??)log?(??|?0, ) Maximize loglikelihood wrt parameters ?? ??1 ?? ??= 0 ???= ??? = 0 ?1??= ????? ??? ? ??= ? 9 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Case: Gaussian Multi-class Classification ? 1,2, ,? Prior ? ? = ? = ??;?? 0, ???= 0 Class conditional densities ? ? ? = ? = ?(??, ) exp{??} ?exp{??} ? ? = ? ? = where ??= log? ? = ? ?(?|? = ?) Soft-max / normalized exponential function For Gaussian class conditionals ??= ?? The decision boundaries are still lines in the feature space ? 1? 1 ? 1?? ?+ log?? 2?? 10 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
MLE for Gaussian Multi-class Similar to the Binary case 11 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Case: Nave Bayes Similar to Gaussian setting, only features are discrete (binary, for simplicity) Na ve Assumption: Feature dimensions ?? conditionally independent given class label Very different from independence assumption 12 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Case: Nave Bayes Class conditional probability ? ? ??1 ??? 1 ?? ? ? ? = ?;? = ? ??? = ?;? = ??? ?=1 ?=1 Posterior probability exp{??} ?exp{??} ? ? = ? ? = Where ??= log??+ ?[??log???+ 1 ?? log1 ???] 13 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
MLE for Nave Bayes Formulate loglikelihood in terms of parameters = ?,? ? = ???[???log???+ (1 ???log(1 ???))] + ???log?? ? ? ? ? ? Maximize likelihood wrt parameters ??= argmaxl ?.?. ??= 1 ? ?????= ??????? ????= ???? ???? ? MLE overfits Susceptible to 0 frequencies in training data 14 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Bayesian Estimation for Nave Bayes Model the parameters as random variables and analyze posterior distributions Take point estimates if necessary ? ???? ?,? ??? ??? ???? ??,?? ????+ ? 1 ? + ? + ? 2 ???????+ ?? 1 ????+ ??+ ?? 2 ????= ??????= 15 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Discriminative Models for Classification Familiar form for posterior class distribution ?? + ?0} ?? + ?0 exp{?? ?exp ?? ? ? = ? ? = Model posterior distribution directly Advantages as classification model Fewer assumptions, fewer parameters Image from Michael Jordan s book 16 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Logistic Regression for Binary Classification Apply model for binary setting 1 ? ? ? ? = 1 ? = 1 + exp ??? Formulate likelihood with weights as parameters 1 ?? ?? 1 ? ?? ? ? = ? ?? ? ? ? = ???log? + 1 ?? log(1 ?) where ? = 1+exp{ ????} 1 17 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
MLE for Binary Logistic Regression Maximize likelihood wrt weights ??(?) ?? = ??(? ?) No closed form solution 18 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
MLE for Binary Logistic Regression Not quadratic but still convex Iterative optimization using gradient descent (LMS algorithm) Batch gradient update ?(?+1)= ??+ ? ???(?? ?(??)) Stochastic gradient descent update ?(?+1)= ??+ ???(?? ?(??)) Faster algorithm Newton s Method Iterative Re-weighted least squares (IRLS) 19 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Bayesian Binary Logistic Regression Bayesian model exists, but intractable Conjugacy breaks down because of the sigmoid function Laplace approximation for the posterior Major challenge for Bayesian framework 20 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Soft-max regression for Multi-class Classification Left as exercise 21 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Choices for the activation function Probit function: CDF of the Gaussian Complementary log-log model: CDF of exponential 22 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Generative vs Discriminative: Summary Generative models Easy parameter estimation Require more parameters OR simplifying assumptions Models and understands each class Easy to accommodate unlabeled data Poorly calibrated probabilities Discriminative models Complicated estimation problem Fewer parameters and fewer assumptions No understanding of individual classes Difficult to accommodate unlabeled data Better calibrated probabilities 23 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Decision Theory From posterior distributions to actions Loss functions measure extent of error Optimal action depends on loss function Reject option for classification problems 24 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Loss functions 0-1 loss ? ?,? = ? ? ? =0 ?? ? = ? Minimized by MAP estimate (posterior mode) 1 ?? ? ? ?2loss ? ?,? = ? ?2 Expected loss: ?[ ? ?2|?] (Min mean squared error) Minimized by Bayes estimate (posterior mean) ?1loss ? ?,? = ? ? Minimized by posterior median 25 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Evaluation of Binary Classification Models Consider class conditional distribution ?(?|?) Decision rule: ? = 1 ?? ?(?|?) > ? Confusion Matrix ???? ? 1 0 ?+ ? ? ???????? ? 1 ?? ?? 0 ?? ?? ?+ ? 26 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
ROC curves ? = 1 ? = 0 ??/? ? = 1 ??/?+ = ???,???????????,?????? = ???,???? ? ????? ? = 0 ??/?+ ??/? = ???,???? ?? ????? = ???,??????????? 27 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
ROC curves Plot TPR and FPR for different values of decision threshold Quality of classifier measured by area under the curve (AUC) Image from wikipedia 28 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Precision-recall curves In settings such as information retrieval, ? ?+ Precision = ?? ?+ Recall = ?? ?+ Plot precision vs recall for varying values of threshold Quality of classifier measured by area under the curve (AUC) or by specific values e.g. P@k Image from scikit-learn 29 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
F1-scores To evaluate at a single threshold, need to combine precision and recall 2?? ?+? 1+?2?.? ?2?+? when P and R and not equally important ?1 = ??= Harmonic mean Why? 30 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Estimating generalization error Training set performance is not a good indicator of generalization error A more complex model overfits, a less complex one underfits Which model do I select? Validation set Typically 80%, 20% Wastes valuable labeled data Cross validation Split training data into K folds For ith iteration, train on K/i folds, test on ith fold Average generalization error over all folds Leave one out cross validation: K=N 31 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Summary Generative models Gaussian Discriminant Analysis Na ve Bayes Discriminative models Logistics regression Iterative algorithms for training Binary vs Multiclass Evaluation of classification models Generalization performance Cross validation 32 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya