Machine Learning Algorithms and Models Overview

This class summary covers topics such as supervised learning, unsupervised learning, classification, clustering, regression, k-NN models, linear regression, Naive Bayes, logistic regression, and SVM formulations. The content provides insights into key concepts, algorithms, cost functions, learning approaches, and inferences related to machine learning.


Presentation Transcript


  1. Class Summary Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Thank you all for participating in this class! Please complete the SPOT survey and give us feedback on lectures, topics, homework, exams, office hours, and Piazza.

  3. Machine learning algorithms
  Supervised learning: classification (discrete outputs), regression (continuous outputs)
  Unsupervised learning: clustering (discrete structure), dimensionality reduction (continuous structure)

  4. k-NN (Classification/Regression)
  Model: the training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})$
  Cost function: none
  Learning: do nothing
  Inference: $\hat{y}^{\text{test}} = y^{(k)}$, where $k = \arg\min_i d(x^{\text{test}}, x^{(i)})$
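A minimal NumPy sketch of k-NN inference under these definitions (the function name, the Euclidean distance, and the majority-vote rule are illustrative assumptions, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=1):
    """k-NN has no cost function and no training step: inference is a
    nearest-neighbor search over the stored training set."""
    # Distance from the test point to every training example (Euclidean here)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k nearest training examples
    nearest = np.argsort(dists)[:k]
    # Classification: majority vote over neighbor labels
    # (for regression, return y_train[nearest].mean() instead)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```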

  5. Linear regression (Regression)
  Model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
  Cost function: $J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left(\theta^\top x^{(i)} - y^{(i)}\right)^2$
  Learning:
  1) Gradient descent: repeat $\{\, \theta_j \leftarrow \theta_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left(\theta^\top x^{(i)} - y^{(i)}\right) x_j^{(i)} \,\}$
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
  Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \theta^\top x^{\text{test}}$
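Both learning procedures fit in a few lines of NumPy; this sketch assumes X already contains a leading column of ones for the intercept (function names and hyperparameters are illustrative):

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on J(theta) = (1/2N) * sum (theta^T x - y)^2."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / N   # (1/N) * sum (theta^T x - y) * x
        theta -= alpha * grad
    return theta

def fit_normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^{-1} X^T y, computed via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```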

  6. Naïve Bayes (Classification)
  Model: $h_\theta(x) = p(y \mid x_1, x_2, \ldots, x_n) \propto p(y) \prod_j p(x_j \mid y)$
  Cost function: maximum likelihood estimation, $J(\theta) = \log p(\text{Data} \mid \theta)$; maximum a posteriori estimation, $J(\theta) = \log p(\text{Data} \mid \theta)\, p(\theta)$
  Learning: $\pi_k = \hat{P}(y = c_k)$; for discrete $x_j$, $\theta_{jk} = \hat{P}(x_j \mid y = c_k)$; for continuous $x_j$, fit mean $\mu_{jk}$ and variance $\sigma_{jk}^2$
  Inference: $\hat{y} = \arg\max_k \hat{P}(y = c_k) \prod_j \hat{p}(x_j^{\text{test}} \mid y = c_k)$, with $\hat{p}(x_j \mid y = c_k) = \mathcal{N}(x_j; \mu_{jk}, \sigma_{jk}^2)$ for continuous features
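A sketch of the continuous-feature (Gaussian) case in NumPy, assuming maximum likelihood estimates for the priors and the per-class, per-feature means and variances (function names and the small variance floor are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors pi_k and per-class, per-feature mean/variance."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    stats = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
             for c in classes}
    return classes, priors, stats

def predict_gaussian_nb(x, classes, priors, stats):
    """Return argmax_k [ log P(y=c_k) + sum_j log N(x_j; mu_jk, sigma_jk^2) ]."""
    def log_posterior(c):
        mu, var = stats[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(priors[c]) + log_lik
    return max(classes, key=log_posterior)
```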

  7. Logistic regression (Classification)
  Model: $h_\theta(x) = p(y = 1 \mid x_1, x_2, \ldots, x_n) = \frac{1}{1 + e^{-\theta^\top x}}$
  Cost function: $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Cost}\!\left(h_\theta(x^{(i)}), y^{(i)}\right)$, where $\mathrm{Cost}(h_\theta(x), y) = -\log h_\theta(x)$ if $y = 1$ and $-\log\!\left(1 - h_\theta(x)\right)$ if $y = 0$
  Learning: gradient descent: repeat $\{\, \theta_j \leftarrow \theta_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \,\}$
  Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x^{\text{test}}}}$
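A gradient-descent sketch of these update rules in NumPy, again assuming X carries a bias column (names and step size are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    """Gradient descent on the logistic (cross-entropy) cost."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        h = sigmoid(X @ theta)                 # h_theta(x) for every example
        theta -= alpha * X.T @ (h - y) / N     # same gradient form as linear regression
    return theta

def predict_proba(theta, x_test):
    """P(y = 1 | x_test); threshold at 0.5 for a hard class label."""
    return sigmoid(x_test @ theta)
```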

  8. Hard-margin SVM formulation
  $\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$
  s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$, and $\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$
  Soft-margin SVM formulation
  $\min_{\theta, \xi} \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 + C \sum_{i=1}^{N} \xi^{(i)}$
  s.t. $\theta^\top x^{(i)} \ge 1 - \xi^{(i)}$ if $y^{(i)} = 1$, $\theta^\top x^{(i)} \le -1 + \xi^{(i)}$ if $y^{(i)} = 0$, and $\xi^{(i)} \ge 0$
  (The slide illustrates the margin between the two classes in the $x_1$–$x_2$ plane.)
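One common way to train the soft-margin formulation is subgradient descent on the equivalent hinge-loss objective; the sketch below assumes labels in {0, 1} as on the slide and maps them to ±1 internally (function name and hyperparameters are illustrative):

```python
import numpy as np

def fit_linear_svm(X, y, C=1.0, alpha=0.001, iters=2000):
    """Subgradient descent on (1/2)||theta||^2 + C * sum_i max(0, 1 - s_i * theta^T x_i),
    where s_i = +1 if y_i = 1 and s_i = -1 if y_i = 0."""
    s = np.where(y == 1, 1.0, -1.0)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = s * (X @ theta)
        violated = margins < 1                    # examples inside or beyond the margin
        grad = theta - C * (s[violated, None] * X[violated]).sum(axis=0)
        theta -= alpha * grad
    return theta
```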

  9. SVM with kernels
  Parameters: $\theta = [\theta_0, \theta_1, \theta_2, \ldots, \theta_m]^\top$
  Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$; predict $y = 1$ if $\theta^\top f \ge 0$
  Training (original): $\min_\theta \; C \sum_{i=1}^{N} \left[ y^{(i)} \mathrm{cost}_1\!\left(\theta^\top x^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^\top x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$
  Training (with kernel): $\min_\theta \; C \sum_{i=1}^{N} \left[ y^{(i)} \mathrm{cost}_1\!\left(\theta^\top f^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^\top f^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$

  10. SVM parameters
  $C = \frac{1}{\lambda}$. Large $C$: lower bias, higher variance. Small $C$: higher bias, lower variance.
  $\sigma^2$: large $\sigma^2$ makes the features $f_i$ vary more smoothly (higher bias, lower variance); small $\sigma^2$ makes them vary less smoothly (lower bias, higher variance).
  Slide credit: Andrew Ng
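For reference, scikit-learn's RBF-kernel SVC exposes the same two knobs, except that it parameterizes the kernel with gamma, where gamma = 1 / (2σ²), so a large σ² corresponds to a small gamma (the dataset and parameter values below are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Large C and large gamma (small sigma^2): lower bias, higher variance
flexible = SVC(kernel="rbf", C=100.0, gamma=10.0).fit(X, y)

# Small C and small gamma (large sigma^2): higher bias, lower variance
smooth = SVC(kernel="rbf", C=0.1, gamma=0.1).fit(X, y)

print(flexible.score(X, y), smooth.score(X, y))  # training accuracy of each model
```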

  11. Neural network
  (Diagram: inputs $x_0, x_1, x_2, x_3$ in Layer 1 feed hidden units $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ in Layer 2, which feed the output unit $h_\Theta(x)$ in Layer 3.)
  Slide credit: Andrew Ng

  12. Neural network
  $a_i^{(j)}$ = activation of unit $i$ in layer $j$; $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$.
  $a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right)$
  $a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right)$
  $a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right)$
  $h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right)$
  If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$.
  Slide credit: Andrew Ng

  13. Neural network
  Pre-activation: $z^{(2)} = \left[z_1^{(2)}, z_2^{(2)}, z_3^{(2)}\right]^\top$, with
  $z_1^{(2)} = \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3$ and $a_1^{(2)} = g\!\left(z_1^{(2)}\right)$ (similarly for $z_2^{(2)}, z_3^{(2)}$ and $a_2^{(2)}, a_3^{(2)}$).
  $h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right) = g\!\left(z^{(3)}\right)$
  Slide credit: Andrew Ng

  14. Neural network (vectorized forward propagation)
  $z^{(2)} = \Theta^{(1)} a^{(1)}$, where $a^{(1)} = x$
  $a^{(2)} = g\!\left(z^{(2)}\right)$; add the bias unit $a_0^{(2)} = 1$
  $z^{(3)} = \Theta^{(2)} a^{(2)}$
  $h_\Theta(x) = a^{(3)} = g\!\left(z^{(3)}\right)$
  Slide credit: Andrew Ng
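A NumPy sketch of this vectorized forward pass for the 3-3-1 network drawn on the earlier slides (the weight shapes and the random example input are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation: z^(2) = Theta^(1) a^(1), a^(2) = g(z^(2)),
    z^(3) = Theta^(2) a^(2), h_Theta(x) = g(z^(3))."""
    a1 = np.concatenate(([1.0], x))              # add bias unit x_0 = 1
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # add bias unit a_0^(2) = 1
    z3 = Theta2 @ a2
    return sigmoid(z3)                           # a^(3) = h_Theta(x)

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))   # 3 hidden units, 3 inputs + bias -> shape s2 x (s1 + 1)
Theta2 = rng.normal(size=(1, 4))   # 1 output unit, 3 hidden units + bias
print(forward(np.array([0.5, -1.0, 2.0]), Theta1, Theta2))
```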

  15. Neural network learning its own features
  (Diagram: the hidden-layer activations $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ act as learned features that the output unit combines to produce $h_\Theta(x)$.)
  Slide credit: Andrew Ng

  16. Bias / Variance Trade-off
  (Plot: training error and cross-validation error (loss) versus degree of polynomial.)
  Source: Andrew Ng

  17. Bias / Variance Trade-off
  (Plot: the same training and cross-validation error curves, annotated with the high-bias and high-variance regimes.)

  18. Bias / Variance Trade-off with Regularization
  (Plot: training error and cross-validation error (loss) as the regularization strength varies.)
  Source: Andrew Ng

  19. Bias / Variance Trade-off with Regularization
  (Plot: the same curves annotated with the high-variance and high-bias regimes.)
  Source: Andrew Ng

  20. K-means algorithm
  Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K$
  Repeat {
    Cluster assignment step: for $i = 1$ to $N$, set $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
    Centroid update step: for $k = 1$ to $K$, set $\mu_k$ := average (mean) of the points assigned to cluster $k$
  }
  The cluster assignment step minimizes the distortion $J\!\left(c^{(1)}, \ldots, c^{(N)}, \mu_1, \ldots, \mu_K\right) = \frac{1}{N} \sum_{i=1}^{N} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$ with respect to the assignments $c^{(i)}$; the centroid update step minimizes it with respect to the centroids $\mu_k$.
  Slide credit: Andrew Ng
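A compact NumPy sketch of these two alternating steps (the initialization scheme, iteration count, and empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Alternate the cluster assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)  # random initial centroids
    for _ in range(iters):
        # Cluster assignment step: c_i = index of the centroid closest to x_i
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Centroid update step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):          # keep the old centroid if the cluster is empty
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```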

  21. Expectation Maximization (EM) Algorithm
  Goal: find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
  $\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
  Jensen's inequality: $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$ for concave $f$

  22. Expectation Maximization (EM) Algorithm
  Goal: find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$, using the lower bound $\sum_i \log p(x^{(i)}; \theta) \ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
  - The lower bound holds for every possible set of distributions $Q_i$
  - We want a tight lower bound: $f(\mathbb{E}[X]) = \mathbb{E}[f(X)]$
  - When will that happen? When $X = \mathbb{E}[X]$ with probability 1 ($X$ is a constant), i.e. $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$

  23. How should we choose $Q_i(z^{(i)})$?
  Take $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$ with $\sum_z Q_i(z) = 1$ (because it is a distribution), so
  $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$

  24. EM algorithm
  Repeat until convergence {
    (E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
    (M-step) Set $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
  }
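As a concrete instance, here is a sketch of EM for a one-dimensional Gaussian mixture, where the E-step responsibilities play the role of $Q_i(z^{(i)})$ and the M-step has a closed form (the component count, initialization, and iteration budget are illustrative):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K=2, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = rng.choice(x, K, replace=False)          # initial means drawn from the data
    var = np.full(K, x.var())                     # initial variances
    for _ in range(iters):
        # E-step: q[i, k] = p(z_i = k | x_i; theta)
        q = pi * norm.pdf(x[:, None], mu, np.sqrt(var))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        Nk = q.sum(axis=0)
        pi = Nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / Nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```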

  25. Anomaly detection algorithm
  1. Choose features $x_j$ that you think might be indicative of anomalous examples
  2. Fit parameters $\mu_1, \ldots, \mu_n$ and $\sigma_1^2, \ldots, \sigma_n^2$: $\mu_j = \frac{1}{N} \sum_{i=1}^{N} x_j^{(i)}$, $\sigma_j^2 = \frac{1}{N} \sum_{i=1}^{N} \left(x_j^{(i)} - \mu_j\right)^2$
  3. Given a new example $x$, compute $p(x) = \prod_{j=1}^{n} \mathcal{N}\!\left(x_j; \mu_j, \sigma_j^2\right)$; flag an anomaly if $p(x) < \varepsilon$
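The three steps above translate directly into a few lines of NumPy (the threshold value and function names are illustrative; in practice ε is typically chosen using a labeled validation set):

```python
import numpy as np

def fit_anomaly_model(X):
    """Step 2: fit per-feature Gaussian parameters mu_j and sigma_j^2."""
    return X.mean(axis=0), X.var(axis=0)

def is_anomaly(x, mu, var, epsilon=1e-3):
    """Step 3: p(x) = prod_j N(x_j; mu_j, sigma_j^2); anomaly if p(x) < epsilon."""
    p = np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
    return p < epsilon
```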

  26. Problem motivation
  Movie                 | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
  Love at last          | 5         | 5       | 0         | 0        | 0.9          | 0
  Romance forever       | 5         | ?       | ?         | 0        | 1.0          | 0.01
  Cute puppies of love  | ?         | 4       | 0         | ?        | 0.99         | 0
  Nonstop car chases    | 0         | 0       | 5         | 4        | 0.1          | 1.0
  Swords vs. karate     | 0         | 0       | 5         | ?        | 0            | 0.9

  27. Problem motivation
  The same ratings table, but now the movie features are unknown:
  Movie                 | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
  Love at last          | 5         | 5       | 0         | 0        | ?            | ?
  Romance forever       | 5         | ?       | ?         | 0        | ?            | ?
  Cute puppies of love  | ?         | 4       | 0         | ?        | ?            | ?
  Nonstop car chases    | 0         | 0       | 5         | 4        | ?            | ?
  Swords vs. karate     | 0         | 0       | 5         | ?        | ?            | ?
  Given the user parameter vectors $\theta^{(1)} = [0, 5, 0]^\top$, $\theta^{(2)} = [0, 5, 0]^\top$, $\theta^{(3)} = [0, 0, 5]^\top$, $\theta^{(4)} = [0, 0, 5]^\top$, infer the feature vector $x^{(1)}$ for the first movie.

  28. Collaborative filtering optimization objective
  Given $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$, estimate $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$:
  $\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$
  Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, estimate $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$:
  $\min_{x^{(1)}, \ldots, x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$
  Minimize over $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ simultaneously:
  $J = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$

  29. Collaborative filtering algorithm
  Initialize $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values
  Minimize $J\!\left(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}\right)$ using gradient descent (or an advanced optimization algorithm). For every $j = 1, \ldots, n_u$ and $i = 1, \ldots, n_m$:
  $x_k^{(i)} \leftarrow x_k^{(i)} - \alpha \left( \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$
  $\theta_k^{(j)} \leftarrow \theta_k^{(j)} - \alpha \left( \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$
  For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$
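A vectorized NumPy sketch of this procedure, where Y holds the ratings and R masks the entries that actually exist (matrix shapes, learning rate, and initialization scale are illustrative):

```python
import numpy as np

def collaborative_filtering(Y, R, n_features=2, alpha=0.01, lam=0.1, iters=500, seed=0):
    """Joint gradient descent on movie features X and user parameters Theta.
    Y: (n_movies, n_users) ratings; R[i, j] = 1 where movie i was rated by user j."""
    rng = np.random.default_rng(seed)
    n_movies, n_users = Y.shape
    X = 0.1 * rng.normal(size=(n_movies, n_features))      # small random init
    Theta = 0.1 * rng.normal(size=(n_users, n_features))
    for _ in range(iters):
        E = (X @ Theta.T - Y) * R                # prediction error, only where rated
        X_grad = E @ Theta + lam * X
        Theta_grad = E.T @ X + lam * Theta
        X -= alpha * X_grad                      # update movie features
        Theta -= alpha * Theta_grad              # update user parameters
    return X, Theta                              # predicted rating: Theta[j] @ X[i]
```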

  30. Semi-supervised Learning Problem Formulation
  Labeled data: $\mathcal{D}_L = \left\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m_L)}, y^{(m_L)}) \right\}$
  Unlabeled data: $\mathcal{D}_U = \left\{ x^{(1)}, x^{(2)}, \ldots, x^{(m_U)} \right\}$
  Goal: learn a hypothesis $h$ (e.g., a classifier) that has small error

  31. Deep Semi-supervised Learning

  32. Ensemble methods
  Ensemble methods combine multiple classifiers to make a better one: committees, majority vote, or weighted combinations; the base classifiers can be of the same or different types.
  Boosting: train predictors sequentially, with later predictors focusing on the mistakes of earlier ones.
  Boosting for classification (e.g., AdaBoost): use the results of earlier classifiers to decide what to work on next, weighting hard examples more heavily so the next classifier focuses on them. Example: Viola-Jones for face detection.
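As a concrete boosting example, scikit-learn's AdaBoostClassifier trains shallow decision trees (stumps by default) sequentially and re-weights the examples that earlier trees misclassified (the synthetic dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 sequentially trained weak learners; hard examples get larger weights
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # accuracy of the boosted ensemble on held-out data
```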

  33. Generative models

  34. Simple Recurrent Network

  35. Reinforcement learning Markov decision process Q-learning Policy gradient

  36. Final exam sample questions

  37. Conceptual questions
  [True/False] Increasing the value of k in a k-nearest neighbor classifier will decrease its bias
  [True/False] Backpropagation helps neural network training get unstuck from a local minimum
  [True/False] Linear regression can be solved by either matrix algebra or gradient descent
  [True/False] Logistic regression can be solved by either matrix algebra or gradient descent
  [True/False] K-means clustering has a unique solution
  [True/False] PCA has a unique solution

  38. Classification/Regression
  Given a simple dataset:
  1) Estimate the parameters
  2) Compute the training error
  3) Compute the leave-one-out cross-validation error
  4) Compute the testing error

  39. Naïve Bayes
  Compute the individual probabilities, then compute $P(y = 1 \mid x_1 = 0, x_2 = 1, x_3 = 1)$ using a Naïve Bayes classifier
