Machine Learning Algorithms and Models Overview

This class summary covers topics such as supervised learning, unsupervised learning, classification, clustering, regression, k-NN models, linear regression, Naive Bayes, logistic regression, and SVM formulations. The content provides insights into key concepts, algorithms, cost functions, learning approaches, and inferences related to machine learning.


Presentation Transcript


  1. Class Summary Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Thank you all for participating in this class! Please complete the SPOT survey and give us feedback on lectures, topics, homework, exams, office hours, and Piazza.

  3. Machine learning algorithms
  Supervised learning: classification (discrete outputs), regression (continuous outputs)
  Unsupervised learning: clustering (discrete structure), dimensionality reduction (continuous structure)

  4. k-NN (Classification/Regression)
  Model: the training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})$
  Cost function: none
  Learning: do nothing
  Inference: $\hat{y}^{\text{test}} = y^{(k)}$, where $k = \arg\min_i d(x^{\text{test}}, x^{(i)})$
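A minimal NumPy sketch of k-NN inference under these definitions (the function name, the Euclidean distance, and the majority-vote rule are illustrative assumptions, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=1):
    """k-NN has no cost function and no training step: inference is a
    nearest-neighbor search over the stored training set."""
    # Distance from the test point to every training example (Euclidean here)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k nearest training examples
    nearest = np.argsort(dists)[:k]
    # Classification: majority vote over neighbor labels
    # (for regression, return y_train[nearest].mean() instead)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```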

  5. Linear regression (Regression)
  Model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
  Cost function: $J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left(\theta^\top x^{(i)} - y^{(i)}\right)^2$
  Learning:
  1) Gradient descent: repeat $\{\, \theta_j \leftarrow \theta_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left(\theta^\top x^{(i)} - y^{(i)}\right) x_j^{(i)} \,\}$
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
  Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \theta^\top x^{\text{test}}$
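Both learning procedures fit in a few lines of NumPy; this sketch assumes X already contains a leading column of ones for the intercept (function names and hyperparameters are illustrative):

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on J(theta) = (1/2N) * sum (theta^T x - y)^2."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / N   # (1/N) * sum (theta^T x - y) * x
        theta -= alpha * grad
    return theta

def fit_normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^{-1} X^T y, computed via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```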

  6. Naïve Bayes (Classification)
  Model: $h_\theta(x) = p(y \mid x_1, x_2, \ldots, x_n) \propto p(y) \prod_j p(x_j \mid y)$
  Cost function: maximum likelihood estimation, $J(\theta) = \log p(\text{Data} \mid \theta)$; maximum a posteriori estimation, $J(\theta) = \log p(\text{Data} \mid \theta)\, p(\theta)$
  Learning: $\pi_k = \hat{P}(y = c_k)$; for discrete $x_j$, $\theta_{jk} = \hat{P}(x_j \mid y = c_k)$; for continuous $x_j$, fit mean $\mu_{jk}$ and variance $\sigma_{jk}^2$
  Inference: $\hat{y} = \arg\max_k \hat{P}(y = c_k) \prod_j \hat{p}(x_j^{\text{test}} \mid y = c_k)$, with $\hat{p}(x_j \mid y = c_k) = \mathcal{N}(x_j; \mu_{jk}, \sigma_{jk}^2)$ for continuous features
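A sketch of the continuous-feature (Gaussian) case in NumPy, assuming maximum likelihood estimates for the priors and the per-class, per-feature means and variances (function names and the small variance floor are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors pi_k and per-class, per-feature mean/variance."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    stats = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
             for c in classes}
    return classes, priors, stats

def predict_gaussian_nb(x, classes, priors, stats):
    """Return argmax_k [ log P(y=c_k) + sum_j log N(x_j; mu_jk, sigma_jk^2) ]."""
    def log_posterior(c):
        mu, var = stats[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(priors[c]) + log_lik
    return max(classes, key=log_posterior)
```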

  7. Logistic regression (Classification)
  Model: $h_\theta(x) = p(y = 1 \mid x_1, x_2, \ldots, x_n) = \frac{1}{1 + e^{-\theta^\top x}}$
  Cost function: $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Cost}\!\left(h_\theta(x^{(i)}), y^{(i)}\right)$, where $\mathrm{Cost}(h_\theta(x), y) = -\log h_\theta(x)$ if $y = 1$ and $-\log\!\left(1 - h_\theta(x)\right)$ if $y = 0$
  Learning: gradient descent: repeat $\{\, \theta_j \leftarrow \theta_j - \alpha \frac{1}{N} \sum_{i=1}^{N} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \,\}$
  Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x^{\text{test}}}}$
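A gradient-descent sketch of these update rules in NumPy, again assuming X carries a bias column (names and step size are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    """Gradient descent on the logistic (cross-entropy) cost."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        h = sigmoid(X @ theta)                 # h_theta(x) for every example
        theta -= alpha * X.T @ (h - y) / N     # same gradient form as linear regression
    return theta

def predict_proba(theta, x_test):
    """P(y = 1 | x_test); threshold at 0.5 for a hard class label."""
    return sigmoid(x_test @ theta)
```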

  8. Hard-margin SVM formulation
  $\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$
  s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$, and $\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$
  Soft-margin SVM formulation
  $\min_{\theta, \xi} \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 + C \sum_{i=1}^{N} \xi^{(i)}$
  s.t. $\theta^\top x^{(i)} \ge 1 - \xi^{(i)}$ if $y^{(i)} = 1$, $\theta^\top x^{(i)} \le -1 + \xi^{(i)}$ if $y^{(i)} = 0$, and $\xi^{(i)} \ge 0$
  (The slide illustrates the margin between the two classes in the $x_1$–$x_2$ plane.)
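One common way to train the soft-margin formulation is subgradient descent on the equivalent hinge-loss objective; the sketch below assumes labels in {0, 1} as on the slide and maps them to ±1 internally (function name and hyperparameters are illustrative):

```python
import numpy as np

def fit_linear_svm(X, y, C=1.0, alpha=0.001, iters=2000):
    """Subgradient descent on (1/2)||theta||^2 + C * sum_i max(0, 1 - s_i * theta^T x_i),
    where s_i = +1 if y_i = 1 and s_i = -1 if y_i = 0."""
    s = np.where(y == 1, 1.0, -1.0)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = s * (X @ theta)
        violated = margins < 1                    # examples inside or beyond the margin
        grad = theta - C * (s[violated, None] * X[violated]).sum(axis=0)
        theta -= alpha * grad
    return theta
```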

  9. SVM with kernels
  Parameters: $\theta = [\theta_0, \theta_1, \theta_2, \ldots, \theta_m]^\top$
  Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$; predict $y = 1$ if $\theta^\top f \ge 0$
  Training (original): $\min_\theta \; C \sum_{i=1}^{N} \left[ y^{(i)} \mathrm{cost}_1\!\left(\theta^\top x^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^\top x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$
  Training (with kernel): $\min_\theta \; C \sum_{i=1}^{N} \left[ y^{(i)} \mathrm{cost}_1\!\left(\theta^\top f^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^\top f^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$

  10. SVM parameters
  $C = \frac{1}{\lambda}$. Large $C$: lower bias, higher variance. Small $C$: higher bias, lower variance.
  $\sigma^2$: large $\sigma^2$ makes the features $f_i$ vary more smoothly (higher bias, lower variance); small $\sigma^2$ makes them vary less smoothly (lower bias, higher variance).
  Slide credit: Andrew Ng
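For reference, scikit-learn's RBF-kernel SVC exposes the same two knobs, except that it parameterizes the kernel with gamma, where gamma = 1 / (2σ²), so a large σ² corresponds to a small gamma (the dataset and parameter values below are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Large C and large gamma (small sigma^2): lower bias, higher variance
flexible = SVC(kernel="rbf", C=100.0, gamma=10.0).fit(X, y)

# Small C and small gamma (large sigma^2): higher bias, lower variance
smooth = SVC(kernel="rbf", C=0.1, gamma=0.1).fit(X, y)

print(flexible.score(X, y), smooth.score(X, y))  # training accuracy of each model
```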

  11. Neural network
  (Diagram: inputs $x_0, x_1, x_2, x_3$ in Layer 1 feed hidden units $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ in Layer 2, which feed the output unit $h_\Theta(x)$ in Layer 3.)
  Slide credit: Andrew Ng

  12. Neural network
  $a_i^{(j)}$ = activation of unit $i$ in layer $j$; $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$.
  $a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right)$
  $a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right)$
  $a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right)$
  $h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right)$
  If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$.
  Slide credit: Andrew Ng

  13. Neural network
  Pre-activation: $z^{(2)} = \left[z_1^{(2)}, z_2^{(2)}, z_3^{(2)}\right]^\top$, with
  $z_1^{(2)} = \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3$ and $a_1^{(2)} = g\!\left(z_1^{(2)}\right)$ (similarly for $z_2^{(2)}, z_3^{(2)}$ and $a_2^{(2)}, a_3^{(2)}$).
  $h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right) = g\!\left(z^{(3)}\right)$
  Slide credit: Andrew Ng

  14. Neural network (vectorized forward propagation)
  $z^{(2)} = \Theta^{(1)} a^{(1)}$, where $a^{(1)} = x$
  $a^{(2)} = g\!\left(z^{(2)}\right)$; add the bias unit $a_0^{(2)} = 1$
  $z^{(3)} = \Theta^{(2)} a^{(2)}$
  $h_\Theta(x) = a^{(3)} = g\!\left(z^{(3)}\right)$
  Slide credit: Andrew Ng
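A NumPy sketch of this vectorized forward pass for the 3-3-1 network drawn on the earlier slides (the weight shapes and the random example input are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation: z^(2) = Theta^(1) a^(1), a^(2) = g(z^(2)),
    z^(3) = Theta^(2) a^(2), h_Theta(x) = g(z^(3))."""
    a1 = np.concatenate(([1.0], x))              # add bias unit x_0 = 1
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # add bias unit a_0^(2) = 1
    z3 = Theta2 @ a2
    return sigmoid(z3)                           # a^(3) = h_Theta(x)

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))   # 3 hidden units, 3 inputs + bias -> shape s2 x (s1 + 1)
Theta2 = rng.normal(size=(1, 4))   # 1 output unit, 3 hidden units + bias
print(forward(np.array([0.5, -1.0, 2.0]), Theta1, Theta2))
```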

  15. Neural network learning its own features
  (Diagram: the hidden-layer activations $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ act as learned features that the output unit combines to produce $h_\Theta(x)$.)
  Slide credit: Andrew Ng

  16. Bias / Variance Trade-off
  (Plot: training error and cross-validation error (loss) versus degree of polynomial.)
  Source: Andrew Ng

  17. Bias / Variance Trade-off
  (Plot: the same training and cross-validation error curves, annotated with the high-bias and high-variance regimes.)

  18. Bias / Variance Trade-off with Regularization
  (Plot: training error and cross-validation error (loss) as the regularization strength varies.)
  Source: Andrew Ng

  19. Bias / Variance Trade-off with Regularization
  (Plot: the same curves annotated with the high-variance and high-bias regimes.)
  Source: Andrew Ng

  20. K-means algorithm
  Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K$
  Repeat {
    Cluster assignment step: for $i = 1$ to $N$, set $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
    Centroid update step: for $k = 1$ to $K$, set $\mu_k$ := average (mean) of the points assigned to cluster $k$
  }
  The cluster assignment step minimizes the distortion $J\!\left(c^{(1)}, \ldots, c^{(N)}, \mu_1, \ldots, \mu_K\right) = \frac{1}{N} \sum_{i=1}^{N} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$ with respect to the assignments $c^{(i)}$; the centroid update step minimizes it with respect to the centroids $\mu_k$.
  Slide credit: Andrew Ng
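A compact NumPy sketch of these two alternating steps (the initialization scheme, iteration count, and empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Alternate the cluster assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)  # random initial centroids
    for _ in range(iters):
        # Cluster assignment step: c_i = index of the centroid closest to x_i
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Centroid update step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):          # keep the old centroid if the cluster is empty
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```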

  21. Expectation Maximization (EM) Algorithm
  Goal: find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
  $\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
  Jensen's inequality: $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$ for concave $f$

  22. Expectation Maximization (EM) Algorithm
  Goal: find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$, using the lower bound $\sum_i \log p(x^{(i)}; \theta) \ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
  - The lower bound holds for every possible set of distributions $Q_i$
  - We want a tight lower bound: $f(\mathbb{E}[X]) = \mathbb{E}[f(X)]$
  - When will that happen? When $X = \mathbb{E}[X]$ with probability 1 ($X$ is a constant), i.e. $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$

  23. How should we choose $Q_i(z^{(i)})$?
  Take $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$ with $\sum_z Q_i(z) = 1$ (because it is a distribution), so
  $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$

  24. EM algorithm
  Repeat until convergence {
    (E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
    (M-step) Set $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
  }
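As a concrete instance, here is a sketch of EM for a one-dimensional Gaussian mixture, where the E-step responsibilities play the role of $Q_i(z^{(i)})$ and the M-step has a closed form (the component count, initialization, and iteration budget are illustrative):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K=2, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = rng.choice(x, K, replace=False)          # initial means drawn from the data
    var = np.full(K, x.var())                     # initial variances
    for _ in range(iters):
        # E-step: q[i, k] = p(z_i = k | x_i; theta)
        q = pi * norm.pdf(x[:, None], mu, np.sqrt(var))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        Nk = q.sum(axis=0)
        pi = Nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / Nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```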

  25. Anomaly detection algorithm
  1. Choose features $x_j$ that you think might be indicative of anomalous examples
  2. Fit parameters $\mu_1, \ldots, \mu_n$ and $\sigma_1^2, \ldots, \sigma_n^2$: $\mu_j = \frac{1}{N} \sum_{i=1}^{N} x_j^{(i)}$, $\sigma_j^2 = \frac{1}{N} \sum_{i=1}^{N} \left(x_j^{(i)} - \mu_j\right)^2$
  3. Given a new example $x$, compute $p(x) = \prod_{j=1}^{n} \mathcal{N}\!\left(x_j; \mu_j, \sigma_j^2\right)$; flag an anomaly if $p(x) < \varepsilon$
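The three steps above translate directly into a few lines of NumPy (the threshold value and function names are illustrative; in practice ε is typically chosen using a labeled validation set):

```python
import numpy as np

def fit_anomaly_model(X):
    """Step 2: fit per-feature Gaussian parameters mu_j and sigma_j^2."""
    return X.mean(axis=0), X.var(axis=0)

def is_anomaly(x, mu, var, epsilon=1e-3):
    """Step 3: p(x) = prod_j N(x_j; mu_j, sigma_j^2); anomaly if p(x) < epsilon."""
    p = np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
    return p < epsilon
```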

  26. Problem motivation
  Movie                 | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
  Love at last          | 5         | 5       | 0         | 0        | 0.9          | 0
  Romance forever       | 5         | ?       | ?         | 0        | 1.0          | 0.01
  Cute puppies of love  | ?         | 4       | 0         | ?        | 0.99         | 0
  Nonstop car chases    | 0         | 0       | 5         | 4        | 0.1          | 1.0
  Swords vs. karate     | 0         | 0       | 5         | ?        | 0            | 0.9

  27. Problem motivation
  The same ratings table, but now the movie features are unknown:
  Movie                 | Alice (1) | Bob (2) | Carol (3) | Dave (4) | x1 (romance) | x2 (action)
  Love at last          | 5         | 5       | 0         | 0        | ?            | ?
  Romance forever       | 5         | ?       | ?         | 0        | ?            | ?
  Cute puppies of love  | ?         | 4       | 0         | ?        | ?            | ?
  Nonstop car chases    | 0         | 0       | 5         | 4        | ?            | ?
  Swords vs. karate     | 0         | 0       | 5         | ?        | ?            | ?
  Given the user parameter vectors $\theta^{(1)} = [0, 5, 0]^\top$, $\theta^{(2)} = [0, 5, 0]^\top$, $\theta^{(3)} = [0, 0, 5]^\top$, $\theta^{(4)} = [0, 0, 5]^\top$, infer the feature vector $x^{(1)}$ for the first movie.

  28. Collaborative filtering optimization objective
  Given $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$, estimate $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$:
  $\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$
  Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, estimate $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$:
  $\min_{x^{(1)}, \ldots, x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$
  Minimize over $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ simultaneously:
  $J = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$

  29. Collaborative filtering algorithm
  Initialize $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values
  Minimize $J\!\left(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}\right)$ using gradient descent (or an advanced optimization algorithm). For every $j = 1, \ldots, n_u$ and $i = 1, \ldots, n_m$:
  $x_k^{(i)} \leftarrow x_k^{(i)} - \alpha \left( \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$
  $\theta_k^{(j)} \leftarrow \theta_k^{(j)} - \alpha \left( \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$
  For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$
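A vectorized NumPy sketch of this procedure, where Y holds the ratings and R masks the entries that actually exist (matrix shapes, learning rate, and initialization scale are illustrative):

```python
import numpy as np

def collaborative_filtering(Y, R, n_features=2, alpha=0.01, lam=0.1, iters=500, seed=0):
    """Joint gradient descent on movie features X and user parameters Theta.
    Y: (n_movies, n_users) ratings; R[i, j] = 1 where movie i was rated by user j."""
    rng = np.random.default_rng(seed)
    n_movies, n_users = Y.shape
    X = 0.1 * rng.normal(size=(n_movies, n_features))      # small random init
    Theta = 0.1 * rng.normal(size=(n_users, n_features))
    for _ in range(iters):
        E = (X @ Theta.T - Y) * R                # prediction error, only where rated
        X_grad = E @ Theta + lam * X
        Theta_grad = E.T @ X + lam * Theta
        X -= alpha * X_grad                      # update movie features
        Theta -= alpha * Theta_grad              # update user parameters
    return X, Theta                              # predicted rating: Theta[j] @ X[i]
```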

  30. Semi-supervised Learning Problem Formulation
  Labeled data: $\mathcal{D}_L = \left\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m_L)}, y^{(m_L)}) \right\}$
  Unlabeled data: $\mathcal{D}_U = \left\{ x^{(1)}, x^{(2)}, \ldots, x^{(m_U)} \right\}$
  Goal: learn a hypothesis $h$ (e.g., a classifier) that has small error

  31. Deep Semi-supervised Learning

  32. Ensemble methods
  Ensemble methods combine multiple classifiers to make a better one: committees, majority vote, or weighted combinations; the base classifiers can be of the same or different types.
  Boosting: train predictors sequentially, with later predictors focusing on the mistakes of earlier ones.
  Boosting for classification (e.g., AdaBoost): use the results of earlier classifiers to decide what to work on next, weighting hard examples more heavily so the next classifier focuses on them. Example: Viola-Jones for face detection.
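As a concrete boosting example, scikit-learn's AdaBoostClassifier trains shallow decision trees (stumps by default) sequentially and re-weights the examples that earlier trees misclassified (the synthetic dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 sequentially trained weak learners; hard examples get larger weights
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # accuracy of the boosted ensemble on held-out data
```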

  33. Generative models

  34. Simple Recurrent Network

  35. Reinforcement learning Markov decision process Q-learning Policy gradient

  36. Final exam sample questions

  37. Conceptual questions
  [True/False] Increasing the value of k in a k-nearest neighbor classifier will decrease its bias
  [True/False] Backpropagation helps neural network training get unstuck from a local minimum
  [True/False] Linear regression can be solved by either matrix algebra or gradient descent
  [True/False] Logistic regression can be solved by either matrix algebra or gradient descent
  [True/False] K-means clustering has a unique solution
  [True/False] PCA has a unique solution

  38. Classification/Regression
  Given a simple dataset:
  1) Estimate the parameters
  2) Compute the training error
  3) Compute the leave-one-out cross-validation error
  4) Compute the testing error

  39. Naïve Bayes
  Compute the individual probabilities, then compute $P(y = 1 \mid x_1 = 0, x_2 = 1, x_3 = 1)$ using a Naïve Bayes classifier
