
Advanced Techniques in Machine Learning
These slides cover support vector machines, logistic regression, gradient descent, Lasso regularization, and related topics, including single and multiple predictors and the terminology of regularization functions and solvers.
Presentation Transcript
Support Vector Machine I. Jia-Bin Huang, Virginia Tech, ECE-5424G / CS-5824, Spring 2019
Administrative: Please use Piazza, not email. HW 0 grades are back; re-grade requests are open for one week. HW 1 is due soon, and HW 2 is on the way.
Regularized logistic regression
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2² + θ_5 x_1 x_2 + …), a high-order polynomial of the features x_1 = Age and x_2 = Tumor Size.
Cost function: J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}
θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]   (j = 1, …, n)
}
Slide credit: Andrew Ng
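To make the update concrete, here is a minimal NumPy sketch of one regularized gradient-descent step for logistic regression; the function and variable names are illustrative, not from the slides, and the intercept θ_0 is left unpenalized as in the update above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for logistic regression.

    X: (m, n+1) design matrix with X[:, 0] == 1; y: (m,) labels in {0, 1}.
    theta[0] (the intercept) is not regularized, matching the slide.
    """
    m = len(y)
    error = sigmoid(X @ theta) - y       # h_theta(x^(i)) - y^(i)
    grad = (X.T @ error) / m             # unregularized gradient
    grad[1:] += (lam / m) * theta[1:]    # add (lambda/m) * theta_j for j >= 1
    return theta - alpha * grad
```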
L1: Lasso regularization
J(β) = (1/2m) Σ_{i=1}^{m} (h_β(x^{(i)}) − y^{(i)})² + λ Σ_{j=1}^{n} |β_j|
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
minimize_β (1/2m) Σ_{i=1}^{m} (y^{(i)} − x^{(i)} β)² + λ|β|
Solution:
β̂ = (1/m)⟨x, y⟩ − λ   if (1/m)⟨x, y⟩ > λ
β̂ = 0                  if (1/m)|⟨x, y⟩| ≤ λ
β̂ = (1/m)⟨x, y⟩ + λ   if (1/m)⟨x, y⟩ < −λ
Equivalently, β̂ = S_λ((1/m)⟨x, y⟩), where the soft-thresholding operator is S_λ(x) = sign(x)(|x| − λ)_+
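A minimal NumPy sketch of the soft-thresholding operator and the resulting single-predictor lasso solution, assuming a standardized predictor x with (1/m) Σ (x^{(i)})² = 1; the names are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S_lambda(z) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Lasso solution for one standardized predictor: soft-threshold (1/m) <x, y>."""
    m = len(y)
    return soft_threshold(x @ y / m, lam)
```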
Multiple predictors: Cyclic Coordinate Descent
minimize_β (1/2m) Σ_{i=1}^{m} (y^{(i)} − β_0 − Σ_j x_j^{(i)} β_j)² + λ Σ_j |β_j|
For each j, update β_j by solving
minimize_{β_j} (1/2m) Σ_{i=1}^{m} (r_j^{(i)} − x_j^{(i)} β_j)² + λ |β_j|,
where r_j^{(i)} = y^{(i)} − β_0 − Σ_{k≠j} x_k^{(i)} β_k is the partial residual.
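A short NumPy sketch of cyclic coordinate descent for the lasso, under the assumption that each column of X is standardized so that (1/m) Σ (x_j^{(i)})² = 1, which reduces each coordinate update to a single soft-thresholding step; the intercept is omitted and all names are illustrative.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for the lasso (sketch, no intercept)."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iters):
        for j in range(n):
            # Partial residual r_j^(i) = y^(i) - sum_{k != j} x_k^(i) * beta_k
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j / m
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0)   # soft thresholding
    return beta
```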
L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology
Regularization function | Name | Solver
‖θ‖₂² = Σ_j θ_j² | Tikhonov regularization (Ridge regression) | Closed form
‖θ‖₁ = Σ_j |θ_j| | LASSO regression | Proximal gradient descent, least angle regression
α‖θ‖₁ + (1 − α)‖θ‖₂² | Elastic net regularization | Proximal gradient descent
Things to remember
Overfitting: a complex model does well on the training set but performs poorly on the test set.
Cost function: add a norm penalty, L1 or L2.
Regularized linear regression: gradient descent with weight decay (L2 norm).
Regularized logistic regression: gradient descent with weight decay (L2 norm).
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Logistic regression
h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^{−z})
Suppose we predict y = 1 if h_θ(x) ≥ 0.5, i.e. θ^T x ≥ 0, and predict y = 0 if h_θ(x) < 0.5, i.e. θ^T x < 0.
Slide credit: Andrew Ng
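A two-line NumPy sketch of this hypothesis and its 0.5 threshold (names are illustrative):

```python
import numpy as np

def predict_logistic(theta, x):
    """h_theta(x) = g(theta^T x); predict 1 when h >= 0.5, i.e. theta^T x >= 0."""
    h = 1.0 / (1.0 + np.exp(-(theta @ x)))
    return 1 if h >= 0.5 else 0
```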
Alternative view
h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^{−z})
If y = 1, we want h_θ(x) ≈ 1, i.e. θ^T x ≫ 0.
If y = 0, we want h_θ(x) ≈ 0, i.e. θ^T x ≪ 0.
Slide credit: Andrew Ng
Cost function for Logistic Regression
Cost(h_θ(x), y) = −log h_θ(x) if y = 1; −log(1 − h_θ(x)) if y = 0
(Figure: the two cost curves plotted against h_θ(x).)
Slide credit: Andrew Ng
Alternative view of logistic regression
Cost(h_θ(x), y) = −y log h_θ(x) − (1 − y) log(1 − h_θ(x))
= y (−log 1/(1 + e^{−θ^T x})) + (1 − y)(−log(1 − 1/(1 + e^{−θ^T x})))
(Figure: the logistic cost as a function of z = θ^T x for y = 1 and for y = 0, each approximated by a piecewise-linear curve.)
Logistic regression (logistic loss):
min_θ (1/m) Σ_{i=1}^{m} [ −y^{(i)} log h_θ(x^{(i)}) − (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
Support vector machine (hinge loss):
min_θ (1/m) Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
(Figure: cost_1 and cost_0 are piecewise-linear approximations of the logistic cost curves for y = 1 and y = 0.)
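To make the comparison concrete, here is a small NumPy sketch of the two per-example losses as functions of z = θ^T x, taking cost_1(z) = max(0, 1 − z) and cost_0(z) = max(0, 1 + z) as the hinge pieces (the slide's piecewise-linear curves, up to slope); the names are illustrative.

```python
import numpy as np

def logistic_loss(z, y):
    """Per-example logistic loss as a function of z = theta^T x, with y in {0, 1}."""
    h = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(h) - (1 - y) * np.log(1.0 - h)

def hinge_loss(z, y):
    """Per-example SVM surrogate: cost1(z) = max(0, 1 - z), cost0(z) = max(0, 1 + z)."""
    return y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z)
```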
Optimization objective for SVM
min_θ (1/m) Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
1) Multiply the objective by m (this does not change the minimizer). 2) Reparameterize with C = 1/λ. This gives
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Slide credit: Andrew Ng
Hypothesis of SVM
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Hypothesis: h_θ(x) = 1 if θ^T x ≥ 0, and 0 if θ^T x < 0.
Slide credit: Andrew Ng
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Support vector machine
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
(Figure: cost_1(z) for y = 1 and cost_0(z) for y = 0.)
If y = 1, we want θ^T x ≥ 1 (not just ≥ 0).
If y = 0, we want θ^T x ≤ −1 (not just < 0).
Slide credit: Andrew Ng
SVM decision boundary
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Let's say we have a very large C. The optimal θ then drives the first term to zero, and the problem becomes
min_θ (1/2) Σ_{j=1}^{n} θ_j²  s.t.  θ^T x^{(i)} ≥ 1 if y^{(i)} = 1, and θ^T x^{(i)} ≤ −1 if y^{(i)} = 0.
Slide credit: Andrew Ng
SVM decision boundary: Linearly separable case (figure: two classes in the (x_1, x_2) plane with several possible separating lines).
Slide credit: Andrew Ng
SVM decision boundary: Linearly separable case (figure: the SVM decision boundary in the (x_1, x_2) plane with the margin marked).
Slide credit: Andrew Ng
Large margin classifier in the presence of outliers (figure: with C very large the boundary swings to accommodate a single outlier; with C not too large it stays close to the large-margin boundary).
Slide credit: Andrew Ng
Vector inner product
u = [u_1; u_2], v = [v_1; v_2]
‖u‖ = length of vector u = √(u_1² + u_2²)
p = (signed) length of the projection of v onto u
u^T v = p · ‖u‖ = u_1 v_1 + u_2 v_2
Slide credit: Andrew Ng
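A quick NumPy check of the identity u^T v = p · ‖u‖ on an arbitrary pair of 2-D vectors (the values are illustrative):

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)      # ||u|| = sqrt(u1^2 + u2^2)
p = (u @ v) / norm_u            # signed length of the projection of v onto u
print(u @ v, p * norm_u)        # both equal u1*v1 + u2*v2 = 10.0
```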
SVM decision boundary
min_θ (1/2) Σ_{j=1}^{n} θ_j² = (1/2)(θ_1² + θ_2²) = (1/2) ‖θ‖²
s.t.  θ^T x^{(i)} ≥ 1 if y^{(i)} = 1, and θ^T x^{(i)} ≤ −1 if y^{(i)} = 0.
Simplification: θ_0 = 0, n = 2. What is θ^T x^{(i)}? Writing p^{(i)} for the projection of x^{(i)} onto θ, we have θ^T x^{(i)} = p^{(i)} · ‖θ‖.
Slide credit: Andrew Ng
SVM decision boundary (continued)
min_θ (1/2) ‖θ‖²  s.t.  p^{(i)} ‖θ‖ ≥ 1 if y^{(i)} = 1, and p^{(i)} ‖θ‖ ≤ −1 if y^{(i)} = 0, where p^{(i)} is the projection of x^{(i)} onto θ. Simplification: θ_0 = 0, n = 2.
If the projections p^{(1)}, p^{(2)}, … are small, ‖θ‖ must be large; if the projections are large, ‖θ‖ can be small. Minimizing ‖θ‖ therefore pushes the boundary toward large projections, i.e. a large margin.
Slide credit: Andrew Ng
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Non-linear decision boundary
Predict y = 1 if θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1 x_2 + θ_4 x_1² + θ_5 x_2² + … ≥ 0
Equivalently, θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 + … with f_1 = x_1, f_2 = x_2, f_3 = x_1 x_2, …
Is there a different/better choice of the features f_1, f_2, f_3, …?
Slide credit: Andrew Ng
Kernel
Given x, compute new features depending on proximity to landmarks l^{(1)}, l^{(2)}, l^{(3)}:
f_1 = similarity(x, l^{(1)}), f_2 = similarity(x, l^{(2)}), f_3 = similarity(x, l^{(3)})
Gaussian kernel: similarity(x, l^{(i)}) = exp(−‖x − l^{(i)}‖² / (2σ²))
Slide credit: Andrew Ng
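The Gaussian similarity translates directly into code; a minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma):
    """f = exp(-||x - l||^2 / (2 sigma^2)): near 1 close to the landmark, near 0 far away."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))
```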
(Figure: three landmarks l^{(1)}, l^{(2)}, l^{(3)} in the (x_1, x_2) plane.)
Predict y = 1 if θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 ≥ 0, with f_i = similarity(x, l^{(i)}).
Example: θ_0 = −0.5, θ_1 = 1, θ_2 = 1, θ_3 = 0, so points near l^{(1)} or l^{(2)} are predicted y = 1.
Slide credit: Andrew Ng
Choosing the landmarks
Given x: f_i = similarity(x, l^{(i)}) = exp(−‖x − l^{(i)}‖² / (2σ²))
Predict y = 1 if θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 ≥ 0
Where do we get l^{(1)}, l^{(2)}, l^{(3)}, …?
Slide credit: Andrew Ng
SVM with kernels
Given (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), …, (x^{(m)}, y^{(m)}), choose the landmarks l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, …, l^{(m)} = x^{(m)}.
Given example x: f_1 = similarity(x, l^{(1)}), f_2 = similarity(x, l^{(2)}), …
For training example (x^{(i)}, y^{(i)}), collect the features into f^{(i)} = [f_0^{(i)}; f_1^{(i)}; …; f_m^{(i)}] with f_0^{(i)} = 1.
Slide credit: Andrew Ng
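A minimal NumPy sketch of this feature mapping, using every training example as a landmark and prepending f_0 = 1; the names are illustrative.

```python
import numpy as np

def kernel_features(x, X_train, sigma):
    """Map x to f = [1, f_1, ..., f_m] with f_i = exp(-||x - x^(i)||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)   # landmarks are the m training examples
    f = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.concatenate(([1.0], f))               # f_0 = 1 for the intercept term
```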
SVM with kernels
Hypothesis: given x, compute features f ∈ R^{m+1}; predict y = 1 if θ^T f ≥ 0.
Training (original):
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Training (with kernel):
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T f^{(i)}) + (1 − y^{(i)}) cost_0(θ^T f^{(i)}) ] + (1/2) Σ_{j=1}^{m} θ_j²
SVM parameters
C (= 1/λ): Large C: lower bias, higher variance. Small C: higher bias, lower variance.
σ²: Large σ²: features f_i vary more smoothly; higher bias, lower variance. Small σ²: features f_i vary less smoothly; lower bias, higher variance.
Slide credit: Andrew Ng
SVM song. Video source: https://www.youtube.com/watch?v=g15bqtyidZs
SVM Demo https://cs.stanford.edu/people/karpathy/svmjs/demo/
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Using SVM
Use an SVM software package (e.g., liblinear, libsvm) to solve for θ. Need to specify:
Choice of parameter C.
Choice of kernel (similarity function):
Linear kernel: predict y = 1 if θ^T x ≥ 0.
Gaussian kernel: f_i = exp(−‖x − l^{(i)}‖² / (2σ²)), where l^{(i)} = x^{(i)}. Need to choose σ². Need proper feature scaling.
Slide credit: Andrew Ng
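As one concrete, hedged illustration: scikit-learn's SVC wraps libsvm and exposes the Gaussian kernel as kernel="rbf" with gamma = 1/(2σ²). The slides do not prescribe this particular package, and the toy data below is purely illustrative; note the StandardScaler step, matching the feature-scaling advice above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data; in practice X, y come from your training set.
X = np.random.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

sigma = 1.0
clf = make_pipeline(
    StandardScaler(),                                          # proper feature scaling
    SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma ** 2)),    # Gaussian (RBF) kernel
)
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))
```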
Kernel (similarity) functions
Note: not all similarity functions make valid kernels. Many off-the-shelf kernels are available: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel.
Slide credit: Andrew Ng
Multi-class classification
Use the one-vs.-all method: train K SVMs, one to distinguish y = i from the rest, obtaining θ^{(1)}, θ^{(2)}, …, θ^{(K)}. Pick the class i with the largest (θ^{(i)})^T x.
Slide credit: Andrew Ng
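A short sketch of the one-vs.-all scheme, using scikit-learn's LinearSVC as the binary learner purely for illustration (any binary SVM solver would do); the function names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all_fit(X, y, num_classes, C=1.0):
    """Train K binary SVMs; the i-th one separates class i from the rest."""
    return [LinearSVC(C=C).fit(X, (y == i).astype(int)) for i in range(num_classes)]

def one_vs_all_predict(models, X):
    """Pick the class whose SVM gives the largest decision value (theta^(i))^T x."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```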
Logistic regression vs. SVMs
n = number of features (x ∈ R^{n+1}), m = number of training examples.
1. If n is large relative to m (e.g., n = 10,000, m = 10 to 1,000): use logistic regression or an SVM without a kernel ("linear kernel").
2. If n is small and m is intermediate (e.g., n = 1 to 1,000, m = 10 to 10,000): use an SVM with a Gaussian kernel.
3. If n is small and m is large (e.g., n = 1 to 1,000, m = 50,000+): create/add more features, then use logistic regression or a linear SVM.
A neural network is likely to work well for most of these cases, but is slower to train.
Slide credit: Andrew Ng
Things to remember
Cost function:
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Large margin classification (figure: decision boundary with margin in the (x_1, x_2) plane)
Kernels
Using an SVM