
Advanced Techniques in Machine Learning
These slides cover support vector machines, logistic regression, gradient descent, Lasso regularization, and related topics, including single and multiple predictors and the terminology of regularization functions and solvers.
Presentation Transcript
Support Vector Machine I. Jia-Bin Huang, Virginia Tech, ECE-5424G / CS-5824, Spring 2019
Administrative: Please use Piazza, not email. HW 0 grades are back; re-grade requests are open for one week. HW 1 is due soon, and HW 2 is on the way.
Regularized logistic regression
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2² + θ_5 x_1 x_2 + …), a high-order polynomial of the features x_1 = Age and x_2 = Tumor Size.
Cost function: J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}
θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]   (j = 1, …, n)
}
Slide credit: Andrew Ng
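To make the update concrete, here is a minimal NumPy sketch of one regularized gradient-descent step for logistic regression; the function and variable names are illustrative, not from the slides, and the intercept θ_0 is left unpenalized as in the update above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for logistic regression.

    X: (m, n+1) design matrix with X[:, 0] == 1; y: (m,) labels in {0, 1}.
    theta[0] (the intercept) is not regularized, matching the slide.
    """
    m = len(y)
    error = sigmoid(X @ theta) - y       # h_theta(x^(i)) - y^(i)
    grad = (X.T @ error) / m             # unregularized gradient
    grad[1:] += (lam / m) * theta[1:]    # add (lambda/m) * theta_j for j >= 1
    return theta - alpha * grad
```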
L1: Lasso regularization
J(β) = (1/2m) Σ_{i=1}^{m} (h_β(x^{(i)}) − y^{(i)})² + λ Σ_{j=1}^{n} |β_j|
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
minimize_β (1/2m) Σ_{i=1}^{m} (y^{(i)} − x^{(i)} β)² + λ|β|
Solution:
β̂ = (1/m)⟨x, y⟩ − λ   if (1/m)⟨x, y⟩ > λ
β̂ = 0                  if (1/m)|⟨x, y⟩| ≤ λ
β̂ = (1/m)⟨x, y⟩ + λ   if (1/m)⟨x, y⟩ < −λ
Equivalently, β̂ = S_λ((1/m)⟨x, y⟩), where the soft-thresholding operator is S_λ(x) = sign(x)(|x| − λ)_+
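A minimal NumPy sketch of the soft-thresholding operator and the resulting single-predictor lasso solution, assuming a standardized predictor x with (1/m) Σ (x^{(i)})² = 1; the names are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S_lambda(z) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Lasso solution for one standardized predictor: soft-threshold (1/m) <x, y>."""
    m = len(y)
    return soft_threshold(x @ y / m, lam)
```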
Multiple predictors: Cyclic Coordinate Descent
minimize_β (1/2m) Σ_{i=1}^{m} (y^{(i)} − β_0 − Σ_j x_j^{(i)} β_j)² + λ Σ_j |β_j|
For each j, update β_j by solving
minimize_{β_j} (1/2m) Σ_{i=1}^{m} (r_j^{(i)} − x_j^{(i)} β_j)² + λ |β_j|,
where r_j^{(i)} = y^{(i)} − β_0 − Σ_{k≠j} x_k^{(i)} β_k is the partial residual.
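A short NumPy sketch of cyclic coordinate descent for the lasso, under the assumption that each column of X is standardized so that (1/m) Σ (x_j^{(i)})² = 1, which reduces each coordinate update to a single soft-thresholding step; the intercept is omitted and all names are illustrative.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for the lasso (sketch, no intercept)."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iters):
        for j in range(n):
            # Partial residual r_j^(i) = y^(i) - sum_{k != j} x_k^(i) * beta_k
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j / m
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0)   # soft thresholding
    return beta
```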
L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology
Regularization function | Name | Solver
‖θ‖₂² = Σ_j θ_j² | Tikhonov regularization (Ridge regression) | Closed form
‖θ‖₁ = Σ_j |θ_j| | LASSO regression | Proximal gradient descent, least angle regression
α‖θ‖₁ + (1 − α)‖θ‖₂² | Elastic net regularization | Proximal gradient descent
Things to remember
Overfitting: a complex model does well on the training set but performs poorly on the test set.
Cost function: add a norm penalty, L1 or L2.
Regularized linear regression: gradient descent with weight decay (L2 norm).
Regularized logistic regression: gradient descent with weight decay (L2 norm).
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Logistic regression
h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^{−z})
Suppose we predict y = 1 if h_θ(x) ≥ 0.5, i.e. θ^T x ≥ 0, and predict y = 0 if h_θ(x) < 0.5, i.e. θ^T x < 0.
Slide credit: Andrew Ng
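A two-line NumPy sketch of this hypothesis and its 0.5 threshold (names are illustrative):

```python
import numpy as np

def predict_logistic(theta, x):
    """h_theta(x) = g(theta^T x); predict 1 when h >= 0.5, i.e. theta^T x >= 0."""
    h = 1.0 / (1.0 + np.exp(-(theta @ x)))
    return 1 if h >= 0.5 else 0
```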
Alternative view
h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^{−z})
If y = 1, we want h_θ(x) ≈ 1, i.e. θ^T x ≫ 0.
If y = 0, we want h_θ(x) ≈ 0, i.e. θ^T x ≪ 0.
Slide credit: Andrew Ng
Cost function for Logistic Regression
Cost(h_θ(x), y) = −log h_θ(x) if y = 1; −log(1 − h_θ(x)) if y = 0
(Figure: the two cost curves plotted against h_θ(x).)
Slide credit: Andrew Ng
Alternative view of logistic regression
Cost(h_θ(x), y) = −y log h_θ(x) − (1 − y) log(1 − h_θ(x))
= y (−log 1/(1 + e^{−θ^T x})) + (1 − y)(−log(1 − 1/(1 + e^{−θ^T x})))
(Figure: the logistic cost as a function of z = θ^T x for y = 1 and for y = 0, each approximated by a piecewise-linear curve.)
Logistic regression (logistic loss):
min_θ (1/m) Σ_{i=1}^{m} [ −y^{(i)} log h_θ(x^{(i)}) − (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
Support vector machine (hinge loss):
min_θ (1/m) Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
(Figure: cost_1 and cost_0 are piecewise-linear approximations of the logistic cost curves for y = 1 and y = 0.)
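To make the comparison concrete, here is a small NumPy sketch of the two per-example losses as functions of z = θ^T x, taking cost_1(z) = max(0, 1 − z) and cost_0(z) = max(0, 1 + z) as the hinge pieces (the slide's piecewise-linear curves, up to slope); the names are illustrative.

```python
import numpy as np

def logistic_loss(z, y):
    """Per-example logistic loss as a function of z = theta^T x, with y in {0, 1}."""
    h = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(h) - (1 - y) * np.log(1.0 - h)

def hinge_loss(z, y):
    """Per-example SVM surrogate: cost1(z) = max(0, 1 - z), cost0(z) = max(0, 1 + z)."""
    return y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z)
```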
Optimization objective for SVM
min_θ (1/m) Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
1) Multiply the objective by m (this does not change the minimizer). 2) Reparameterize with C = 1/λ. This gives
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Slide credit: Andrew Ng
Hypothesis of SVM
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Hypothesis: h_θ(x) = 1 if θ^T x ≥ 0, and 0 if θ^T x < 0.
Slide credit: Andrew Ng
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Support vector machine
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
(Figure: cost_1(z) for y = 1 and cost_0(z) for y = 0.)
If y = 1, we want θ^T x ≥ 1 (not just ≥ 0).
If y = 0, we want θ^T x ≤ −1 (not just < 0).
Slide credit: Andrew Ng
SVM decision boundary
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Let's say we have a very large C. The optimal θ then drives the first term to zero, and the problem becomes
min_θ (1/2) Σ_{j=1}^{n} θ_j²  s.t.  θ^T x^{(i)} ≥ 1 if y^{(i)} = 1, and θ^T x^{(i)} ≤ −1 if y^{(i)} = 0.
Slide credit: Andrew Ng
SVM decision boundary: Linearly separable case (figure: two classes in the (x_1, x_2) plane with several possible separating lines).
Slide credit: Andrew Ng
SVM decision boundary: Linearly separable case (figure: the SVM decision boundary in the (x_1, x_2) plane with the margin marked).
Slide credit: Andrew Ng
Large margin classifier in the presence of outliers (figure: with C very large the boundary swings to accommodate a single outlier; with C not too large it stays close to the large-margin boundary).
Slide credit: Andrew Ng
Vector inner product
u = [u_1; u_2], v = [v_1; v_2]
‖u‖ = length of vector u = √(u_1² + u_2²)
p = (signed) length of the projection of v onto u
u^T v = p · ‖u‖ = u_1 v_1 + u_2 v_2
Slide credit: Andrew Ng
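A quick NumPy check of the identity u^T v = p · ‖u‖ on an arbitrary pair of 2-D vectors (the values are illustrative):

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)      # ||u|| = sqrt(u1^2 + u2^2)
p = (u @ v) / norm_u            # signed length of the projection of v onto u
print(u @ v, p * norm_u)        # both equal u1*v1 + u2*v2 = 10.0
```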
SVM decision boundary
min_θ (1/2) Σ_{j=1}^{n} θ_j² = (1/2)(θ_1² + θ_2²) = (1/2) ‖θ‖²
s.t.  θ^T x^{(i)} ≥ 1 if y^{(i)} = 1, and θ^T x^{(i)} ≤ −1 if y^{(i)} = 0.
Simplification: θ_0 = 0, n = 2. What is θ^T x^{(i)}? Writing p^{(i)} for the projection of x^{(i)} onto θ, we have θ^T x^{(i)} = p^{(i)} · ‖θ‖.
Slide credit: Andrew Ng
SVM decision boundary (continued)
min_θ (1/2) ‖θ‖²  s.t.  p^{(i)} ‖θ‖ ≥ 1 if y^{(i)} = 1, and p^{(i)} ‖θ‖ ≤ −1 if y^{(i)} = 0, where p^{(i)} is the projection of x^{(i)} onto θ. Simplification: θ_0 = 0, n = 2.
If the projections p^{(1)}, p^{(2)}, … are small, ‖θ‖ must be large; if the projections are large, ‖θ‖ can be small. Minimizing ‖θ‖ therefore pushes the boundary toward large projections, i.e. a large margin.
Slide credit: Andrew Ng
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Non-linear decision boundary
Predict y = 1 if θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1 x_2 + θ_4 x_1² + θ_5 x_2² + … ≥ 0
Equivalently, θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 + … with f_1 = x_1, f_2 = x_2, f_3 = x_1 x_2, …
Is there a different/better choice of the features f_1, f_2, f_3, …?
Slide credit: Andrew Ng
Kernel
Given x, compute new features depending on proximity to landmarks l^{(1)}, l^{(2)}, l^{(3)}:
f_1 = similarity(x, l^{(1)}), f_2 = similarity(x, l^{(2)}), f_3 = similarity(x, l^{(3)})
Gaussian kernel: similarity(x, l^{(i)}) = exp(−‖x − l^{(i)}‖² / (2σ²))
Slide credit: Andrew Ng
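The Gaussian similarity translates directly into code; a minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma):
    """f = exp(-||x - l||^2 / (2 sigma^2)): near 1 close to the landmark, near 0 far away."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))
```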
(Figure: three landmarks l^{(1)}, l^{(2)}, l^{(3)} in the (x_1, x_2) plane.)
Predict y = 1 if θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 ≥ 0, with f_i = similarity(x, l^{(i)}).
Example: θ_0 = −0.5, θ_1 = 1, θ_2 = 1, θ_3 = 0, so points near l^{(1)} or l^{(2)} are predicted y = 1.
Slide credit: Andrew Ng
Choosing the landmarks
Given x: f_i = similarity(x, l^{(i)}) = exp(−‖x − l^{(i)}‖² / (2σ²))
Predict y = 1 if θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 ≥ 0
Where do we get l^{(1)}, l^{(2)}, l^{(3)}, …?
Slide credit: Andrew Ng
SVM with kernels
Given (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), …, (x^{(m)}, y^{(m)}), choose the landmarks l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, …, l^{(m)} = x^{(m)}.
Given example x: f_1 = similarity(x, l^{(1)}), f_2 = similarity(x, l^{(2)}), …
For training example (x^{(i)}, y^{(i)}), collect the features into f^{(i)} = [f_0^{(i)}; f_1^{(i)}; …; f_m^{(i)}] with f_0^{(i)} = 1.
Slide credit: Andrew Ng
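A minimal NumPy sketch of this feature mapping, using every training example as a landmark and prepending f_0 = 1; the names are illustrative.

```python
import numpy as np

def kernel_features(x, X_train, sigma):
    """Map x to f = [1, f_1, ..., f_m] with f_i = exp(-||x - x^(i)||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)   # landmarks are the m training examples
    f = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.concatenate(([1.0], f))               # f_0 = 1 for the intercept term
```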
SVM with kernels
Hypothesis: given x, compute features f ∈ R^{m+1}; predict y = 1 if θ^T f ≥ 0.
Training (original):
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Training (with kernel):
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T f^{(i)}) + (1 − y^{(i)}) cost_0(θ^T f^{(i)}) ] + (1/2) Σ_{j=1}^{m} θ_j²
SVM parameters
C (= 1/λ): Large C: lower bias, higher variance. Small C: higher bias, lower variance.
σ²: Large σ²: features f_i vary more smoothly; higher bias, lower variance. Small σ²: features f_i vary less smoothly; lower bias, higher variance.
Slide credit: Andrew Ng
SVM song. Video source: https://www.youtube.com/watch?v=g15bqtyidZs
SVM Demo https://cs.stanford.edu/people/karpathy/svmjs/demo/
Support Vector Machine: Cost function, Large margin classification, Kernels, Using an SVM
Using SVM
Use an SVM software package (e.g., liblinear, libsvm) to solve for θ. Need to specify:
Choice of parameter C.
Choice of kernel (similarity function):
Linear kernel: predict y = 1 if θ^T x ≥ 0.
Gaussian kernel: f_i = exp(−‖x − l^{(i)}‖² / (2σ²)), where l^{(i)} = x^{(i)}. Need to choose σ². Need proper feature scaling.
Slide credit: Andrew Ng
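As one concrete, hedged illustration: scikit-learn's SVC wraps libsvm and exposes the Gaussian kernel as kernel="rbf" with gamma = 1/(2σ²). The slides do not prescribe this particular package, and the toy data below is purely illustrative; note the StandardScaler step, matching the feature-scaling advice above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data; in practice X, y come from your training set.
X = np.random.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

sigma = 1.0
clf = make_pipeline(
    StandardScaler(),                                          # proper feature scaling
    SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma ** 2)),    # Gaussian (RBF) kernel
)
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))
```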
Kernel (similarity) functions
Note: not all similarity functions make valid kernels. Many off-the-shelf kernels are available: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel.
Slide credit: Andrew Ng
Multi-class classification
Use the one-vs.-all method: train K SVMs, one to distinguish y = i from the rest, obtaining θ^{(1)}, θ^{(2)}, …, θ^{(K)}. Pick the class i with the largest (θ^{(i)})^T x.
Slide credit: Andrew Ng
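A short sketch of the one-vs.-all scheme, using scikit-learn's LinearSVC as the binary learner purely for illustration (any binary SVM solver would do); the function names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all_fit(X, y, num_classes, C=1.0):
    """Train K binary SVMs; the i-th one separates class i from the rest."""
    return [LinearSVC(C=C).fit(X, (y == i).astype(int)) for i in range(num_classes)]

def one_vs_all_predict(models, X):
    """Pick the class whose SVM gives the largest decision value (theta^(i))^T x."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```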
Logistic regression vs. SVMs
n = number of features (x ∈ R^{n+1}), m = number of training examples.
1. If n is large relative to m (e.g., n = 10,000, m = 10 to 1,000): use logistic regression or an SVM without a kernel ("linear kernel").
2. If n is small and m is intermediate (e.g., n = 1 to 1,000, m = 10 to 10,000): use an SVM with a Gaussian kernel.
3. If n is small and m is large (e.g., n = 1 to 1,000, m = 50,000+): create/add more features, then use logistic regression or a linear SVM.
A neural network is likely to work well for most of these cases, but is slower to train.
Slide credit: Andrew Ng
Things to remember
Cost function:
min_θ C Σ_{i=1}^{m} [ y^{(i)} cost_1(θ^T x^{(i)}) + (1 − y^{(i)}) cost_0(θ^T x^{(i)}) ] + (1/2) Σ_{j=1}^{n} θ_j²
Large margin classification (figure: decision boundary with margin in the (x_1, x_2) plane)
Kernels
Using an SVM