
Optimizing Constrained Convex Functions for Data Science
These lecture slides cover constrained convex optimization, gradient descent, boosting, and learning from experts. They discuss why non-convex problems such as knapsack are NP-hard, review convex multivariate functions and common examples, show how problems such as linear equations, least squares, and SVMs can be cast as constrained convex programs, and analyze projected, online, and stochastic gradient descent, followed by a boosting algorithm built from weak learners.
CSCI B609: Foundations of Data Science
Lecture 13/14: Gradient Descent, Boosting and Learning from Experts
Slides at http://grigory.us/data-science-class.html
Grigory Yaroslavtsev, http://grigory.us
Constrained Convex Optimization
Non-convex optimization is NP-hard: e.g. the constraint Σᵢ xᵢ²(1 − xᵢ)² = 0 forces xᵢ ∈ {0,1} for all i.
Knapsack: minimize Σᵢ cᵢxᵢ subject to Σᵢ wᵢxᵢ ≥ W, xᵢ ∈ {0,1}.
Convex optimization can often be solved by the ellipsoid algorithm in poly(n) time, but this is too slow in practice.
Convex multivariate functions
Convexity (first-order condition): for all x, y: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.
Equivalently: for all x, y and 0 ≤ λ ≤ 1: f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
If higher derivatives exist: f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(x)(y − x) + ..., where ∇²f is the Hessian matrix with entries (∇²f)ᵢⱼ = ∂²f / ∂xᵢ∂xⱼ.
f is convex iff its Hessian is positive semidefinite, i.e. yᵀ∇²f(x)y ≥ 0 for all x, y.
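To make the second-order condition concrete, here is a minimal numerical sketch (not from the slides; the matrix A is an assumption): for f(x) = xᵀAx the Hessian is the constant matrix 2A, so convexity can be checked by confirming that all its eigenvalues are nonnegative.

```python
import numpy as np

# Assumed symmetric matrix defining f(x) = x^T A x (illustrative only).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
hessian = 2 * A                          # Hessian of x^T A x is the constant matrix 2A
eigenvalues = np.linalg.eigvalsh(hessian)
print("Hessian eigenvalues:", eigenvalues)
print("f is convex:", bool(np.all(eigenvalues >= -1e-12)))   # PSD check
```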
Examples of convex functions
ℓp-norm is convex for 1 ≤ p ≤ ∞: ‖λx + (1 − λ)y‖p ≤ ‖λx‖p + ‖(1 − λ)y‖p = λ‖x‖p + (1 − λ)‖y‖p.
f(x) = log(e^{x₁} + e^{x₂} + ... + e^{xₙ}): max(x₁, ..., xₙ) ≤ f(x) ≤ max(x₁, ..., xₙ) + log n.
f(x) = xᵀAx where A is a p.s.d. matrix; ∇²f = 2A.
Examples of constrained convex optimization:
(Linear equations with p.s.d. constraints): minimize ½ xᵀAx − xᵀb (the solution satisfies Ax = b).
(Least squares regression): minimize ‖Ax − b‖₂² = xᵀAᵀAx − 2xᵀAᵀb + bᵀb.
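As a small illustration of the least-squares example (the data below are assumed, not from the slides): the minimizer of ‖Ax − b‖₂² satisfies the normal equations AᵀAx = Aᵀb, which can be solved directly and checked against a library solver.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))        # assumed design matrix
b = rng.standard_normal(50)             # assumed targets

x_normal_eq = np.linalg.solve(A.T @ A, A.T @ b)   # solve the normal equations A^T A x = A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # library least-squares solver for comparison
print(np.allclose(x_normal_eq, x_lstsq))          # both give the same minimizer
```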
Constrained Convex Optimization
General formulation for a convex f and a convex set K:
minimize f(x) subject to x ∈ K.
Example (SVMs): data x₁, ..., xₙ ∈ ℝᵈ labeled by y₁, ..., yₙ ∈ {−1, 1} (spam / non-spam).
Find a linear model: w·xᵢ ≥ 1 if xᵢ is spam, w·xᵢ ≤ −1 if xᵢ is non-spam, i.e. for all i: 1 − yᵢ(w·xᵢ) ≤ 0.
More robust version: minimize Σᵢ Loss(1 − yᵢ(w·xᵢ)) + λ‖w‖₂², e.g. hinge loss Loss(t) = max(0, t).
Another regularizer: ‖w‖₁ (favors sparse solutions).
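A minimal sketch of the robust SVM objective above (the helper name svm_objective and the toy data are assumptions): it evaluates the hinge loss plus the ℓ₂ regularizer for a given weight vector.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    margins = 1 - y * (X @ w)                 # 1 - y_i <w, x_i> for each example
    hinge = np.maximum(0.0, margins).sum()    # hinge loss max(0, t), summed over examples
    return hinge + lam * np.dot(w, w)         # add the l2 regularizer lam * ||w||_2^2

# Assumed toy data with +/-1 labels.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(100))
print(svm_objective(np.zeros(5), X, y, lam=0.1))
```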
Gradient Descent for Constrained Convex Optimization
(Projection): for y ∉ K, proj_K(y) = argmin_{z ∈ K} ‖z − y‖₂. Easy to compute for the unit ℓ₂ ball: proj(y) = y / ‖y‖₂.
Let ‖∇f(x)‖₂ ≤ G for all x ∈ K, and max_{x,y ∈ K} ‖x − y‖₂ ≤ D.
Let T = 4G²D²/ε².
Gradient descent (gradient + projection oracles):
Let η = D / (G√T).
Repeat for t = 0, ..., T:
  y^(t+1) = x^(t) − η ∇f(x^(t))
  x^(t+1) = projection of y^(t+1) onto K
Output z = (1/T) Σₜ x^(t).
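The loop above can be sketched as follows. The step size η = D/(G√T) and the averaging of iterates follow the slide; the concrete objective f(x) = ‖x − c‖₂² and the unit-ball constraint set are illustrative assumptions.

```python
import numpy as np

def project_unit_ball(y):
    norm = np.linalg.norm(y)
    return y if norm <= 1 else y / norm

def projected_gd(grad, x0, D, G, T):
    eta = D / (G * np.sqrt(T))                     # eta = D / (G sqrt(T)), as on the slide
    x, total = x0, np.zeros_like(x0)
    for _ in range(T):
        x = project_unit_ball(x - eta * grad(x))   # gradient step, then projection onto K
        total += x
    return total / T                               # output the average of the iterates

c = np.array([2.0, 0.0])                           # assumed target; the optimum in K is (1, 0)
z = projected_gd(lambda x: 2 * (x - c),            # gradient of ||x - c||_2^2
                 np.zeros(2), D=2.0, G=6.0, T=2000)
print(z)                                           # close to (1, 0)
```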
Gradient Descent for Constrained Convex Optimization
Analysis:
‖x^(t+1) − x*‖₂² ≤ ‖y^(t+1) − x*‖₂² = ‖x^(t) − η∇f(x^(t)) − x*‖₂²
  = ‖x^(t) − x*‖₂² + η²‖∇f(x^(t))‖₂² − 2η ∇f(x^(t))·(x^(t) − x*)
Rearranging, and using the definition of G:
∇f(x^(t))·(x^(t) − x*) ≤ (‖x^(t) − x*‖₂² − ‖x^(t+1) − x*‖₂²)/(2η) + (η/2)‖∇f(x^(t))‖₂²
  ≤ (‖x^(t) − x*‖₂² − ‖x^(t+1) − x*‖₂²)/(2η) + (η/2)G²
Sum over t = 1, ..., T (the first term telescopes):
(1/T) Σₜ ∇f(x^(t))·(x^(t) − x*) ≤ ‖x^(0) − x*‖₂²/(2ηT) + (η/2)G²
Gradient Descent for Constrained Convex Optimization (cont.)
(1/T) Σₜ ∇f(x^(t))·(x^(t) − x*) ≤ ‖x^(0) − x*‖₂²/(2ηT) + (η/2)G² ≤ D²/(2ηT) + (η/2)G²
By convexity, f(x^(t)) − f(x*) ≤ ∇f(x^(t))·(x^(t) − x*), so
f(z) − f(x*) ≤ (1/T) Σₜ (f(x^(t)) − f(x*)) ≤ RHS.
Set η = D/(G√T): RHS ≤ DG/√T ≤ ε for T = 4G²D²/ε².
Online Gradient Descent
Gradient descent works in a more general case:
Fix a convex set K and a sequence of convex functions f₁, f₂, ..., f_T.
At step t, need to output x^(t) ∈ K.
Let x* be the minimizer of Σₜ fₜ(x).
Minimize regret: Σₜ fₜ(x^(t)) − Σₜ fₜ(x*).
The same analysis as before works in the online case.
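A toy sketch of the online setting (the per-round losses fₜ(x) = (aₜ·x − bₜ)² and the unit-ball constraint set are assumptions, not from the slides): at each round the algorithm first commits to x^(t), only then sees fₜ, and takes a projected gradient step on it.

```python
import numpy as np

def project_unit_ball(y):
    norm = np.linalg.norm(y)
    return y if norm <= 1 else y / norm

rng = np.random.default_rng(2)
T, d = 500, 3
x = np.zeros(d)
eta = 1.0 / np.sqrt(T)
losses = []
for t in range(T):
    a, b = rng.standard_normal(d), rng.standard_normal()
    losses.append((a @ x - b) ** 2)          # loss of the point we already played
    grad = 2 * (a @ x - b) * a               # gradient of f_t at x^(t)
    x = project_unit_ball(x - eta * grad)    # projected gradient step for the next round
# Regret would compare sum(losses) against the best fixed point x* in hindsight.
print("average per-round loss:", np.mean(losses))
```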
Stochastic Gradient Descent
(Expected gradient oracle): returns g̃ such that E[g̃] = ∇f(x).
Example: for the SVM loss, pick one random term from the sum.
Let g̃ᵢ be the gradient returned at step i.
Let fᵢ(x) = g̃ᵢ·x be the function used in the i-th step of OGD.
Let z = (1/T) Σᵢ x^(i) and let x* be the minimizer of f.
Stochastic Gradient Descent (cont.)
Thm. E[f(z)] ≤ f(x*) + DG/√T, where G is an upper bound on ‖g̃‖₂ for any gradient output by the oracle.
Proof:
f(z) − f(x*) ≤ (1/T) Σᵢ (f(x^(i)) − f(x*))   (convexity)
  ≤ (1/T) Σᵢ ∇f(x^(i))·(x^(i) − x*)   (convexity)
E[f(z) − f(x*)] ≤ (1/T) Σᵢ E[g̃ᵢ·(x^(i) − x*)]   (gradient oracle)
  = (1/T) Σᵢ E[fᵢ(x^(i)) − fᵢ(x*)]
  = (1/T) E[regret of OGD] ≤ DG/√T, always.
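A sketch of the expected-gradient oracle for the SVM objective Σᵢ max(0, 1 − yᵢ⟨w, xᵢ⟩) + λ‖w‖₂² (data, step size, and the function name sgd_svm are assumptions; the projection step of OGD is omitted here for brevity): picking one random term i and scaling its subgradient by n keeps the estimate unbiased.

```python
import numpy as np

def sgd_svm(X, y, lam, T, eta):
    n, d = X.shape
    w, w_sum = np.zeros(d), np.zeros(d)
    rng = np.random.default_rng(3)
    for _ in range(T):
        i = rng.integers(n)                   # expected gradient oracle: one random term
        g = 2 * lam * w                       # gradient of the regularizer
        if 1 - y[i] * (X[i] @ w) > 0:         # subgradient of n * max(0, 1 - y_i <w, x_i>)
            g -= n * y[i] * X[i]
        w -= eta * g
        w_sum += w
    return w_sum / T                          # average of the iterates, as in the analysis

# Assumed toy data with +/-1 labels.
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 5))
y = np.sign(X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]))
print(sgd_svm(X, y, lam=0.1, T=5000, eta=0.001))
```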
VC-dim of combinations of concepts
For k concepts h₁, ..., hₖ and a Boolean function f:
comb_f(h₁, ..., hₖ) = {x ∈ X : f(h₁(x), ..., hₖ(x)) = 1}
Ex: H = linear separators, f = AND or f = Majority.
For a concept class H and a Boolean function f:
COMB_{f,k}(H) = {comb_f(h₁, ..., hₖ) : hᵢ ∈ H}
Lem. If VC-dim(H) = d, then for any f: VC-dim(COMB_{f,k}(H)) = O(kd log(kd)).
VC-dim of combinations of concepts (cont.)
Lem. If VC-dim(H) = d, then for any f: VC-dim(COMB_{f,k}(H)) = O(kd log(kd)).
Proof. Let n = VC-dim(COMB_{f,k}(H)), so there is a set S of n points shattered by COMB_{f,k}(H).
Sauer's lemma ⇒ there are at most n^d ways of labeling S by H.
Each labeling in COMB_{f,k}(H) is determined by k labelings of S by H ⇒ at most (n^d)^k = n^{dk} labelings.
2^n ≤ n^{dk} ⇒ n ≤ dk log₂ n ⇒ n ≤ 2dk log₂(dk).
Back to the batch setting
Classification problem:
Instance space X: {0,1}^d or ℝ^d (feature vectors).
Classification: come up with a mapping X → {0,1}.
Formalization:
Assume there is a probability distribution D over X.
c* = target concept (the set c* ⊆ X of positive instances).
Given labeled i.i.d. samples from D, produce h ⊆ X.
Goal: have h agree with c* over the distribution D.
Minimize err_D(h) = Pr_D[h Δ c*] (the true or generalization error).
Boosting
Strong learner: succeeds with probability ≥ 1 − δ.
Weak learner: succeeds with probability ≥ 1/2 + γ.
Boosting (informal): a weak learner that works under any distribution ⇒ a strong learner.
Idea: run the weak learner A on the sample S under reweightings that focus on misclassified examples.
Boosting (cont.)
H = class of hypotheses produced by A.
Apply the majority rule to h₁, ..., h_{t₀} ∈ H:
VC-dim = O(t₀ · VC-dim(H) · log(t₀ · VC-dim(H))).
Algorithm:
Given S = (x₁, ..., xₙ), set wᵢ = 1 in w = (w₁, ..., wₙ).
For t = 1, ..., t₀ do:
  Call the weak learner on (S, w) → hypothesis hₜ.
  For each misclassified xᵢ, multiply wᵢ by α = (1/2 + γ)/(1/2 − γ).
Output: MAJ(h₁, ..., h_{t₀}).
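A sketch of the reweighting loop above: the weight multiplier α and the majority vote follow the slide, while the decision-stump weak learner, the 0/1 label convention, and all names here are assumptions for illustration.

```python
import numpy as np

def boost(X, y, weak_learner, gamma, t0):
    n = len(y)
    w = np.ones(n)                               # w_i = 1 initially
    alpha = (0.5 + gamma) / (0.5 - gamma)        # multiplier for misclassified points
    hypotheses = []
    for _ in range(t0):
        h = weak_learner(X, y, w)                # call the weak learner on (S, w)
        hypotheses.append(h)
        miss = h(X) != y
        w[miss] *= alpha                         # up-weight misclassified examples
    def majority(Xq):
        votes = np.mean([h(Xq) for h in hypotheses], axis=0)
        return (votes >= 0.5).astype(int)        # MAJ(h_1, ..., h_t0)
    return majority

# Toy weak learner: the best single-feature threshold under the current weights.
def stump(X, y, w):
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = ((sign * (X[:, j] - thr)) >= 0).astype(int)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda Xq, j=j, thr=thr, sign=sign: ((sign * (Xq[:, j] - thr)) >= 0).astype(int)

# Toy usage: the label is 1 exactly when the first feature is positive.
rng = np.random.default_rng(5)
X = rng.standard_normal((300, 4))
y = (X[:, 0] > 0).astype(int)
clf = boost(X, y, stump, gamma=0.1, t0=20)
print("training mistakes:", int((clf(X) != y).sum()))
```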
Boosting: analysis
Def (γ-weak learner on a sample): for labeled examples xᵢ weighted by wᵢ, the weight of correctly classified examples is ≥ (1/2 + γ) Σᵢ wᵢ.
Thm. If A is a γ-weak learner on S, then after t₀ = O((1/γ²) log n) rounds boosting achieves 0 error on S.
Proof. Let m = # mistakes of the final classifier.
Each mistaken example was misclassified ≥ t₀/2 times, so its final weight is ≥ α^{t₀/2}.
Total final weight ≥ m · α^{t₀/2}.
Total weight at step t: W(t+1) ≤ ((1/2 − γ)·α + (1/2 + γ)) · W(t) = (1 + 2γ) · W(t).
Boosting: analysis (cont.)
W(0) = n ⇒ W(t₀) ≤ n(1 + 2γ)^{t₀}.
m · α^{t₀/2} ≤ W(t₀) ≤ n(1 + 2γ)^{t₀}, where α = (1/2 + γ)/(1/2 − γ) = (1 + 2γ)/(1 − 2γ).
So m ≤ n(1 − 2γ)^{t₀/2}(1 + 2γ)^{t₀/2} = n(1 − 4γ²)^{t₀/2} ≤ n·e^{−2γ²t₀} < 1 for t₀ = O((1/γ²) log n).
Comments:
Applies even if the weak learners are adversarial.
VC-dim bounds ⇒ a sample of size roughly (1/ε) · VC-dim(H)/γ² (up to log factors) suffices for generalization.