Overview of Linear Regression in Machine Learning

 
Linear Regression
 
CS771: Introduction to Machine Learning
Nisheeth
 
Linear Regression: Pictorially

Linear regression is like fitting a line or (hyper)plane to a set of points. The line/plane must also predict outputs for the unseen (test) inputs well.

[Figure: with the original single feature (Feature 1), a nonlinear curve is needed; with two features (Feature 1, Feature 2), we can fit a plane (linear).]

What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases? No: we can even fit a curve using a linear model after suitably transforming the input x into features phi(x). The transformation phi(.) can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later.
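To make the transformed-inputs idea concrete, here is a minimal sketch (not from the slides) in Python/NumPy: it fits a model that is linear in its weights to polynomial features phi(x) = [1, x, x^2, x^3] of a single input, so the fitted curve is nonlinear in x. The data and the particular feature choice are made up purely for illustration.

    import numpy as np

    # Synthetic 1-D data with a nonlinear input-output relationship (illustrative only)
    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50)
    y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)

    # Transform the single input x into features phi(x) = [1, x, x^2, x^3]
    Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)   # shape (N, 4)

    # Least-squares fit of a model that is linear in the weights w
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    # Predictions trace a cubic curve in x, yet the model is linear in w
    y_hat = Phi @ w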
 
Simplest Possible Linear Regression Model

This is the base model for all statistical machine learning. Here x is a one-feature data variable and y is the value we are trying to predict. The regression model is

    y = w0 + w1*x + ε

There are two parameters to estimate: the slope of the line w1 and the y-intercept w0. ε is the unexplained, random, or error component.
 
Solving the regression problem

We basically want to find {w0, w1} that minimize deviations from the predictor line. How do we do it?

Iterate over all possible w values along the two dimensions?
Same, but smarter? [next class]
No, we can do this in closed form with just plain calculus.

Very few optimization problems in ML have closed-form solutions; the ones that do are interesting for that reason.
 
Parameter estimation via calculus

We just need to set the partial derivatives of the training loss with respect to w0 and w1 to zero (full derivation) and simplify to obtain the estimates of the two parameters.
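For reference, the standard closed-form result of that calculation for the model y = w0 + w1*x + ε is w1 = Σ_n (x_n - x̄)(y_n - ȳ) / Σ_n (x_n - x̄)^2 and w0 = ȳ - w1*x̄, where x̄ and ȳ are the sample means. A minimal NumPy sketch (function and variable names are mine, not from the slides):

    import numpy as np

    def fit_simple_linear_regression(x, y):
        """Closed-form least-squares estimates of (w0, w1) for y = w0 + w1*x + noise."""
        x_bar, y_bar = x.mean(), y.mean()
        w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
        w0 = y_bar - w1 * x_bar                                            # intercept
        return w0, w1

    # Tiny usage example with made-up numbers
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])
    w0, w1 = fit_simple_linear_regression(x, y)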
More generally

Given training data with N input-output pairs {(x_n, y_n)}, n = 1, ..., N, the goal is to learn a model that predicts the output for new test inputs. Assume the function approximating the input-output relationship is a linear model: y_n ≈ w^T x_n for n = 1, 2, ..., N, or compactly, y ≈ Xw in matrix-vector notation.

The total error or loss of this model over the training data is

    L(w) = Σ_{n=1}^N ℓ(y_n, w^T x_n)

where ℓ(y_n, w^T x_n) measures the prediction error (loss or deviation) of the model on a single training input (x_n, y_n). The goal of learning is to find the w that minimizes this loss and also does well on test data. Unlike models like KNN and DT, here we have an explicit problem-specific objective (loss function) that we wish to optimize for.
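A minimal sketch of this objective in NumPy (the names and the tiny dataset are made up for illustration): the total training loss is just the sum of per-example losses of the linear model w over the N training pairs.

    import numpy as np

    def total_loss(w, X, y, per_example_loss):
        """L(w) = sum over n of loss(y_n, w^T x_n), the training objective above."""
        predictions = X @ w
        return np.sum(per_example_loss(y, predictions))

    # Example: squared loss as the per-example loss, on a tiny made-up dataset
    squared = lambda y_true, y_pred: (y_true - y_pred) ** 2
    X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])   # N = 3 examples, D = 2 features
    y = np.array([2.0, 2.5, 3.5])
    w = np.array([0.5, 0.7])
    print(total_loss(w, X, y, squared))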
Linear Regression with Squared Loss

With the squared loss, the loss function becomes

    L(w) = Σ_{n=1}^N (y_n - w^T x_n)^2 = ||y - Xw||^2 = (y - Xw)^T (y - Xw)

Finding the w that minimizes this is the "least squares" (LS) problem (Gauss-Legendre, 18th century). We need calculus and optimization to do this, but the LS problem can be solved easily and has a closed-form solution (link to a nice derivation):

    w_LS = argmin_w L(w) = (X^T X)^{-1} X^T y

The D×D matrix inversion can be expensive; ways to handle this will be seen later.
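A minimal NumPy sketch of this closed-form solution (X is assumed to be the N×D matrix of training inputs and y the length-N vector of outputs; the names are illustrative). Solving the normal equations with a linear solver avoids forming the explicit inverse:

    import numpy as np

    def least_squares(X, y):
        """w_LS = argmin_w ||y - X w||^2 via the normal equations (X^T X) w = X^T y."""
        A = X.T @ X            # D x D
        b = X.T @ y            # length D
        return np.linalg.solve(A, b)   # solve instead of explicitly inverting X^T X

    # Alternative that is more robust when X^T X is (near-)singular:
    # w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)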
Alternative loss functions

Many possible loss functions can be used for regression problems. The choice of loss function usually depends on the nature of the data; also, some loss functions result in an easier optimization problem than others.

Squared loss, (y_n - f(x_n))^2: very commonly used for regression; leads to an easy-to-solve optimization problem.

Absolute loss, |y_n - f(x_n)|: grows more slowly than squared loss, and is thus better suited when the data has some outliers (inputs on which the model makes large errors).

Huber loss: squared loss for small errors (say, up to ε) and absolute loss for larger errors; good for data with outliers.

ε-insensitive loss (a.k.a. Vapnik loss): zero loss for small errors (say, up to ε) and absolute loss for larger errors. Note: can also use squared loss instead of absolute loss for the larger errors.
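A small sketch of these losses in NumPy, written per example in terms of the error e = y_n - f(x_n); the threshold parameter (called eps here) and the 0.5 scaling in the Huber pieces are conventional choices of mine, not notation from the slides:

    import numpy as np

    def squared_loss(e):
        return e ** 2

    def absolute_loss(e):
        return np.abs(e)

    def huber_loss(e, eps=1.0):
        # Quadratic for |e| <= eps, linear beyond that (constants chosen so the pieces join smoothly)
        quadratic = 0.5 * e ** 2
        linear = eps * (np.abs(e) - 0.5 * eps)
        return np.where(np.abs(e) <= eps, quadratic, linear)

    def eps_insensitive_loss(e, eps=1.0):
        # Zero for |e| <= eps, absolute loss beyond that
        return np.maximum(0.0, np.abs(e) - eps)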
How do we ensure generalization?

We minimized the objective L(w) = Σ_{n=1}^N (y_n - w^T x_n)^2 with respect to w and got w_LS = (X^T X)^{-1} X^T y. Two problems arise. First, the matrix X^T X may not be invertible, which may lead to non-unique solutions for w_LS. Second, overfitting: since we only minimized the loss defined on the training data, the weights w = [w_1, w_2, ..., w_D] may become arbitrarily large in order to fit the training data perfectly, and such weights may perform poorly on the test data.

One solution is to minimize a regularized objective L(w) + λ R(w). The regularizer R(w) measures the magnitude of w and prevents its elements from becoming too large, because we are now minimizing training error plus the magnitude of the weight vector. The regularization hyperparameter λ ≥ 0 controls how much we wish to regularize and needs to be tuned via cross-validation.
Regularized Least Squares (a.k.a. Ridge Regression)

Recall that the regularized objective is of the form L_reg(w) = L(w) + λ R(w). One possible/popular regularizer is the squared Euclidean (ℓ2 squared) norm of w: R(w) = ||w||_2^2 = w^T w. With this regularizer, we have the regularized least squares problem

    w_ridge = argmin_w Σ_{n=1}^N (y_n - w^T x_n)^2 + λ w^T w

Proceeding just like the LS case, we can find the optimal w, which is given by

    w_ridge = (X^T X + λ I_D)^{-1} X^T y

Why is the method called "ridge" regression? Look at the form of the solution: we are adding a small value λ to the diagonals of the D×D matrix X^T X, like adding a ridge/mountain to some land.
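A minimal NumPy sketch of the ridge solution above (X: N×D inputs, y: length-N outputs, lam: the regularization hyperparameter λ; all names are illustrative):

    import numpy as np

    def ridge_regression(X, y, lam=1.0):
        """w_ridge = (X^T X + lam * I_D)^{-1} X^T y, computed without an explicit inverse."""
        D = X.shape[1]
        A = X.T @ X + lam * np.eye(D)   # lam added to the diagonal of X^T X: the "ridge"
        b = X.T @ y
        return np.linalg.solve(A, b)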
A closer look at ℓ2 regularization

Remember: in general, weights with large magnitude are bad since they can cause overfitting on training data and may not work well on test data. The regularized objective we minimized is

    L_reg(w) = Σ_{n=1}^N (y_n - w^T x_n)^2 + λ w^T w

Minimizing L_reg with respect to w gives a solution that keeps the training error small and also has a small ℓ2 (squared) norm. This is good because the individual entries of the weight vector w are then also prevented from becoming too large; small entries in w are good since they lead to smooth models.

To see why, consider two training examples with exactly the same feature vectors except for just one feature that differs by a small amount, but with very different outputs (maybe one of these two training examples is an outlier). In a typical w learned without ℓ2 regularization, the weight on that one feature can become huge (e.g., 10000): just to fit the training data where one of the inputs was possibly an outlier, this weight became too big. Such a weight vector will possibly do poorly on normal test inputs. It is also not a "smooth" model, since its test-data predictions may change drastically even with small changes in some feature's value.
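A tiny synthetic demonstration of this effect (the numbers are made up and are not the ones on the slide): two training inputs that differ in just one feature by a small amount but have very different outputs force the unregularized least-squares weights to blow up, while the ridge solution stays at a reasonable magnitude.

    import numpy as np

    # Two nearly identical inputs (they differ only in the 2nd feature, by 0.001),
    # but with wildly different outputs (e.g., the 2nd example is an outlier)
    X = np.array([[1.0, 1.000],
                  [1.0, 1.001]])
    y = np.array([0.8, 100.0])

    # Ordinary least squares: the weights explode in order to fit both points exactly
    w_ls = np.linalg.solve(X.T @ X, X.T @ y)        # roughly [-99199.2, 99200.0]

    # Ridge regression: the penalty keeps the weights small
    lam = 0.1
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

    print(w_ls, w_ridge)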
Other Ways to Control Overfitting

Use a regularizer R(w) defined by other norms, e.g., the ℓ1 norm regularizer ||w||_1 = Σ_{d=1}^D |w_d|, or the ℓ0 norm regularizer ||w||_0 = #nnz(w), which counts the number of nonzero entries in w. When should these be used instead of the ℓ2 regularizer? Use them if you have a very large number of features but many irrelevant features; these regularizers can help in automatic feature selection. Using such regularizers gives a sparse weight vector w as the solution: many entries in w will be zero or near zero, so those features will be considered irrelevant by the model and will not influence prediction. Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later). A small sketch of these regularizers appears after the next paragraph.

Alternatively, use non-regularization-based approaches: early stopping (stopping training just when we have a decent validation-set accuracy), dropout (in each iteration, don't update some of the weights), or injecting noise into the inputs. All of these are very popular ways to control overfitting in deep learning models; more on these later when we talk about deep learning.
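As promised above, a small sketch of the regularizers themselves (my own code, not from the slides); note that actually minimizing an objective that uses the ℓ1 or ℓ0 regularizer requires the more advanced techniques the slide alludes to, not the simple closed form available for ridge:

    import numpy as np

    def l2_squared(w):
        return np.sum(w ** 2)            # ridge regularizer ||w||_2^2

    def l1_norm(w):
        return np.sum(np.abs(w))         # encourages sparse (mostly zero) weights

    def l0_norm(w, tol=1e-12):
        return np.sum(np.abs(w) > tol)   # counts nonzero entries, #nnz(w)

    w = np.array([0.0, 2.5, 0.0, -1.0])
    print(l2_squared(w), l1_norm(w), l0_norm(w))   # 7.25, 3.5, 2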
Linear Regression as Solving System of Linear Eqs

The form of the linear regression model, y ≈ Xw, is akin to a system of linear equations. Assuming N training examples with D features each, we have

    First training example:  y_1 = x_11 w_1 + x_12 w_2 + ... + x_1D w_D
    Second training example: y_2 = x_21 w_1 + x_22 w_2 + ... + x_2D w_D
    ...
    N-th training example:   y_N = x_N1 w_1 + x_N2 w_2 + ... + x_ND w_D

where x_nd denotes the d-th feature of the n-th training example. This is a system of N equations in D unknowns (w_1, w_2, ..., w_D). However, in regression we rarely have N = D; rather N > D or N < D, so the system is overdetermined (N > D) or underdetermined (N < D). Methods for solving over/underdetermined systems can be used for linear regression as well, and many of them don't require expensive matrix inversion. Equivalently, solving lin-reg as a system of linear equations: the least-squares solution w = (X^T X)^{-1} X^T y solves Aw = b, where A = X^T X and b = X^T y, a system of D equations in D unknowns. Now solve this!
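A short sketch of the "now solve this" step in NumPy (the data here is random and purely illustrative): form A = X^T X and b = X^T y and hand the D×D system Aw = b to a linear solver, or skip the normal equations and let a least-squares routine handle the over/underdetermined system directly.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 5                       # overdetermined case: N > D
    X = rng.standard_normal((N, D))
    y = rng.standard_normal(N)

    # Option 1: solve the D x D normal-equations system A w = b
    A, b = X.T @ X, X.T @ y
    w = np.linalg.solve(A, b)

    # Option 2: a least-squares solver works for over- and underdetermined systems
    # and does not require an explicit matrix inversion
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)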
 
Next Lecture

Solving linear regression using iterative optimization methods, which are faster and don't require matrix inversion; also a brief intro to optimization techniques.
 
 

