Introduction to Statistical Estimation in Machine Learning

ECE 5984: Introduction to Machine Learning
Dhruv Batra
Virginia Tech
Topics:
Statistical Estimation (MLE, MAP, Bayesian)
Readings: Barber 8.6, 8.7
Administrativia
HW0
Solutions available
HW1
Due: Sun 02/15, 11:55pm
http://inclass.kaggle.com/c/VT-ECE-Machine-Learning-HW1
Project Proposal
Due: Tue 02/24, 11:55pm
<= 2 pages, NIPS format
 
Recap from last time
Procedural View
Training Stage:
Raw Data → x  (Feature Extraction)
Training Data { (x, y) } → f  (Learning)
Testing Stage:
Raw Data → x  (Feature Extraction)
Test Data x → f(x)  (Apply function, Evaluate error)
Statistical Estimation View
Probabilities to the rescue:
x and y are random variables
D = {(x1, y1), (x2, y2), …, (xN, yN)} ~ P(X, Y)
IID: Independent Identically Distributed
Both training & testing data sampled IID from P(X,Y)
Learn on training set
Have some hope of generalizing to the test set
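Not in the original slides: a minimal Python sketch of this view. The joint distribution P(X, Y), the noisy threshold labels, and the one-parameter classifier are all made up for illustration; the point is that training and test data are drawn IID from the same P(X, Y), which is what makes training error informative about test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint(n):
    """Draw n IID samples (x, y) from a made-up joint P(X, Y)."""
    x = rng.normal(0.0, 1.0, size=n)                          # P(X): standard normal
    y = (x + rng.normal(0.0, 0.5, size=n) > 0).astype(int)    # P(Y | X): noisy threshold
    return x, y

# Training and test data sampled IID from the same P(X, Y)
x_train, y_train = sample_joint(1000)
x_test,  y_test  = sample_joint(1000)

# "Learning": pick the threshold t minimizing training error for f(x) = 1[x > t]
thresholds = np.linspace(-2, 2, 201)
train_errors = [np.mean((x_train > t).astype(int) != y_train) for t in thresholds]
t_best = thresholds[int(np.argmin(train_errors))]

# Because test data comes from the same P(X, Y), training error predicts test error
test_error = np.mean((x_test > t_best).astype(int) != y_test)
print(f"threshold {t_best:.2f}, train error {min(train_errors):.3f}, test error {test_error:.3f}")
```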
Interpreting Probabilities
What does P(A) mean?
Frequentist View
P(A) = lim (N → ∞) #(A is true) / N
limiting frequency of a repeating non-deterministic event
Bayesian View
P(A) is your “belief” about A
Market Design View
P(A) tells you how much you would bet
Concepts
Marginal distributions / Marginalization
Conditional distribution / Chain Rule
Bayes Rule
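A small numeric sketch of these three operations (my illustration, not from the slides), using an arbitrary 2×2 joint distribution over binary X and Y:

```python
import numpy as np

# Made-up joint P(X, Y) over binary X (rows) and Y (columns); entries sum to 1
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Marginalization: P(X) = sum_y P(X, Y)
P_x = P_xy.sum(axis=1)

# Conditional / chain rule: P(Y | X) = P(X, Y) / P(X), i.e. P(X, Y) = P(Y | X) P(X)
P_y_given_x = P_xy / P_x[:, None]

# Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y)
P_y = P_xy.sum(axis=0)
P_x_given_y = (P_y_given_x * P_x[:, None]) / P_y[None, :]

print("P(X)   =", P_x)
print("P(Y|X) =\n", P_y_given_x)
print("P(X|Y) =\n", P_x_given_y)
```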
Concepts
Likelihood
How well does a certain hypothesis explain the data?
Prior
What do you believe before seeing any data?
Posterior
What do you believe after seeing the data?
KL-Divergence / Relative Entropy
Slide Credit: Sam Roweis
Plan for Today
Statistical Learning
Frequentist Tool
Maximum Likelihood
Bayesian Tools
Maximum A Posteriori
Bayesian Estimation
Simple examples (like coin toss)
But SAME concepts will apply to sophisticated problems.
Your first probabilistic learning algorithm
 
After taking this ML class, you drop out of VT and join
an illegal betting company.
 
Your new boss asks you:
If Novak Djokovic & Rafael Nadal play tomorrow, will Nadal win or lose (W/L)?
 
You say: what happened in the past?
W, L, L, W, W
 
You say: P(Nadal Wins) = …
 
Why?
 
Slide Credit: Yaser Abu-Mostafa
Maximum Likelihood Estimation
 
Goal: Find a good θ
 
What’s a good θ?
One that makes it likely for us to have seen this data
Quality of θ = Likelihood(θ; D) = P(data | θ)
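A minimal sketch (not from the slides) for the Nadal data: with IID Bernoulli outcomes the likelihood is L(θ; D) = θ^#W (1 − θ)^#L, and maximizing it over θ recovers the count-based answer #W / N.

```python
import numpy as np

data = np.array([1, 0, 0, 1, 1])  # W, L, L, W, W encoded as 1/0

def log_likelihood(theta, data):
    """log P(data | theta) for IID Bernoulli(theta) outcomes."""
    n_wins = data.sum()
    n_losses = len(data) - n_wins
    return n_wins * np.log(theta) + n_losses * np.log(1 - theta)

# Maximize the likelihood over a grid of candidate thetas
thetas = np.linspace(0.01, 0.99, 99)
theta_mle = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]
print(theta_mle)    # ~0.6
print(data.mean())  # closed form: #W / N = 3/5
```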
Sufficient Statistic
 
 
 
 
D1 = {1, 1, 1, 0, 0, 0}
D2 = {1, 0, 1, 0, 1, 0}

A function of the data ϕ(Y) is a sufficient statistic if the following is true:
Σ_{i ∈ D1} ϕ(y_i) = Σ_{i ∈ D2} ϕ(y_i)  ⟹  L(θ; D1) = L(θ; D2)
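For Bernoulli data the count of 1s is such a ϕ: D1 and D2 above contain the same number of wins and losses, so their likelihood functions are identical even though the sequences differ. A small check (my illustration, not from the slides):

```python
import numpy as np

def likelihood(theta, data):
    """L(theta; D) for IID Bernoulli data."""
    k = sum(data)           # sufficient statistic: number of 1s
    n = len(data)
    return theta**k * (1 - theta)**(n - k)

D1 = [1, 1, 1, 0, 0, 0]
D2 = [1, 0, 1, 0, 1, 0]

thetas = np.linspace(0, 1, 11)
# Same sufficient statistic (three 1s out of six) => identical likelihoods for every theta
assert np.allclose([likelihood(t, D1) for t in thetas],
                   [likelihood(t, D2) for t in thetas])
print("L(theta; D1) == L(theta; D2) for all theta")
```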
Why Max-Likelihood?
 
Leads to “natural” estimators
 
MLE is optimal if the model class is correct
Maximizing the log-likelihood is the same as minimizing cross-entropy
Cross-entropy relates to KL divergence (see the derivation below)
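Spelling this out (a standard derivation, not reproduced from the slides): let p̂ denote the empirical distribution of the data and p_θ the model. Then

```latex
\frac{1}{N}\log L(\theta; D)
  = \frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i)
  = \sum_{x}\hat{p}(x)\log p_\theta(x)
  = -H(\hat{p},\, p_\theta),
\qquad
H(\hat{p},\, p_\theta) = H(\hat{p}) + \mathrm{KL}(\hat{p}\,\|\,p_\theta)
```

so maximizing the average log-likelihood is the same as minimizing the cross-entropy H(p̂, p_θ), and since H(p̂) does not depend on θ, it is also the same as minimizing KL(p̂ ‖ p_θ).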
How many flips do I need?
Joke Credit: Carlos Guestrin
 
 
 
 
 
Boss says: Last year:
3 heads/wins-for-Nadal
2 tails/losses-for-Nadal.
You say: θ = 3/5, I can prove it!
 
He says: What if
30 heads/wins-for-Nadal
20 tails/losses-for-Nadal.
You say: Same answer, I can prove it!
 
He says: What's better?
You say: Humm… The more the merrier???
He says: Is this why I am paying you the big bucks???
Bayesian Estimation
Boss says: What if I know Nadal is a better player on clay courts?
You say: Bayesian it is, then…
Priors
 
What are priors?
Express beliefs before experiments are conducted
Computational ease: lead to “good” posteriors
Help deal with unseen data
Regularizers: More about this in later lectures
 
Conjugate Priors
A prior is conjugate to the likelihood if the resulting posterior is in the same family as the prior
Gives a closed-form representation of the posterior
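A minimal sketch (mine, not the slides') of the conjugate pair used in the rest of the lecture, a Beta prior with a Bernoulli likelihood: a Beta(a, b) prior updated with observed wins and losses gives a Beta(a + #W, b + #L) posterior, so the posterior has the same form as the prior and a closed-form expression.

```python
def beta_bernoulli_posterior(prior_a, prior_b, data):
    """Conjugate update: Beta(a, b) prior + Bernoulli observations -> Beta posterior.

    Each win (1) increments a, each loss (0) increments b, so the posterior
    is again a Beta distribution -- no integration needed.
    """
    wins = sum(data)
    losses = len(data) - wins
    return prior_a + wins, prior_b + losses

# Prior Beta(2, 2) and the five observed matches W, L, L, W, W
a_post, b_post = beta_bernoulli_posterior(2, 2, [1, 0, 0, 1, 1])
print(f"posterior = Beta({a_post}, {b_post})")          # Beta(5, 4)
print("posterior mean =", a_post / (a_post + b_post))   # 5/9
```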
Beta prior distribution – P(θ)
Demo:
http://demonstrations.wolfram.com/BetaDistribution/
Slide Credit: Carlos Guestrin
 
Benefits of conjugate priors
MAP for Beta distribution
 
 
 
 
 
MAP: use the most likely parameter (the posterior mode): θ_MAP = argmax_θ P(θ | D)
Beta prior equivalent to extra W/L matches
As N → ∞, the prior is "forgotten"
But, for small sample sizes, the prior is important!
Slide Credit: Carlos Guestrin
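A sketch of these claims (my assumptions: a Beta(a, b) prior with a, b > 1, so the posterior mode is (#W + a − 1)/(N + a + b − 2)): the prior acts like a − 1 extra wins and b − 1 extra losses, dominates for small N, and is forgotten as N → ∞, where the MAP estimate approaches the MLE #W/N.

```python
def map_estimate(wins, losses, a=2, b=2):
    """Mode of the Beta(a + wins, b + losses) posterior; assumes a + wins > 1 and b + losses > 1."""
    return (wins + a - 1) / (wins + losses + a + b - 2)

def mle_estimate(wins, losses):
    return wins / (wins + losses)

# Same 3:2 win ratio at increasing sample sizes: the Beta(2, 2) prior matters
# a lot at N = 5 and is essentially forgotten by N = 5000.
for wins, losses in [(3, 2), (30, 20), (3000, 2000)]:
    print(wins + losses,
          "MLE:", round(mle_estimate(wins, losses), 4),
          "MAP:", round(map_estimate(wins, losses), 4))
```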
Effect of Prior
Prior = Beta(2,2), so θ_prior = 0.5
Dataset = {H}
L(θ) = θ, so θ_MLE = 1
Posterior = Beta(3,2), so θ_MAP = (3-1)/(3+2-2) = 2/3
What you need to know
Statistical Learning:
Maximum likelihood
Why MLE?
Sufficient statistics
Maximum a posteriori
Bayesian estimation (return an entire distribution)
Priors, posteriors, conjugate priors
Beta distribution (conjugate prior of the Bernoulli)