Online Learning in Machine Learning

 
Online Learning: An Introduction
 
Travis Mandel
University of Washington
Overview of Machine Learning
 
Supervised Learning
Decision Trees, Naïve Bayes,
Linear Regression, K-NN, etc.
Unsupervised Learning
K-means, Association Rule
Learning, etc.
 
What else is there????
Overview of Machine Learning

Supervised Learning
(Diagram) Labeled training data → ML algorithm → classifier / regression function

Unsupervised Learning
(Diagram) Unlabeled training data → ML algorithm → clusters / rules

Is this the best way to learn?
Overview of Machine Learning
 
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Active Learning
Interactive Learning
Online Learning
Reinforcement Learning
Online Classification
(Diagram) Observe one training instance → ML algorithm issues a prediction → observes label, receives loss/reward
 
More generally… Online Learning!
(Diagram) Observe observations → ML algorithm selects output → observe loss/reward function → receive loss/reward
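To make the loop concrete, here is a minimal Python sketch of this protocol; the function names and callback interface are illustrative assumptions, not part of any particular library.

```python
# A minimal sketch of the generic online-learning loop (hypothetical interface).
def run_online(predict, update, get_observation, get_reward, num_rounds):
    """predict/update belong to the learner; get_observation/get_reward to the environment."""
    total_reward = 0.0
    for t in range(num_rounds):
        obs = get_observation(t)        # observe observations
        out = predict(obs)              # ML algorithm selects an output
        r = get_reward(t, out)          # loss/reward function is revealed
        update(obs, out, r)             # learner receives the loss/reward and adapts
        total_reward += r
    return total_reward
```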
 
Typical Assumption: Statistical Feedback
 
Instances and labels drawn from a fixed distribution D
 
Recall: Distribution

drug?  nigeria?  Spam?  Prob
 N       N        N     0.5
 N       N        Y     0.1
 N       Y        N     0.05
 N       Y        Y     0.1
 Y       N        N     0.05
 Y       N        Y     0.1
 Y       Y        N     0.0
 Y       Y        Y     0.1
Alternate Assumption: Adversarial
Feedback
 
Instances and labels drawn adversarially
 
Spam detection, anomaly detection,
 
etc.
 
Change over time a HUGE Issue!
 
 
 
 
Change over time example
 
 
 
How can we measure
performance?
 
What we want to maximize: Sum of
rewards
 
Why could this be a problem?
Alternative: Discounted Sum of Rewards

Undiscounted: r_0 + r_1 + r_2 + r_3 + …
Discounted:   r_0 + γ·r_1 + γ²·r_2 + γ³·r_3 + …

Guaranteed to be a finite sum if γ < 1 and rewards are bounded (r < c)!
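As a quick illustration, a small Python sketch (with made-up rewards and discount factor) comparing the two objectives:

```python
# Plain sum vs. discounted sum of rewards (illustrative values).
rewards = [1.0, 0.0, 1.0, 1.0]   # hypothetical per-step rewards
gamma = 0.9                      # discount factor, gamma < 1

plain_sum = sum(rewards)
discounted_sum = sum(gamma**t * r for t, r in enumerate(rewards))
print(plain_sum, discounted_sum)   # 3.0  vs.  1.0 + 0.81 + 0.729 = 2.539
```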
 
One problem…
 
If, a small percentage of the time, you never find the best option (incomplete learning), it doesn't matter much for the objective!
 
But intuitively it should matter a LOT!
Regret
 
Sum of rewards compared to the BEST choice in hindsight!

Regret_T = max_i Σ_{t=1..T} r_{t,i} - Σ_{t=1..T} r_t   (r_{t,i} = reward of choice i at step t; r_t = reward the algorithm actually received)
If training a linear model, the comparator is the optimal weights in hindsight (beware overfitting)!
See other examples in a second
 
We can get good regret even if we perform very poorly in absolute terms
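To make the definition concrete, here is a small Python sketch (reward numbers and choices are made up) that computes regret against the best single choice in hindsight:

```python
import numpy as np

# rewards[t, i] = reward choice i would have earned at step t (illustrative values)
rewards = np.array([[1.0, 0.2],
                    [0.8, 0.4],
                    [0.9, 0.1]])
chosen = [1, 0, 1]                      # indices the algorithm actually picked

algo_total = sum(rewards[t, i] for t, i in enumerate(chosen))   # 0.2 + 0.8 + 0.1 = 1.1
best_in_hindsight = rewards.sum(axis=0).max()                   # choice 0: 2.7
regret = best_in_hindsight - algo_total                         # 1.6
```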
We really like theoretical
guarantees!
 
Anyone know why?
 
Running online → uncertain future
Hard to tell if you've reached the best possible performance just by looking at the data
 
Recall…
(Diagram) Observe observations → ML algorithm selects output → observe loss/reward function → receive loss/reward
Experts Setting

N experts
Assume rewards in [0,1]

(Diagram) Observe expert recommendations → ML algorithm selects an expert → observe loss/reward for each expert → receive loss/reward
Regret in the Experts Setting
 
Do as well as the best expert in
hindsight!
 
Hopefully you have (at least some)
good experts!
“Choose an Expert?”
 
Each expert could make a
prediction (different ML models,
classifiers, hypotheses, etc.)
 
Each expert could tell us to perform
an action (such as buy a stock)
 
Experts can learn & change over
time!
Simple strategies
 
Pick Expert that’s done the best in
the past?
 
Pick Expert that’s done the best
recently?
No! Adversarial!
 
For ANY deterministic strategy…
 
The adversary knows which expert we will choose, and can give it low reward/high loss while giving all the others high reward/low loss!

Every step we get further away from the best expert → linear regret
Solution: Randomness
 
What if the algorithm has access to
some random bits?
 
We can do better!
 
EG: Exponentiated Gradient

Initialize w_{1,i} = 1 for every expert i
Multiplicative update: w_{t,i} = w_{t-1,i} · exp(η · r_{t-1,i}), with learning rate η (e.g. η = √(2·log(N)/T))

Probabilities are the normalized weights: p_{t,i} = w_{t,i} / Σ_j w_{t,j}
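A minimal Python sketch of this exponential-weights scheme for the experts setting; the function name, the sampling step, and the variable names are illustrative assumptions:

```python
import math
import random

def exponentiated_gradient(reward_rows, eta):
    """reward_rows: list of per-round reward vectors, one entry per expert (values in [0, 1])."""
    n = len(reward_rows[0])
    weights = [1.0] * n                      # w_1 = (1, ..., 1)
    total_reward = 0.0
    for rewards in reward_rows:
        z = sum(weights)
        probs = [w / z for w in weights]     # probabilities = normalized weights
        i = random.choices(range(n), weights=probs)[0]   # randomly pick an expert
        total_reward += rewards[i]
        # Full-information feedback: every expert's reward is revealed.
        weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    return total_reward
```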
 
EG: Regret Bound

Regret = O(√(2·T·log N))
Problem: What if the feedback is
a little different?
 
In the real world, we usually don’t
get to see what “would have”
happened
Shift from predicting future to
taking actions in an unknown
world
Simplest Reinforcement Learning
problem
Multi-Armed Bandit (MAB) Problem

N arms
Assume rewards in [0,1]
Assume statistical feedback

(Diagram) ML algorithm pulls an arm → observes loss/reward for the chosen arm only → suffers that loss/reward
Regret in the MAB Setting
 
Since the feedback is stochastic, each arm has a true mean

Do as well as if we had always pulled the arm with the highest mean
MABs have a huge number of
applications!
Internet Advertising
Website Layout
Clinical Trials
Robotics
Education
Videogames
Recommender Systems
 
Unlike most of Machine Learning, this is about how to ACT, not just how to predict!
 
A surprisingly tough problem…
 
“The problem is a classic one; it was
formulated during the war, and efforts to
solve it so sapped the energies and minds of
Allied analysts that the suggestion was made
that the problem be dropped over Germany,
as the ultimate instrument of intellectual
sabotage.”

Prof. Peter Whittle, Journal of the Royal Statistical Society, 1979
MAB algorithm #1: ε-first
 
Phase 1: Explore arms uniformly at
random
Phase 2: Exploit best arm so far
Only 1 parameter: how long to
wait!
 
Problem?
Not very efficient
Makes statistics easy to run
MAB algorithm #2: ε-greedy

Flip a coin:
With probability ε, explore uniformly
With probability 1-ε, exploit

Only 1 parameter: ε
 
Problem?
Often used in robotics and
reinforcement learning
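A minimal Python sketch of ε-greedy on simulated Bernoulli arms; the helper names and the simulated arm probabilities are illustrative assumptions, not from the slides:

```python
import random

def epsilon_greedy(pull_arm, n_arms, epsilon, num_rounds):
    """pull_arm(i) returns the observed reward for arm i."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(num_rounds):
        if random.random() < epsilon:
            i = random.randrange(n_arms)                     # explore uniformly
        else:
            i = max(range(n_arms), key=lambda a: means[a])   # exploit best estimate
        r = pull_arm(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]               # incremental mean update
    return means

# Example: three simulated Bernoulli arms with (made-up) success probabilities.
true_p = [0.3, 0.5, 0.7]
estimates = epsilon_greedy(lambda i: float(random.random() < true_p[i]),
                           n_arms=3, epsilon=0.1, num_rounds=10_000)
```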
Idea
 
The problem is uncertainty…
How do we quantify uncertainty? Error bars!

If an arm has been sampled n times, then with probability at least 1 - δ:
|μ̂ - μ| < √( log(2/δ) / (2n) )
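As a small sketch (the function name and example numbers are assumptions of mine), the half-width of this error bar can be computed directly:

```python
import math

def hoeffding_radius(n, delta):
    """Half-width of the confidence interval after n samples, with failure probability delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

print(hoeffding_radius(n=100, delta=0.05))   # about 0.136
```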
 
Given Error bars, how do we act?
Given Error bars, how do we act?
 
Optimism under uncertainty!
Why?  If bad, we will soon find out!
One last wrinkle

How do we set the confidence δ? Decrease it over time: with δ = 2/t⁴, the error bar becomes √( 2·log(t) / n )
Upper Confidence Bound (UCB)

1. Play each arm once

2. Play the arm i that maximizes:
   μ̂_i + √( 2·log(t) / n_i )

3. Repeat step 2 forever
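A minimal Python sketch of UCB following the three steps above; the `pull_arm` callback and the simulation at the end are illustrative assumptions:

```python
import math
import random

def ucb(pull_arm, n_arms, num_rounds):
    """pull_arm(i) returns a reward in [0, 1] for arm i."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    # 1. Play each arm once.
    for i in range(n_arms):
        counts[i] = 1
        means[i] = pull_arm(i)
    # 2./3. Repeatedly play the arm with the highest upper confidence bound.
    for t in range(n_arms + 1, num_rounds + 1):
        i = max(range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull_arm(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return means

# Example with simulated Bernoulli arms (probabilities made up for illustration).
true_p = [0.3, 0.5, 0.7]
print(ucb(lambda i: float(random.random() < true_p[i]), n_arms=3, num_rounds=10_000))
```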
 
UCB Regret Bound

Problem-independent: O(√(N·T·log T))

Problem-dependent: O( N·log(T) / Δ_min ), where Δ_min is the smallest gap between the best arm's mean and any other arm's mean

Can't do much better!
A little history…
 
William R. Thompson (1933): the first to examine the MAB problem; proposed a method for solving it

1940s-50s: the MAB problem is studied intensively during WWII; Thompson ignored

1970s-1980s: an "optimal" solution (the Gittins index) is found, but it is intractable and incomplete. Thompson ignored.

2001: UCB proposed; gains widespread use due to its simplicity and "optimal" bounds. Thompson still ignored.

2011: Empirical results show Thompson's 1933 method beats UCB, but there is little interest since it has no guarantees.

2013: Optimal bounds finally shown for Thompson Sampling
 
Thompson’s method was
fundamentally different!
Bayesian vs Frequentist
Bayesians: You have a prior,
probabilities interpreted as beliefs,
prefer probabilistic decisions
Frequentists: No prior, probabilities
interpreted as facts about the world,
prefer hard decisions (p<0.05)
 
UCB is a frequentist technique! What if we are Bayesian?
Bayesian review: Bayes' Rule

P(θ | data) = P(data | θ) · P(θ) / P(data)

posterior ∝ likelihood × prior
Bernoulli Case
 
What if rewards are in the set {0,1} instead of the range [0,1]?

Then we flip a coin with probability p → a Bernoulli Distribution!

To estimate p, we count up the numbers of ones and zeros

Given the observed ones and zeros, how do we calculate the distribution of possible values of p?
 
Beta-Bernoulli Case

Beta(a,b) → given a 0's and b 1's, what is the distribution over means?

Prior → pseudocounts

Likelihood → observed counts

Posterior → pseudocounts + observed counts
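A small Python sketch of the Beta-Bernoulli update (using the common convention that the first Beta parameter counts 1's and the second counts 0's; the data are made up):

```python
import numpy as np

# Prior pseudocounts: Beta(1, 1) is the uniform prior over p.
alpha, beta = 1.0, 1.0

# Observed coin flips (made-up data): 7 ones and 3 zeros.
observations = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
alpha += sum(observations)                     # pseudocounts + observed 1's
beta += len(observations) - sum(observations)  # pseudocounts + observed 0's

posterior_mean = alpha / (alpha + beta)        # (1+7)/(2+10) = 0.667
samples = np.random.beta(alpha, beta, size=5)  # draws from the posterior over p
```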
 
How does this help us?
 
Thompson Sampling:
 
1. Specify prior (in Beta case often Beta(1,1))
 
2. Sample from each posterior distribution to get
 
estimated mean for each arm.
 
3. Pull arm with highest mean.
 
4. Repeat steps 2 & 3 forever
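A minimal Python sketch of Thompson Sampling for Bernoulli arms, following the four steps above; the simulation setup and variable names are illustrative assumptions:

```python
import numpy as np

def thompson_sampling(pull_arm, n_arms, num_rounds):
    """pull_arm(i) returns a 0/1 reward for arm i."""
    # 1. Specify the prior: Beta(1, 1) for every arm.
    alpha = np.ones(n_arms)   # pseudocounts of 1's
    beta = np.ones(n_arms)    # pseudocounts of 0's
    for t in range(num_rounds):
        # 2. Sample an estimated mean for each arm from its posterior.
        sampled_means = np.random.beta(alpha, beta)
        # 3. Pull the arm with the highest sampled mean.
        i = int(np.argmax(sampled_means))
        r = pull_arm(i)
        # Update that arm's posterior counts.
        alpha[i] += r
        beta[i] += 1 - r
    return alpha / (alpha + beta)   # posterior mean estimate per arm

# Example with simulated Bernoulli arms (probabilities made up for illustration).
true_p = [0.3, 0.5, 0.7]
print(thompson_sampling(lambda i: int(np.random.random() < true_p[i]),
                        n_arms=3, num_rounds=10_000))
```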
 
Thompson Empirical Results
 
It has also been shown to have optimal regret bounds, just like (and in some cases a little better than) UCB!
Problems with standard MAB formulation

What happens if adversarial? (studied elsewhere)
What if arms are factored and share values?
What happens if there is delay?
What if we know a heuristic that we want to incorporate without harming guarantees?
What if we have a prior dataset we want to use?
What if we want to evaluate algorithms/parameters on previously collected data?

A look toward more complexity:
Reinforcement Learning
 
A sequence of actions, instead of taking one action and resetting
Rich space of observations which
should influence our choice of
action
 
Much more powerful, but a much
harder problem!
 
Summary
 
Online Learning
Adversarial vs Statistical
Feedback
Regret
Experts Problem
EG algorithm
Multi-Armed Bandits
ε-greedy, ε-first
UCB
Thompson Sampling
 
Questions?
