Deep Reinforcement Learning Overview and Applications

DEEP REINFORCEMENT LEARNING
ON THE ROAD TO SKYNET!

UW CSE Deep Learning – Felix Leeb
 
OVERVIEW

TODAY
- MDPs – formalizing decisions
- Function Approximation
- Value Function – DQN
- Policy Gradients – REINFORCE, NPG
- Actor Critic – A3C, DDPG

NEXT TIME
- Model Based RL – forward/inverse
- Planning – MCTS, MPPI
- Imitation Learning – DAgger, GAIL
- Advanced Topics – Exploration, MARL, Meta-learning, LMDPs…
 
Paradigm                  Objective
Supervised Learning       Classification, Regression
Unsupervised Learning     Inference, Generation
Reinforcement Learning    Prediction, Control

(Example applications for each paradigm were shown as images.)
 
Prediction vs. Control (contrasted with example figures)
 
SETTING

The agent interacts with the environment in a loop: the environment provides a state/observation and a reward, and the agent responds with an action chosen using its policy.

MARKOV DECISION PROCESSES

This setting is formalized as an MDP with a state space, an action space, a transition function, and a reward function.
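For reference, a minimal sketch of the usual formalization behind these four ingredients (standard notation, not copied from the slides):

\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a), \qquad R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}

The Markov property is that the next state depends only on the current state and action, not on the rest of the history.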
 
DISCOUNT FACTOR

- We want to be greedy but not impulsive.
- Implicitly takes uncertainty in the dynamics into account.
- Mathematically: γ < 1 allows infinite-horizon returns.

Return:
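The return referred to above is typically written as the discounted sum of future rewards (standard notation, not taken from the slide):

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots

For γ < 1 and bounded rewards this geometric series converges, which is what makes infinite-horizon returns well defined.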
 
SOLVING AN MDP

Objective:
Goal:
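A standard way to write this objective and goal (my notation, consistent with the return defined above):

J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big], \qquad \pi^* = \arg\max_{\pi} J(\pi)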
 
VALUE FUNCTIONS

- Value = expected gain of a state.
- Q function – action-specific value function.
- Advantage function – how much more valuable is an action.
- Value depends on future rewards, which in turn depend on the policy.
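Standard definitions consistent with the descriptions above:

V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big]
Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s,\, a_t = a\Big]
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)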
 
TABULAR SOLUTION: POLICY ITERATION

Alternate between:
- Policy Evaluation
- Policy Update

(A code sketch of this loop follows below.)
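A minimal tabular policy-iteration sketch, assuming a small finite MDP given as NumPy arrays P (transition probabilities) and R (expected rewards); the function and variable names are illustrative, not from the lecture:

import numpy as np

def policy_iteration(P, R, gamma=0.99):
    # P: (S, A, S) transition probabilities, R: (S, A) expected rewards
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly
        P_pi = P[np.arange(S), pi]           # (S, S) transitions under pi
        R_pi = R[np.arange(S), pi]           # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy update: act greedily with respect to the resulting Q values
        Q = R + gamma * P @ V                # (S, A) action values
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):       # no change => policy has converged
            return pi, V
        pi = new_pi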
 
Q LEARNING
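For reference, the standard tabular Q-learning update, with learning rate α:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]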
 
FUNCTION APPROXIMATION

- Model: a parameterized Q function.
- Training data: observed transitions.
- Loss function: see the standard form below.
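A standard form of this loss for Q-learning with function approximation (my reconstruction; θ are the model parameters and θ⁻ a periodically copied "replica" used to compute the target):

L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[ \big( y - Q_\theta(s, a) \big)^2 \Big], \qquad \text{where } y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')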
 
IMPLEMENTATION

Action-in vs. action-out architectures (the action as an input to the Q network vs. one output per action).

Off-Policy Learning
- The target depends in part on our model, so old observations are still useful.
- Use a Replay Buffer of the most recent transitions as the dataset (a sketch follows below).
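A minimal replay-buffer sketch along these lines (illustrative; the class and method names are my own):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a minibatch of stored transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)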
 
DEEP Q NETWORKS (DQN)

Mnih et al. (2015)
 
DQN ISSUES

Convergence is not guaranteed – hope for deep magic! Common stabilizers:
- Replay Buffer
- Error Clipping
- Reward scaling
- Using replicas (a separate target network)
- Double Q Learning – decouple action selection and value estimation (see the target below)
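One standard way to write the double Q-learning target, where the online network selects the action and the replica evaluates it (notation mine):

y = r + \gamma\, Q_{\theta^-}\big(s',\, \arg\max_{a'} Q_\theta(s', a')\big)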
 
POLICY GRADIENTS

- Parameterize the policy and update those parameters directly.
- Enables new kinds of policies: stochastic, continuous action spaces.
- On-policy learning: learn directly from your actions.
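For reference, the resulting policy-gradient expression in its standard form (not reproduced from the slide):

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big]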
 
POLICY GRADIENTS

Approximate the expectation value from samples, i.e., from trajectories rolled out with the current policy.
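A standard Monte Carlo estimate of this expectation from N sampled trajectories, using the observed returns G_t (my notation):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, G_t^i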
 
REINFORCE

Sutton et al. (2000)
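A compact REINFORCE sketch under common assumptions (PyTorch; a discrete-action, old-style Gym environment where reset returns an observation and step returns a 4-tuple; all names are illustrative):

import torch

def reinforce_episode(policy, optimizer, env, gamma=0.99):
    # Roll out one episode with the current stochastic policy
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # Compute the discounted return G_t for every time step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    # Gradient ascent on sum_t log pi(a_t | s_t) * G_t (minimize the negative)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()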
 
VARIANCE REDUCTION

- Constant offsets in the return make it harder to distinguish the right update direction.
- Remove the offset by subtracting the a priori value of each state.
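In the standard formulation this means subtracting a state-dependent baseline b(s), usually an estimate of the state value, which leaves the gradient unbiased:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big( G_t - b(s_t) \big) \big], \qquad b(s_t) \approx V^\pi(s_t)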
 
ADVANCED POLICY GRADIENT METHODS

- For stochastic functions, the plain gradient is not the best direction.
- Consider the KL divergence between successive policies:
  - NPG – approximating the Fisher information matrix
  - TRPO – computing gradients under a KL constraint
  - PPO – gradients with a KL penalty
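A sketch of the update behind these methods, with F the Fisher information matrix and δ a trust-region size (standard form, not copied from the slides):

\max_{\theta'}\; \nabla_\theta J(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta'}\big) \le \delta
\;\Rightarrow\; \theta' = \theta + \alpha\, F^{-1} \nabla_\theta J(\theta), \qquad F = \mathbb{E}\big[ \nabla_\theta \log \pi_\theta\, \nabla_\theta \log \pi_\theta^\top \big]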
 
ADVANCED POLICY GRADIENT METHODS

Rajeswaran et al. (2017); Heess et al. (2017)
 
ACTOR CRITIC

- Critic: estimates the advantage, trained with a Q-learning style update.
- Actor: proposes actions, trained with a policy gradient update.
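One common way to write the two updates, using the TD error as an advantage estimate (an advantage actor-critic variant; notation mine):

\hat{A}_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)
\theta \leftarrow \theta + \alpha_{\text{actor}}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \qquad \text{(actor: policy gradient step)}
w \leftarrow w + \alpha_{\text{critic}}\, \hat{A}_t\, \nabla_w V_w(s_t) \qquad \text{(critic: value estimation step)}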
 
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)

Mnih et al. (2016)
 
DDPG

Off-policy learning using deterministic policy gradients.

Max Ferguson (2017)
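The deterministic policy gradient at the core of DDPG, in its standard form for a deterministic actor μ_θ and a critic Q_w trained off-policy from a replay buffer D:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[ \nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}\; \nabla_\theta \mu_\theta(s) \Big]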

