Deep Reinforcement Learning Overview and Applications
Delve into the world of deep reinforcement learning on the road to advanced AI systems like Skynet. Explore topics ranging from Markov decision processes and how to solve them to value functions and tabular solution methods. Compare the supervised, unsupervised, and reinforcement learning paradigms and their applications. See how the problem setting, the discount factor, and value functions shape efficient learning algorithms.
Presentation Transcript
DEEP REINFORCEMENT LEARNING: ON THE ROAD TO SKYNET! (UW CSE Deep Learning, Felix Leeb)
OVERVIEW
TODAY: MDPs (formalizing decisions); Function Approximation; Value Functions (DQN); Policy Gradients (REINFORCE, NPG); Actor Critic (A3C, DDPG)
NEXT TIME: Model-Based RL (forward/inverse); Planning (MCTS, MPPI); Imitation Learning (DAgger, GAIL); Advanced Topics (Exploration, MARL, Meta-learning, LMDPs)
PARADIGMS
Supervised Learning: classification, regression
Unsupervised Learning: inference, generation
Reinforcement Learning: prediction, control
(Example applications shown as images.)
PREDICTION vs. CONTROL: example applications (shown as images in the slides).
SETTING: The agent, acting according to its policy, interacts with an environment in a loop: at each step it receives a state (or observation) and a reward, and chooses an action that is applied to the environment.
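A minimal sketch of this interaction loop, assuming the classic Gym-style API (env.reset() returns an observation, env.step() returns four values) and a placeholder random policy; names here are illustrative, not from the slides.

```python
# Minimal agent-environment interaction loop (sketch, classic Gym API assumed).
import gym

def random_policy(observation, action_space):
    # Placeholder policy: ignores the observation and samples a random action.
    return action_space.sample()

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random_policy(obs, env.action_space)   # agent picks an action from its policy
    obs, reward, done, info = env.step(action)      # environment returns next state and reward
    total_reward += reward
print("episode return:", total_reward)
```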
MARKOV DECISION PROCESSES: An MDP is specified by a state space S, an action space A, a transition function T(s' | s, a) giving the probability of the next state, and a reward function R(s, a).
DISCOUNT FACTOR: We want to be greedy but not impulsive. Discounting also implicitly takes uncertainty in the dynamics into account. Mathematically, γ < 1 keeps infinite-horizon returns finite. Return: G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_k γ^k r_{t+k}.
SOLVING AN MDP: Objective: maximize the expected discounted return, J(π) = E[Σ_t γ^t r_t]. Goal: find the optimal policy π* = argmax_π J(π).
VALUE FUNCTIONS: Value = the expected return from a state, V(s). The Q function is the action-specific value function, Q(s, a). The advantage function, A(s, a) = Q(s, a) - V(s), measures how much more valuable an action is than the policy's average behavior. Because value depends on future rewards, it depends on the policy.
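A sketch of these definitions in the standard notation:

```latex
\begin{align*}
V^{\pi}(s)   &= \mathbb{E}_{\pi}\Big[\textstyle\sum_{k \ge 0} \gamma^{k} r_{t+k} \,\Big|\, s_t = s\Big] \\
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\Big[\textstyle\sum_{k \ge 0} \gamma^{k} r_{t+k} \,\Big|\, s_t = s,\ a_t = a\Big] \\
A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s)
\end{align*}
```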
TABULAR SOLUTION: POLICY ITERATION. Policy evaluation: compute V^π for the current policy, e.g. by repeatedly applying the Bellman expectation backup. Policy update: act greedily with respect to the resulting values, π(s) ← argmax_a [R(s, a) + γ Σ_s' T(s' | s, a) V(s')]. Alternate the two steps until the policy stops changing (see the sketch below).
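A minimal tabular policy-iteration sketch, assuming the MDP is given as explicit arrays T[s, a, s'] and R[s, a]; the array names and iteration counts are illustrative.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.99, eval_iters=100):
    """Tabular policy iteration. T: [S, A, S'] transition probs, R: [S, A] rewards."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for the current policy.
        V = np.zeros(S)
        for _ in range(eval_iters):
            V = R[np.arange(S), policy] + gamma * (T[np.arange(S), policy] @ V)
        # Policy improvement: act greedily with respect to the evaluated values.
        Q = R + gamma * (T @ V)          # shape [S, A]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```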
Q LEARNING: Learn action values directly from sampled transitions (s, a, r, s') with the update Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)].
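A tabular sketch of that update with epsilon-greedy exploration, again assuming a classic Gym-style environment with discrete states and actions (an illustrative setup, not the slide's code).

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            a = env.action_space.sample() if np.random.rand() < epsilon else Q[s].argmax()
            s_next, r, done, _ = env.step(a)
            # TD target uses the max over next actions (off-policy).
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```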
FUNCTION APPROXIMATION: Model: a parameterized value function Q_θ(s, a), e.g. a neural network. Training data: sampled transitions (s, a, r, s'). Loss function: L(θ) = E[(y - Q_θ(s, a))²], where the target is y = r + γ max_a' Q_θ(s', a').
IMPLEMENTATION: Action-in: the network takes (s, a) as input and outputs a single value. Action-out: the network takes s and outputs a Q value for every action. Off-policy learning: the target depends in part on our model rather than on the behavior policy, so old observations are still useful. Use a replay buffer of the most recent transitions as the dataset (see the sketch below).
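A minimal replay-buffer sketch (illustrative class, not from the slides):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores the most recent transitions and samples random minibatches for training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```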
DEEP Q NETWORKS (DQN): Mnih et al. (2015)
DQN ISSUES: Convergence is not guaranteed, so hope for deep magic! Stabilization tricks: a replay buffer, error clipping, reward scaling, and using replica (target) networks. Double Q-learning decouples action selection from value estimation (see the sketch below).
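A sketch of the double-DQN target computation, assuming an online network q_net and a periodically copied replica target_net (hypothetical names): the online net selects the action, the replica evaluates it.

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double-DQN bootstrapped targets for a minibatch of transitions."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)              # selection
        next_values = target_net(next_states).gather(1, best_actions).squeeze(1)   # evaluation
        return rewards + gamma * (1.0 - dones) * next_values
```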
POLICY GRADIENTS: Parameterize the policy and update those parameters directly. This enables new kinds of policies: stochastic policies and continuous action spaces. On-policy learning: learn directly from your own actions.
POLICY GRADIENTS: Approximate the expectation value from samples: the gradient of the expected return can be written as an expectation over trajectories of the current policy, so it can be estimated from rollouts.
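In the usual score-function (likelihood-ratio) form, which the sampled estimate approximates:

```latex
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\Big(\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\Big) R(\tau)\right]
  \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_{\theta} \log \pi_{\theta}\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)
```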
REINFORCE: Sutton et al. (2000)
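A minimal REINFORCE update sketch in PyTorch, assuming a policy network that has already produced per-step log-probabilities for one episode; the function and argument names are illustrative.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE gradient step from a single episode.
    log_probs: list of log pi(a_t | s_t) tensors; rewards: list of floats."""
    # Compute discounted returns-to-go for each time step.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Ascend the policy gradient: minimize -sum(log_prob * return).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```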
VARIANCE REDUCTION: Constant offsets in the returns make it harder to identify the right update direction. Remove the offset by subtracting a baseline: the a priori value of each state, b(s) = V(s), which reduces variance without biasing the gradient.
ADVANCED POLICY GRADIENT METHODS: For stochastic functions, the gradient is not the best direction; consider the KL divergence between successive policies.
NPG: approximating the Fisher information matrix.
TRPO: computing gradients with a KL constraint.
PPO: gradients with a KL penalty.
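A sketch of the standard formulations behind these three methods, with probability ratio r_t(θ) = π_θ(a_t|s_t)/π_old(a_t|s_t) and advantage estimate Â_t (notation assumed, not taken from the slides):

```latex
\begin{align*}
\text{NPG:}\ & \theta \leftarrow \theta + \alpha\, F^{-1} \nabla_{\theta} J(\theta),
  \quad F = \mathbb{E}\big[\nabla_{\theta}\log\pi_{\theta}\, \nabla_{\theta}\log\pi_{\theta}^{\top}\big] \\
\text{TRPO:}\ & \max_{\theta}\ \mathbb{E}\big[r_t(\theta)\, \hat{A}_t\big]
  \quad \text{s.t.}\ \ \mathbb{E}\big[\mathrm{KL}(\pi_{\text{old}} \,\|\, \pi_{\theta})\big] \le \delta \\
\text{PPO (KL penalty):}\ & \max_{\theta}\ \mathbb{E}\big[r_t(\theta)\, \hat{A}_t
  - \beta\, \mathrm{KL}(\pi_{\text{old}} \,\|\, \pi_{\theta})\big]
\end{align*}
```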
ADVANCED POLICY GRADIENT METHODS: Heess et al. (2017), Rajeswaran et al. (2017)
ACTOR CRITIC: The actor proposes actions and is trained with a policy gradient update; the critic is trained with a Q-learning (value) update and estimates the advantage of those actions, which is fed back to the actor (see the sketch below).
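A minimal one-step advantage actor-critic update sketch in PyTorch; the actor is assumed to have already produced the log-probability of the taken action, and the critic outputs a state value (illustrative names).

```python
import torch
import torch.nn.functional as F

def actor_critic_step(critic, opt_actor, opt_critic,
                      state, action_log_prob, reward, next_state, done, gamma=0.99):
    """One-step advantage actor-critic update from a single transition."""
    value = critic(state)
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * critic(next_state)   # TD target
        advantage = target - value                                    # critic's advantage estimate
    # Critic: regress the state value toward the TD target.
    critic_loss = F.mse_loss(value, target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    # Actor: policy gradient weighted by the (detached) advantage.
    actor_loss = -action_log_prob * advantage
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```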
ASYNC ADVANTAGE ACTOR-CRITIC (A3C): Mnih et al. (2016)
DDPG: Off-policy learning using deterministic policy gradients. Max Ferguson (2017)
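A sketch of the core DDPG updates, assuming a deterministic actor μ(s) and a critic Q(s, a) as PyTorch modules, each with a target copy; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                opt_actor, opt_critic, batch, gamma=0.99):
    """One DDPG update from a replay-buffer minibatch (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    # Critic: regress Q(s, a) toward the bootstrapped target computed with the target networks.
    with torch.no_grad():
        q_target = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), q_target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    # Actor: deterministic policy gradient, ascend Q(s, mu(s)) by minimizing its negation.
    actor_loss = -critic(s, actor(s)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```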