Deep Reinforcement Learning Overview and Applications

DEEP REINFORCEMENT LEARNING
ON THE ROAD TO SKYNET!

UW CSE Deep Learning – Felix Leeb
 
OVERVIEW

TODAY
- MDPs – formalizing decisions
- Function Approximation
- Value Function – DQN
- Policy Gradients – REINFORCE, NPG
- Actor Critic – A3C, DDPG

NEXT TIME
- Model Based RL – forward/inverse
- Planning – MCTS, MPPI
- Imitation Learning – DAgger, GAIL
- Advanced Topics – Exploration, MARL, Meta-learning, LMDPs…
 
Paradigm                  Objective
Supervised Learning       Classification, Regression
Unsupervised Learning     Inference, Generation
Reinforcement Learning    Prediction, Control

(Example applications for each paradigm were shown as images.)
 
Prediction vs. Control (contrasted with example figures)
 
SETTING

The agent interacts with the environment in a loop: the environment provides a state/observation and a reward, and the agent responds with an action chosen using its policy.

MARKOV DECISION PROCESSES

This setting is formalized as an MDP with a state space, an action space, a transition function, and a reward function.
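For reference, a minimal sketch of the usual formalization behind these four ingredients (standard notation, not copied from the slides):

\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a), \qquad R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}

The Markov property is that the next state depends only on the current state and action, not on the rest of the history.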
 
DISCOUNT FACTOR

- We want to be greedy but not impulsive.
- Implicitly takes uncertainty in the dynamics into account.
- Mathematically: γ < 1 allows infinite-horizon returns.

Return:
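The return referred to above is typically written as the discounted sum of future rewards (standard notation, not taken from the slide):

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots

For γ < 1 and bounded rewards this geometric series converges, which is what makes infinite-horizon returns well defined.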
 
SOLVING AN MDP

Objective:
Goal:
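A standard way to write this objective and goal (my notation, consistent with the return defined above):

J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big], \qquad \pi^* = \arg\max_{\pi} J(\pi)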
 
VALUE FUNCTIONS

- Value = expected gain of a state.
- Q function – action-specific value function.
- Advantage function – how much more valuable is an action.
- Value depends on future rewards, which in turn depend on the policy.
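Standard definitions consistent with the descriptions above:

V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big]
Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s,\, a_t = a\Big]
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)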
 
TABULAR SOLUTION: POLICY ITERATION

Alternate between:
- Policy Evaluation
- Policy Update

(A code sketch of this loop follows below.)
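A minimal tabular policy-iteration sketch, assuming a small finite MDP given as NumPy arrays P (transition probabilities) and R (expected rewards); the function and variable names are illustrative, not from the lecture:

import numpy as np

def policy_iteration(P, R, gamma=0.99):
    # P: (S, A, S) transition probabilities, R: (S, A) expected rewards
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly
        P_pi = P[np.arange(S), pi]           # (S, S) transitions under pi
        R_pi = R[np.arange(S), pi]           # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy update: act greedily with respect to the resulting Q values
        Q = R + gamma * P @ V                # (S, A) action values
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):       # no change => policy has converged
            return pi, V
        pi = new_pi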
 
Q LEARNING
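For reference, the standard tabular Q-learning update, with learning rate α:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]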
 
FUNCTION APPROXIMATION

- Model: a parameterized Q function.
- Training data: observed transitions.
- Loss function: see the standard form below.
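A standard form of this loss for Q-learning with function approximation (my reconstruction; θ are the model parameters and θ⁻ a periodically copied "replica" used to compute the target):

L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[ \big( y - Q_\theta(s, a) \big)^2 \Big], \qquad \text{where } y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')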
 
IMPLEMENTATION

Action-in vs. action-out architectures (the action as an input to the Q network vs. one output per action).

Off-Policy Learning
- The target depends in part on our model, so old observations are still useful.
- Use a Replay Buffer of the most recent transitions as the dataset (a sketch follows below).
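A minimal replay-buffer sketch along these lines (illustrative; the class and method names are my own):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a minibatch of stored transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)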
 
DEEP Q NETWORKS (DQN)

Mnih et al. (2015)
 
DQN ISSUES

Convergence is not guaranteed – hope for deep magic! Common stabilizers:
- Replay Buffer
- Error Clipping
- Reward scaling
- Using replicas (a separate target network)
- Double Q Learning – decouple action selection and value estimation (see the target below)
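One standard way to write the double Q-learning target, where the online network selects the action and the replica evaluates it (notation mine):

y = r + \gamma\, Q_{\theta^-}\big(s',\, \arg\max_{a'} Q_\theta(s', a')\big)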
 
POLICY GRADIENTS

- Parameterize the policy and update those parameters directly.
- Enables new kinds of policies: stochastic, continuous action spaces.
- On-policy learning: learn directly from your actions.
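For reference, the resulting policy-gradient expression in its standard form (not reproduced from the slide):

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big]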
 
POLICY GRADIENTS

Approximate the expectation value from samples, i.e., from trajectories rolled out with the current policy.
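A standard Monte Carlo estimate of this expectation from N sampled trajectories, using the observed returns G_t (my notation):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, G_t^i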
 
REINFORCE

Sutton et al. (2000)
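A compact REINFORCE sketch under common assumptions (PyTorch; a discrete-action, old-style Gym environment where reset returns an observation and step returns a 4-tuple; all names are illustrative):

import torch

def reinforce_episode(policy, optimizer, env, gamma=0.99):
    # Roll out one episode with the current stochastic policy
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # Compute the discounted return G_t for every time step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    # Gradient ascent on sum_t log pi(a_t | s_t) * G_t (minimize the negative)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()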
 
VARIANCE REDUCTION

- Constant offsets in the return make it harder to distinguish the right update direction.
- Remove the offset by subtracting the a priori value of each state.
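In the standard formulation this means subtracting a state-dependent baseline b(s), usually an estimate of the state value, which leaves the gradient unbiased:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big( G_t - b(s_t) \big) \big], \qquad b(s_t) \approx V^\pi(s_t)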
 
ADVANCED POLICY GRADIENT METHODS

- For stochastic functions, the plain gradient is not the best direction.
- Consider the KL divergence between successive policies:
  - NPG – approximating the Fisher information matrix
  - TRPO – computing gradients under a KL constraint
  - PPO – gradients with a KL penalty
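A sketch of the update behind these methods, with F the Fisher information matrix and δ a trust-region size (standard form, not copied from the slides):

\max_{\theta'}\; \nabla_\theta J(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta'}\big) \le \delta
\;\Rightarrow\; \theta' = \theta + \alpha\, F^{-1} \nabla_\theta J(\theta), \qquad F = \mathbb{E}\big[ \nabla_\theta \log \pi_\theta\, \nabla_\theta \log \pi_\theta^\top \big]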
 
ADVANCED POLICY GRADIENT METHODS

Rajeswaran et al. (2017); Heess et al. (2017)
 
ACTOR CRITIC

- Critic: estimates the advantage, trained with a Q-learning style update.
- Actor: proposes actions, trained with a policy gradient update.
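One common way to write the two updates, using the TD error as an advantage estimate (an advantage actor-critic variant; notation mine):

\hat{A}_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)
\theta \leftarrow \theta + \alpha_{\text{actor}}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \qquad \text{(actor: policy gradient step)}
w \leftarrow w + \alpha_{\text{critic}}\, \hat{A}_t\, \nabla_w V_w(s_t) \qquad \text{(critic: value estimation step)}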
 
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)

Mnih et al. (2016)
 
DDPG

Off-policy learning using deterministic policy gradients.

Max Ferguson (2017)
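The deterministic policy gradient at the core of DDPG, in its standard form for a deterministic actor μ_θ and a critic Q_w trained off-policy from a replay buffer D:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[ \nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}\; \nabla_\theta \mu_\theta(s) \Big]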

