Reinforcement Learning for Long-Horizon Tasks and Markov Decision Processes


Delve into the world of reinforcement learning, where tasks are accomplished by generating policies in a Markov Decision Process (MDP) environment. Understand the concepts of MDP, transition probabilities, and generating optimal policies in unknown and known environments. Explore algorithms and toolkits like Q-learning and neural networks for deep reinforcement learning. Discover the challenges in handling long-horizon tasks, manipulation, planning, and multi-agent systems that push the boundaries of state-of-the-art solutions.





Presentation Transcript


  1. Reinforcement Learning for long-horizon tasks Suguman Bansal

  2. Reinforcement Learning (RL)
     Given a task, generate a policy that accomplishes it.
     [Diagram: agent-environment loop; the policy selects an action, the environment returns the next state and a reward]

  3. Markov Decision Process (MDP)
     The environment is a Markov Decision Process (MDP) M = (S, A, P, η, R):
     S is the set of states
     A is the set of actions
     P : S × A × S → [0,1] is the transition probability; P(s, a, s') is the probability of transitioning from s to s' on a
     η : S → [0,1] is the initial state distribution
     R : S × A → ℝ is the reward function
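
A minimal sketch of how such an MDP could be represented in code, using plain dictionaries; the two-state example and all numbers are illustrative assumptions, not from the slides:

    # Illustrative finite MDP M = (S, A, P, eta, R); all values are made up.
    states = ["s0", "s1"]
    actions = ["a", "b"]

    # P[(s, a, s_next)] = probability of moving from s to s_next on action a
    P = {
        ("s0", "a", "s0"): 0.2, ("s0", "a", "s1"): 0.8,
        ("s0", "b", "s0"): 1.0,
        ("s1", "a", "s1"): 1.0,
        ("s1", "b", "s0"): 0.5, ("s1", "b", "s1"): 0.5,
    }

    eta = {"s0": 1.0, "s1": 0.0}                     # initial state distribution
    R = {("s0", "a"): 0.0, ("s0", "b"): 0.0,
         ("s1", "a"): 1.0, ("s1", "b"): 0.0}         # reward R(s, a)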

  4. RL Problem
     Given: an MDP, assuming the transition probabilities P are unknown.
     Generate: a policy π : S → D(A) (D(A) denotes distributions over actions) s.t. it optimizes the (expected) discounted-sum reward.

  5. If transition probabilities were known...
     Given: an MDP with known transition probabilities, this becomes an optimization problem.
     Generate: a policy π : S → D(A) s.t. it optimizes the (expected) discounted-sum reward.
     Solve via the Bellman equations (value iteration, policy iteration):
     Val_n(s) = max_a [ R(s, a) + γ · Σ_{s'} P(s, a, s') · Val_{n-1}(s') ]
     A deterministic, memoryless optimal policy exists [Shapley].
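
A minimal value-iteration sketch over the dictionary representation above; the discount factor and stopping threshold are illustrative assumptions:

    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        # Iterate Val_n(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s,a,s') * Val_{n-1}(s') ]
        V = {s: 0.0 for s in states}
        while True:
            V_new = {
                s: max(R[(s, a)] + gamma * sum(P.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                       for a in actions)
                for s in states
            }
            if max(abs(V_new[s] - V[s]) for s in states) < tol:
                return V_new
            V = V_new

The (deterministic, memoryless) optimal policy can then be read off by taking, in each state, an action that attains the maximum.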

  6. RL Problem
     Given: an MDP, assuming the transition probabilities P are unknown.
     Generate: a policy π : S → D(A) s.t. it optimizes the (expected) discounted-sum reward.
     A simulator of the MDP is available (the current state is always known): sample actions from the current state, and reset to the initial state(s).
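
A sketch of the simulator interface this setting assumes: the learner can only reset and step (sampling a next state and reward), never read P directly. The class and method names are illustrative assumptions:

    import random

    class MDPSimulator:
        # Hides the transition probabilities; exposes only sampling.
        def __init__(self, states, actions, P, R, eta):
            self._states, self.actions = states, actions
            self._P, self._R, self._eta = P, R, eta
            self.state = None

        def reset(self):
            # Sample an initial state from eta.
            weights = [self._eta[s] for s in self._states]
            self.state = random.choices(self._states, weights=weights)[0]
            return self.state

        def step(self, action):
            # Return the reward R(s, a) and sample the next state from P(s, a, .).
            reward = self._R[(self.state, action)]
            weights = [self._P.get((self.state, action, s2), 0.0) for s2 in self._states]
            self.state = random.choices(self._states, weights=weights)[0]
            return self.state, reward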

  7. RL Algorithms and Toolkits
     Sampling-based algorithms: simulate to estimate the transition probabilities, then generate an optimal policy of the estimated MDP.
     Q-learning :: asymptotic guarantees [Watkins and Dayan. ML 1992]
     Explore-Exploit-Estimate :: PAC guarantees [Kearns and Singh. ML 2002]
     How to represent the policy π?
     Finite-state MDP :: tabular
     Infinite-state MDP :: function approximation; neural network: deep RL
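
A minimal tabular Q-learning sketch against a simulator like the one above; the learning rate, exploration rate, and episode settings are illustrative assumptions:

    import random

    def q_learning(sim, episodes=1000, horizon=100, gamma=0.9, alpha=0.1, eps=0.1):
        # Q[(s, a)] estimates the optimal discounted-sum value of taking a in s.
        Q = {}
        for _ in range(episodes):
            s = sim.reset()
            for _ in range(horizon):
                # Epsilon-greedy action selection.
                if random.random() < eps:
                    a = random.choice(sim.actions)
                else:
                    a = max(sim.actions, key=lambda act: Q.get((s, act), 0.0))
                s2, r = sim.step(a)
                best_next = max(Q.get((s2, act), 0.0) for act in sim.actions)
                # Move Q(s, a) toward the one-step Bellman target.
                old = Q.get((s, a), 0.0)
                Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
                s = s2
        return Q

For an infinite-state MDP, the table Q would be replaced by a function approximator such as a neural network (deep RL).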

  8. Challenge: RL for long-horizon tasks
     Manipulation and planning; games and multi-agent systems.
     SoTA requires an astronomical number of samples, if the task is solvable at all.

  9. A. Expressiveness
     It is cumbersome to express long-horizon tasks as rewards.
     Task: reach t, vs. task: reach t1 then reach t2.
     def reward_function(s):
         if s.at(t): return 1
         else: return 0
     [Diagram: gridworld with an initial state and targets, contrasting the desired policy with the learnt policy]
     Temporal logics vs. rewards [Li, Vasile, Belta. IROS 17] [Camacho et al. IJCAI 19] [Bozkurt et al. 2019] [Hahn et al. TACAS 19, ATVA 20] [Balakrishnan and Deshmukh. IROS 19] [Kalagarla, Jain, and Nuzzo. CDC 2021] [Jothimurugan, Bansal, Bastani, Alur. NeurIPS 2021]
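
To make the contrast concrete, here is a sketch of what the reward for the sequential task "reach t1 then reach t2" tends to look like; the at predicate and target names follow the slide's example and are otherwise hypothetical. The reward can no longer depend on the current state alone, so extra bookkeeping has to be threaded through it:

    # Reward for "reach t1 then reach t2": the current state alone is not enough,
    # so a flag recording whether t1 has already been reached is carried along.
    def reward_function(s, reached_t1):
        if not reached_t1 and s.at(t1):
            return 0, True          # progress toward the task, no final reward yet
        if reached_t1 and s.at(t2):
            return 1, True          # task completed
        return 0, reached_t1        # no progress this step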

  10. B. Scalability
     I. Sparsity of rewards. II. RL is greedy.
     Approach: RL + Planning + Compositional RL [Jothimurugan, Bansal, Bastani, Alur. NeurIPS 2021]
     Example tasks: "Visit S1 or S2, then visit S3, while avoiding O" vs. "Visit S1 or S2 while avoiding O".
     [Plots: probability of success vs. number of samples for the two tasks. On the shorter task the agent learns to visit S2 via an obstacle-free path; on the longer task that choice is futile, and it is better to learn to visit S1]
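
A rough sketch of the compositional idea: split the specification into subtasks, learn a policy for each with an off-the-shelf RL routine, and let a planner stitch them together. The decomposition and helper names below are illustrative assumptions, not the algorithm of the cited paper:

    # Illustrative decomposition of "visit S1 or S2, then visit S3, avoiding O".
    subtasks = [
        ("init", "S1"), ("init", "S2"),   # first reach S1 or S2
        ("S1", "S3"), ("S2", "S3"),       # then reach S3
    ]

    def train_compositional(make_env, learn):
        # make_env(start, goal, avoid) -> environment; learn(env) -> policy.
        # Both are placeholders for the user's environment builder and RL algorithm.
        policies = {}
        for (src, dst) in subtasks:
            policies[(src, dst)] = learn(make_env(start=src, goal=dst, avoid="O"))
        # A high-level planner can then pick the most reliable path init -> {S1, S2} -> S3.
        return policies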

  11. C. No Guarantees
     Theorem [Alur, Bansal, Bastani, Jothimurugan. Henzinger-60. CoRR 2021 (to appear)]: There cannot exist a PAC algorithm for RL from temporal specifications.
     Spec: Always (Not Unsafe)
     [Diagram: MDP with Init, Safe, and Unsafe states and unknown transition probabilities]
     How to obtain guarantees, especially for NN-enabled policies?
     Explanations of learnt policies; interpretable summaries of policies.
