Intelligent Systems and Reinforcement Learning Overview

This lecture situates reinforcement learning within the AI-2 big picture (Value Iteration, Partially Observable MDPs, hybrid deterministic/stochastic approaches, and applications of AI), reviews Markov decision processes (MDPs) and how reinforcement learning differs from them, looks briefly at Policy Search based on stochastic local search, and then develops Q-learning via estimation by temporal differences.

  • Intelligent Systems
  • Reinforcement Learning
  • Computer Science
  • AI Applications
  • Markov Decision Processes


Presentation Transcript


  1. Intelligent Systems (AI-2), Computer Science CPSC 422, Lecture 7. Jan 25, 2021.

  2. Course Announcements. Assignment 1 has been posted (due Sep 27): ValueOfInfo and ValueOfControl; MDPs: Value Iteration; POMDPs: Belief State Update.

  3. 422 big picture: StarAI (statistical relational AI); Hybrid: Deterministic + Stochastic. [Overview diagram of representations and reasoning techniques.] Deterministic: Logics, First Order Logics, Ontologies, Query, Full Resolution, SAT. Stochastic: Belief Nets (Approx.: Gibbs); Markov Chains and HMMs (Forward, Viterbi; Approx.: Particle Filtering); Undirected Graphical Models (Markov Networks, Conditional Random Fields); Markov Decision Processes and Partially Observable MDPs (Planning, Value Iteration, Approx. Inference); Reinforcement Learning; Parsing (Prob CFG); Prob Relational Models; Markov Logics. Applications of AI.

  4. Lecture Overview: Start Reinforcement Learning; Start Q-learning; Estimate by Temporal Differences.

  5. MDP and Reinforcement Learning (RL). A Markov decision process consists of: a set of states S, a set of actions A, transition probabilities to next states P(s' | s, a), and a reward function R(s), R(s, a), or R(s, a, s'). RL is based on MDPs, but the transition model is not known and the reward model is not known. While for MDPs we can compute an optimal policy, RL learns an optimal policy.
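  As a concrete illustration (a minimal sketch with made-up state and action names, not from the slides), here is what the full MDP model specifies and what an RL agent actually gets to see:

      # A tiny hand-made MDP (hypothetical two-state example).
      # The full model specifies P(s'|s,a) and R(s), so an optimal policy
      # could be computed offline, e.g. with Value Iteration.
      states = ["s0", "s1"]
      actions = ["stay", "move"]

      P = {  # P[(s, a)] is a dict {s': probability}
          ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
          ("s0", "move"): {"s0": 0.2, "s1": 0.8},
          ("s1", "stay"): {"s1": 1.0},
          ("s1", "move"): {"s0": 0.5, "s1": 0.5},
      }
      R = {"s0": 0.0, "s1": 1.0}  # reward R(s)

      # In RL neither P nor R is given; the agent only observes experience
      # tuples <s, a, r, s'> while acting, e.g. ("s0", "move", 0.0, "s1").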

  6. Search-Based Approaches to RL. Policy Search (stochastic local search): start with an arbitrary policy; to evaluate a policy, try it out in the world; generate some neighbours, and so on (a small sketch follows below). Problems with evolutionary algorithms: the policy space can be huge (with n states and m actions there are m^n deterministic policies), and policies are evaluated as a whole, so locally good/bad behaviours cannot be taken into account directly.
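  To make the search concrete, here is a minimal sketch of policy search as stochastic local search, assuming a hypothetical evaluate(policy) helper that tries the policy out in the world and returns an empirical return:

      import random

      def policy_search(states, actions, evaluate, iterations=100):
          # Start with an arbitrary deterministic policy: one action per state.
          policy = {s: random.choice(actions) for s in states}
          best_value = evaluate(policy)          # try it out in the world
          for _ in range(iterations):
              # Neighbour: change the action prescribed in one random state.
              neighbour = dict(policy)
              neighbour[random.choice(states)] = random.choice(actions)
              value = evaluate(neighbour)        # whole-policy evaluation
              if value > best_value:             # keep the neighbour if it did better
                  policy, best_value = neighbour, value
          return policy

  Because each candidate policy is scored only as a whole, this kind of search cannot credit or blame the locally good or bad behaviours inside it, which is exactly the limitation noted above.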

  7. Q-learning. Contrary to search-based approaches, Q-learning learns after every action, and it learns components of a policy rather than the policy itself: Q(s,a) = expected value of doing action a in state s and then following the optimal policy.

  8. Q values. The Q-function can be stored as a table with one entry Q[s,a] per state-action pair: row a0: Q[s0,a0], Q[s1,a0], ..., Q[sk,a0]; row a1: Q[s0,a1], Q[s1,a1], ..., Q[sk,a1]; ...; row an: Q[s0,an], Q[s1,an], ..., Q[sk,an]. If the agent had the complete Q-function, would it know how to act in every state? A. Yes B. No. But how to learn the Q-values?
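  Assuming the Q-table is stored as a dictionary keyed by (state, action) pairs (a representation choice, not dictated by the slides), acting from a complete Q-function is just a greedy lookup:

      def greedy_action(Q, s, actions):
          # Pick the action with the highest estimated Q[s, a] in state s.
          return max(actions, key=lambda a: Q[(s, a)])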

  9. Q values. Q(s,a) are known as Q-values, and are related to the utility of state s as follows:
     (1)  $Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')$
     (2)  $V(s) = \max_a Q(s,a)$
     From (1) and (2) we obtain a constraint between the Q value in state s and the Q values of the states reachable from s by doing a:
     $Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s',a')$
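  A sketch of equations (1) and (2) in code, reusing the dictionary representation of P and R from the earlier MDP sketch (that representation is an assumption, not the slides' notation):

      def q_from_v(P, R, V, gamma, s, a):
          # Equation (1): Q(s,a) = R(s) + gamma * sum_{s'} P(s'|s,a) * V(s')
          return R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

      def v_from_q(Q, s, actions):
          # Equation (2): V(s) = max_a Q(s,a)
          return max(Q[(s, a)] for a in actions)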

  10. Learning the Q values. Can we exploit the relation between Q values in adjacent states? A. Yes B. No
      $Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s',a')$

  11. Average Through Time. Suppose we have a sequence of values (your sample data): v1, v2, ..., vk, and want a running approximation of their expected value; e.g., given a sequence of grades, estimate the expected value of the next grade. A reasonable estimate is the average of the first k values:
      $A_k = \dfrac{v_1 + v_2 + \cdots + v_k}{k}$

  12. Average Through Time.
      $A_k = \dfrac{v_1 + v_2 + \cdots + v_k}{k}$
      and equivalently: $k A_k = v_1 + v_2 + \cdots + v_{k-1} + v_k$.
      Similarly for k-1: $(k-1) A_{k-1} = v_1 + v_2 + \cdots + v_{k-1}$,
      which replaced in the equation above gives: $k A_k = (k-1) A_{k-1} + v_k$.
      Dividing by k we get: $A_k = \left(1 - \tfrac{1}{k}\right) A_{k-1} + \tfrac{v_k}{k}$,
      and if we set $\alpha_k = 1/k$:
      $A_k = (1 - \alpha_k) A_{k-1} + \alpha_k v_k = A_{k-1} + \alpha_k (v_k - A_{k-1})$
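  A small numeric check of the incremental form (a sketch; the grades are made-up sample data):

      def running_average(values):
          # A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}), with alpha_k = 1/k.
          A = 0.0
          for k, v in enumerate(values, start=1):
              alpha = 1.0 / k
              A = A + alpha * (v - A)
          return A

      print(running_average([80, 90, 70]))  # 80.0, the ordinary mean of the three grades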

  13. Estimate by Temporal Differences.
      $A_k = A_{k-1} + \alpha_k (v_k - A_{k-1})$
      $(v_k - A_{k-1})$ is called a temporal difference error, or TD-error: it specifies how different the new value $v_k$ is from the prediction given by the previous running average $A_{k-1}$. The new estimate (average) is obtained by updating the previous average by $\alpha_k$ times the TD error.

  14. Q-learning: General Idea. Learn from the history of interaction with the environment, i.e., a sequence of state-action-rewards <s0, a0, r1, s1, a1, r2, s2, a2, r3, ...>. The history is seen as a sequence of experiences, i.e., tuples <s, a, r, s'>: the agent doing action a in state s, receiving reward r and ending up in s'. These experiences are used to estimate the value of Q(s,a).

  15. Q-learning: General Idea. But
      $Q(s,a) = r + \gamma \max_{a'} Q(s',a')$
      is an approximation. The real link between Q(s,a) and Q(s',a') is
      $Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s',a')$

  16. Q-learning: Main steps. Store Q[S, A] for every state S and action A in the world. Start with arbitrary estimates in Q(0)[S, A], then update them by using experiences. Each experience <s, a, r, s'> provides one new data point on the actual value of Q[s, a]:
      new data point for Q[s,a]: $r + \gamma \max_{a'} Q[s',a']$,
      which uses the current estimated value of Q[s',a'], where s' is the state the agent arrives to in the current experience.

  17. Q-learning: Update step. The TD formula
      $A_k = A_{k-1} + \alpha_k (v_k - A_{k-1})$
      applied to Q[s,a] gives, for the experience <s, a, r, s'>:
      $Q^{(i)}[s,a] = Q^{(i-1)}[s,a] + \alpha_k \left( \left( r + \gamma \max_{a'} Q^{(i-1)}[s',a'] \right) - Q^{(i-1)}[s,a] \right)$
      Here $r + \gamma \max_{a'} Q^{(i-1)}[s',a']$ is the new value for Q[s,a] from <s,a,r,s'>, $Q^{(i-1)}[s,a]$ is the previous estimated value of Q[s,a], and $Q^{(i)}[s,a]$ is the updated estimated value of Q[s,a].
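  The same update written as code, for a single experience <s, a, r, s'> (a sketch: alpha and gamma are assumed to be given, and Q is the dictionary Q-table used in the earlier sketches):

      def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
          # TD target: the new data point r + gamma * max_{a'} Q[s', a'].
          td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
          # Move the old estimate toward the target by alpha times the TD error.
          Q[(s, a)] = Q[(s, a)] + alpha * (td_target - Q[(s, a)])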

  18. Q-learning: algorithm. [Algorithm pseudocode shown on the slide; a code sketch follows below.]
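  The slide's pseudocode is not reproduced in this transcript, so the following is a generic Q-learning loop sketch under assumed interfaces: the env object with reset() and step() is hypothetical, and a fixed learning rate with epsilon-greedy exploration is one common choice, not necessarily the slide's.

      import random

      def q_learning(env, states, actions, episodes=500,
                     alpha=0.1, gamma=0.9, epsilon=0.1):
          # Arbitrary initial estimates Q[s, a] = 0 for every state-action pair.
          Q = {(s, a): 0.0 for s in states for a in actions}
          for _ in range(episodes):
              s = env.reset()                    # start state of an episode (assumed interface)
              done = False
              while not done:
                  # Epsilon-greedy: mostly exploit current Q, sometimes explore.
                  if random.random() < epsilon:
                      a = random.choice(actions)
                  else:
                      a = max(actions, key=lambda act: Q[(s, act)])
                  r, s_next, done = env.step(a)  # one experience <s, a, r, s'> (assumed interface)
                  td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                  Q[(s, a)] += alpha * (td_target - Q[(s, a)])
                  s = s_next
          return Q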

  19. Learning Goals for today's class. You can: describe and criticize search-based approaches to RL; motivate Q-learning; justify Estimate by Temporal Differences; explain, trace and implement Q-learning.

  20. TODO for Wed. Do Practice Exercise on Reinforcement Learning: Exercise 11.A: Q-learning, http://www.aispace.org/exercises.shtml. Keep working on Assignment 1!
