Challenges in Decision Making: Understanding the Complexity of Making Good Choices


Making good decisions means maximizing future rewards while dealing with uncertainty, many possible states and actions, and high stakes. Reinforcement learning offers a set of algorithms for learning how to make good decisions. The difficulty lies in estimating future rewards, coping with uncertain outcomes, and navigating a large space of states and actions, all while the stakes of choosing correctly are high. This complexity is exemplified by scenarios such as proposing marriage, where the decision can significantly affect one's happiness and future.



Presentation Transcript


  1. Decision making 101. Peter Latham, Gatsby Computational Neuroscience Unit, UCL. October 30, 2010.

  2. Why is it hard to make good decisions? Because we have to maximize future rewards. Do you propose marriage now and risk a lifetime of misery and suffering, or wait a little longer and risk your fiancée running off with your best friend?

  3. Reinforcement learning is a set of algorithms for learning how to make good decisions. [Diagram: states S1–S6, actions between them, and rewards r12, r15, r26, r43, r54, r56 attached to the transitions.]

  4. Reinforcement learning is a set of algorithms for learning how to make good decisions. [Same state/action/reward diagram as the previous slide.]

  5. Reinforcement learning is a set of algorithms for learning how to make good decisions. [Same state/action/reward diagram.]

  6. Do you propose marriage now and risk a lifetime of misery and suffering, or wait a little longer and risk your fiancée running off with your best friend? [Decision tree: state S1 = "in love but worried about an impending breakup"; choices "propose" / "don't propose"; branch labels include "yes", "no breakup", "20 years later"; outcomes include "in love and happy", "happy", and "miserable".]

  7. Decision-making is hard because: future rewards have to be estimated; rewards are uncertain; there are lots of states; there are lots of possible actions. But the stakes are high: if you don't choose the right actions, you risk a life of misery. And as a species, you die out.

  8. Outline 1. General theory of decision-making (reinforcement learning). 2. Decision-making under time pressure: deciding when to decide. a. experiments b. theory

  9. The general problem: [Diagram: states S1–S6, actions, and rewards r12, r15, r26, r43, r54, r56.]

  10. How do we figure out the optimal policy? Policy is just another name for a sequence of decisions! [Decision tree: from S1, reward 1 leads to S2 and reward 5 leads to S3; from S2, rewards 10 and 2 lead to leaf states; from S3, rewards 3 and −1 lead to leaf states S4–S7.]

  11. How do we figure out the optimal policy? Vi* = value of state i under the optimal policy, i.e. the policy that maximizes reward (or, in more complicated games like life, the policy that maximizes reward per unit time). [Same decision tree.]

  12. How do we figure out the optimal policy? Trial and error! Vi* = value of state i under the optimal policy. On the example tree: V1* = 11, V2* = 10, V3* = 3, V4* = V5* = V6* = V7* = 0. [Same decision tree.]
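
The optimal values above can be reproduced by backing rewards up from the leaves. Below is a minimal Python sketch, assuming the edge layout described on slide 10 (my reading of the garbled diagram); the state names and rewards come from the slides, everything else is illustrative.

```python
# Backward induction on the small decision tree from the slides.
# Each state maps to a list of (reward, next_state) choices; leaves map to [].
tree = {
    "S1": [(1, "S2"), (5, "S3")],
    "S2": [(10, "S4"), (2, "S5")],
    "S3": [(3, "S6"), (-1, "S7")],
    "S4": [], "S5": [], "S6": [], "S7": [],
}

def optimal_value(state):
    """Vi*: total (undiscounted) reward from `state` under the best choices."""
    choices = tree[state]
    if not choices:
        return 0
    return max(r + optimal_value(s) for r, s in choices)

print(optimal_value("S1"))  # 11 -- take the small reward 1 first, then the 10
print(optimal_value("S2"))  # 10
print(optimal_value("S3"))  # 3
```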

  13. How do we figure out the optimal policy? Trial and error! Lessons: 1. Greedy algorithms generally don't work very well in the long run. 2. It's one of the things that makes decision-making hard. [Same decision tree.]

  14. There are several other things that make decision-making hard.

  15. The state space is generally very large. What determines your current state: surroundings (room, people, furniture, lighting, ...); internal state (hungry, bored, stressed, ...); external events (brexit, tube strike, brexit, your favourite football team is playing, brexit, ...). On the example tree: the next layer is S8–S15, the layer after that S16–S31; n layers give 2^n − 1 states.

  16. The state space is generally very large and loopy. Rewards are typically probabilistic. In realistic situations, the game doesn't end until you die! [Diagram: the same states S1–S7, now with loops and additional rewards 6, 12, 0, 8, −2.]

  17. The state space is generally very large and loopy. Rewards are typically probabilistic. State transitions are typically probabilistic. Example: one transition pays 500 with probability 0.1 and 5 with probability 0.9; average reward: 0.1 × 500 + 0.9 × 5 = 54.50. [Same loopy diagram.]

  18. The state space is generally very large and loopy. Rewards are typically probabilistic. State transitions are typically probabilistic. Example policy: leave for the train station at 8 AM. With probability 0.9 you catch the 8:10 train; with probability 0.1 you miss it and take the 8:40. [Diagram: the transition from S1 now splits with p = 0.9 and p = 0.1, with a −10 reward attached to one branch.]

  19. The state space is generally very large and loopy. Rewards are typically probabilistic. State transitions are typically probabilistic. Example policy: leave for the train station at 7:55 AM. With probability 0.99 you catch the 8:10 train; with probability 0.01 you miss it and take the 8:40. [Same diagram.]

  20. Summary so far. We have to learn an optimal policy, but: the state space is generally very large and loopy; rewards are typically probabilistic; state transitions are typically probabilistic; greedy algorithms generally don't work very well; and we have to use trial and error. It's no wonder our lives are a mess!

  21. All algorithms for solving this problem are basically the same: wander around the state space to estimate rewards and transition probabilities; generally make decisions that maximize your best guess of the long-term payoff (exploit); but sometimes make choices that seem suboptimal (explore). [Same loopy diagram with probabilistic rewards and transitions.]
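
As a concrete illustration of the exploit/explore trade-off described above, here is a minimal ε-greedy sketch; this is a standard technique, not something specified on the slides, and the action names and value estimates are made up.

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """Mostly pick the action with the highest estimated long-term payoff
    (exploit), but with probability epsilon pick at random (explore)."""
    if random.random() < epsilon:
        return random.choice(list(value_estimates))        # explore
    return max(value_estimates, key=value_estimates.get)   # exploit

# Hypothetical running estimates of long-term payoff for three actions.
estimates = {"propose": 2.0, "wait": 3.5, "run away": -1.0}
print(epsilon_greedy(estimates))
```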

  22. There's a small but important twist: time matters. Current rewards are more valuable than future rewards: discount! (We'll talk about this first.) Time itself is valuable: maximize reward per unit time. (We'll come back to this.)

  23. Discounting. Which would you choose: 10 now, or 11 tomorrow?

  24. Discounting. Which would you choose: 10 now, or 10.10 tomorrow?

  25. Discounting. Which would you choose: 10 now, or 9 tomorrow?

  26. Discounting. Which would you choose: 10 now, or 11 tomorrow? 10 in one year, or 11 in one year + 1 day?

  27. Exponential discounting: V(t) = r(t) + γ r(t+1) + γ² r(t+2) + ... γ = discount factor, between 0 and 1. γ near one: you think long term; you'll do well in the modern world. γ near zero: you think short term; you'll do poorly in the modern world.
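
A one-function sketch of the discounted sum above (with a made-up reward stream) shows how γ separates long-term from short-term thinking.

```python
def discounted_value(rewards, gamma):
    """V(t) = r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ...
    for a finite list of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [10, 0, 0, 100]                # illustrative reward stream
print(discounted_value(rewards, 0.9))    # 82.9 -- long-term thinker
print(discounted_value(rewards, 0.1))    # 10.1 -- short-term thinker
```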

  28. Exponential discounting: V(t) = r(t) + γ r(t+1) + γ² r(t+2) + ... [Diagram: the loopy example graph.]

  29. Exponential discounting: V(t) = r(t) + γ r(t+1) + γ² r(t+2) + ... [Same diagram, with the first transition's reward labelled r(t).]

  30. Exponential discounting: V(t) = r(t) + γ r(t+1) + γ² r(t+2) + ... Following the branch with r(t) = 5 and then r(t+1) = −1 gives V(t) = 5 + γ·(−1) = 5 − γ. [Same diagram, with the two rewards labelled r(t) and r(t+1).]

  31. Exponential discounting: V(t) = r(t) + γ r(t+1) + γ² r(t+2) + ... The path starting with reward 5 gives V(t) = 5 − γ; the path starting with reward 1 gives V(t) = 1 + 10γ. The first is better when 5 − γ > 1 + 10γ, i.e. when γ < 4/11. [Same diagram.]

  32. Exponential discounting: V(t) = r(t) + γ r(t+1) + γ² r(t+2) + ... V(t+1) = r(t+1) + γ r(t+2) + γ² r(t+3) + ... γ V(t+1) = γ r(t+1) + γ² r(t+2) + γ³ r(t+3) + ...

  33. Exponential discounting: V(t) = r(t) + γ r(t+1) + γ² r(t+2) + ... γ V(t+1) = γ r(t+1) + γ² r(t+2) + γ³ r(t+3) + ... Therefore V(t) = r(t) + γ V(t+1).

  34. Exponential discounting: V(t) = r(t) + γ V(t+1). This is most useful if you know perfectly what V(t+1) is. However, you're generally wrong, at least by a little bit. The reward prediction error tells you how wrong you are: δ(t) = r(t) − [V(t) − γ V(t+1)], where δ(t) is the dopamine signal, r(t) is the actual reward, and V(t) − γ V(t+1) is the predicted reward.
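
The prediction error above is exactly the quantity used in temporal-difference learning. Here is a minimal TD(0)-style sketch; the algorithm is standard rather than taken from the slides, and the learning rate α and the example states are assumptions.

```python
def td_update(V, state, next_state, reward, gamma=0.9, alpha=0.1):
    """One temporal-difference update.
    delta(t) = r(t) - [V(t) - gamma*V(t+1)] is the reward prediction error;
    the value estimate is nudged in the direction of the error."""
    delta = reward - (V[state] - gamma * V[next_state])
    V[state] += alpha * delta
    return delta

V = {"S1": 0.0, "S2": 0.0}                    # initial value guesses
print(td_update(V, "S1", "S2", reward=5.0))   # 5.0: positive surprise ("dopamine")
print(V["S1"])                                # 0.5: estimate moves toward the truth
```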

  35. δ(t) = r(t) − [V(t) − γ V(t+1)]: δ(t) is the dopamine signal, r(t) the actual reward, V(t) − γ V(t+1) the predicted reward. Dopamine (our reason for living) is generated by unpredicted reward!!! That's why you should never take your kid (or significant other) into a store and then buy him/her a present. Or else take them into stores, and most of the time don't buy any presents.

  36. V(t) = r(t) + γ V(t+1), with γ = 0. [Same decision tree.]

  37. With γ = 0, V(t) = r(t): V1* = 5, V3* = 3. [Same decision tree.]

  38. With γ = 0, V(t) = r(t): V1* = 5, V3* = 3. [Same decision tree.]

  39. V(t) = r(t) + γ V(t+1), now with γ = 1 (the γ = 0 values, V1* = 5 and V3* = 3, shown for comparison). [Same decision tree.]

  40. V(t) = r(t) + γ V(t+1). With γ = 0, V1* = 5; with γ = 1, V1* = 11 (and V3* = 3). The short-term approach is easy. But it often doesn't work well! [Same decision tree.]
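
The γ = 0 versus γ = 1 comparison can be written out directly for the example tree, using the same assumed edge layout as in the earlier sketch.

```python
def best_value_from_S1(gamma):
    """Best discounted value starting from S1 on the example tree:
    either take 1 now and 10 next, or take 5 now and 3 next."""
    return max(1 + gamma * 10, 5 + gamma * 3)

print(best_value_from_S1(0.0))   # 5  -- the short-term (greedy) choice looks best
print(best_value_from_S1(1.0))   # 11 -- the long-term choice wins
```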

  41. All algorithms for solving this problem are basically the same: wander around the state space to estimate rewards and transition probabilities; generally make decisions that maximize your best guess of the long-term payoff (exploit); but sometimes make choices that seem suboptimal (explore). [Same loopy diagram.]

  42. There's a small but important twist: time matters. Current rewards are more valuable than future rewards: discount! Time itself is valuable.

  43. So far we have pretended that transitions are instantaneous. In fact, they take time. [Same loopy diagram.]

  44. Average value (without discounting): V(t) = [r(t) + r(t+1) + r(t+2) + ...] / [T(t) + T(t+1) + T(t+2) + ...], where T is the time each transition takes. We can increase the average value of a state by making decisions quickly. However, as we'll see, there's a downside to doing that. In fact, we've already seen the downside in the marriage problem.
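
A sketch of the reward-rate formula above, with made-up rewards and transition times, shows why deciding faster raises the rate; it leaves out the cost of rushing, which the slides return to.

```python
def reward_rate(rewards, durations):
    """Average value without discounting: total reward divided by the
    total time the transitions took, i.e. reward per unit time."""
    return sum(rewards) / sum(durations)

rewards = [1, 10, 2]                                     # illustrative rewards
print(reward_rate(rewards, durations=[5.0, 5.0, 5.0]))   # ~0.87 per unit time
print(reward_rate(rewards, durations=[2.0, 2.0, 2.0]))   # ~2.17 -- faster decisions, higher rate
```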

  45. We're now going to apply these ideas to a simple problem: the famous random-dot kinematogram.

  46. Task: look at the dots and guess which way they're moving.

  47. Experimental setup. Shadlen and Newsome (2001).

  48. Experimental setup: make a saccade as soon as possible; maximize reward per unit time. [Task timeline: dots on vs. time.] Huk & Shadlen, J Neurosci 25:10420-10436 (2005).

  49. This has many of the features of more realistic tasks: 1. Subjects have to integrate noisy evidence over time. - the longer you wait, the more information you have about direction. 2. Subjects have to decide when to make a saccade. - make a saccade too soon and you risk being wrong - wait too long and you throw away valuable time, and that reduces your average reward.
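
The speed-accuracy trade-off in point 2 is often modelled with a bounded evidence accumulator (drift-diffusion style). The sketch below is a generic toy version, not the model used in the cited experiments; all parameter values are made up.

```python
import random

def accumulate_to_bound(drift=0.1, noise=1.0, bound=10.0, max_steps=10_000):
    """Sum noisy samples of motion evidence until a decision bound is hit.
    A higher bound means slower but more accurate decisions; a lower bound
    is faster but riskier.  Returns (choice, decision_time)."""
    evidence = 0.0
    for t in range(1, max_steps + 1):
        evidence += drift + random.gauss(0.0, noise)
        if abs(evidence) >= bound:
            return ("right" if evidence > 0 else "left"), t
    return "undecided", max_steps

print(accumulate_to_bound())   # e.g. ('right', 312)
```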

  50. First we'll talk about experiments. Then we'll relate the experiments to what we've learned so far about reinforcement learning.
