Understanding Infinite Horizon Markov Decision Processes

 
 
 

In the realm of Markov Decision Processes (MDPs), tackling infinite horizon problems involves defining value functions, introducing discount factors, and guaranteeing the existence of optimal policies. Computational challenges like policy evaluation and optimization are addressed through algorithms like value iteration and policy iteration.


Uploaded on Sep 10, 2024



Presentation Transcript


  1. Markov Decision Processes: Infinite Horizon Problems. Alan Fern*. (*Based in part on slides by Craig Boutilier and Daniel Weld.)

  2. What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S, A, R, T). Output: a policy that achieves an optimal value. This depends on how we define the value of a policy. There are several choices, and the solution algorithms depend on the choice. We will consider two common choices: finite-horizon value and infinite-horizon discounted value.

  3. Discounted Infinite Horizon MDPs. Defining value as total reward is problematic with infinite horizons (r1 + r2 + r3 + r4 + ...): many or all policies have infinite expected reward, although some MDPs are OK (e.g., zero-cost absorbing states). "Trick": introduce a discount factor $0 \le \beta < 1$, so that future rewards are discounted by $\beta$ per time step:

     $$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \beta^t R^t \,\middle|\, \pi, s\right]$$

     Note that this value is bounded:

     $$V^\pi(s) \le E\left[\sum_{t=0}^{\infty} \beta^t R^{\max}\right] = \frac{1}{1-\beta} R^{\max}$$

     Motivation: economic? probability of death? convenience?
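
The bounded-value note on this slide can be checked numerically. The sketch below is only an illustration: the function name `discounted_return` and the values `beta = 0.9` and `r_max = 1.0` are assumptions chosen for the example, not part of the slides. It sums a long run of maximal rewards and compares the result with the bound R^max / (1 - beta).

```python
def discounted_return(rewards, beta):
    """Sum of beta**t * r_t over a finite reward sequence r_0, r_1, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (beta ** t) * r
    return total


beta = 0.9    # assumed discount factor, 0 <= beta < 1
r_max = 1.0   # assumed largest possible one-step reward

# Receiving r_max at every step is the best case for any policy.
truncated = discounted_return([r_max] * 500, beta)
bound = r_max / (1.0 - beta)

print(truncated, bound)  # the truncated sum approaches the bound from below
```

With a discount factor closer to 1 the bound grows, which matches the intuition that distant rewards then matter more.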

  4. Notes: Discounted Infinite Horizon. Optimal policies are guaranteed to exist (Howard, 1960), i.e., there is a policy that maximizes value at each state. Furthermore, there is always an optimal stationary policy. Intuition: why would we change the action at s at a new time when there is always forever ahead? We define $V^*(s)$ to be the optimal value function. That is, $V^*(s) = V^\pi(s)$ for some optimal stationary $\pi$.

  5. Computational Problems. Policy Evaluation: given $\pi$ and an MDP, compute $V^\pi$. Policy Optimization: given an MDP, compute an optimal policy $\pi^*$ and $V^*$. We'll cover two algorithms for doing this: value iteration and policy iteration.

  6. Policy Evaluation. Value equation for a fixed policy:

     $$V^\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s')\, V^\pi(s')$$

     The first term is the immediate reward; the second is the discounted expected value of following the policy in the future. The equation can be derived from the original definition of infinite-horizon discounted value.

  7. Policy Evaluation. Value equation for a fixed policy:

     $$V^\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s')\, V^\pi(s')$$

     How can we compute the value function for a fixed policy? We are given $R$, $T$, and $\pi$, and want to find $V^\pi(s)$ for each $s$. This is a linear system with n variables and n constraints. The variables are the values of the states: $V(s_1), \ldots, V(s_n)$. The constraints are one value equation (above) per state. Use linear algebra to solve for V (e.g., matrix inverse).

  8. Policy Evaluation via Matrix Inverse. $V^\pi$ and $R$ are n-dimensional column vectors (one element for each state). $T$ is an n x n matrix such that $T(i, j) = T(s_i, \pi(s_i), s_j)$. Then:

     $$V^\pi = R + \beta T V^\pi$$
     $$(I - \beta T)\, V^\pi = R$$
     $$V^\pi = (I - \beta T)^{-1} R$$
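
As a rough sketch of the linear-algebra route, the code below solves (I - beta T) V = R directly instead of forming the inverse. Everything concrete here is an assumption made for illustration: the two-state, two-action MDP (`R`, `T`), the policy `[1, 1]`, `beta = 0.9`, and the helper names `solve_linear_system` and `evaluate_policy_exact`. It uses plain Gauss-Jordan elimination so that it runs with the standard library only; in practice a linear algebra library would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                for j in range(col, n + 1):
                    m[row][j] -= factor * m[col][j]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate_policy_exact(R, T, policy, beta):
    """Solve (I - beta * T_pi) V = R, where T_pi[i][j] = T[i][policy[i]][j]."""
    n = len(R)
    a = [[(1.0 if i == j else 0.0) - beta * T[i][policy[i]][j]
          for j in range(n)] for i in range(n)]
    return solve_linear_system(a, R)


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
R = [0.0, 1.0]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
beta = 0.9

V_pi = evaluate_policy_exact(R, T, policy=[1, 1], beta=beta)
print(V_pi)
```

Solving the system rather than inverting the matrix gives the same answer and is the usual numerical choice.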

  9. Computing an Optimal Value Function. Bellman equation for the optimal value function:

     $$V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s')\, V^*(s')$$

     The first term is the immediate reward; the second is the discounted expected value of the best action, assuming we get optimal value in the future. Bellman proved this is always true for an optimal value function.

  10. Computing an Optimal Value Function. Bellman equation for the optimal value function:

     $$V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s')\, V^*(s')$$

     How can we solve this equation for V*? The max operator makes the system non-linear, so the problem is more difficult than policy evaluation. Idea: let's pretend that we have a finite, but very, very long, horizon and apply finite-horizon value iteration. Adjust the Bellman backup to take discounting into account.

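  11. Bellman Backups (Revisited).

     $$V^{k+1}(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s')\, V^k(s')$$

     [Figure: backup diagram for state s. Action a1 leads to s1 with probability 0.7 and to s2 with probability 0.3; action a2 leads to s3 with probability 0.4 and to s4 with probability 0.6. Expectations are computed over $V^k$ at the successor states, then the max over actions gives $V^{k+1}(s)$.]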
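
To make the two stages of the backup concrete (compute expectations, then compute the max), here is a small sketch for a single state. The transition probabilities follow the diagram on this slide; the successor values `V_k`, the reward `1.0`, `beta = 0.9`, and the function name `bellman_backup` are assumptions invented for the example.

```python
def bellman_backup(reward, outcomes_by_action, V, beta):
    """Return (backed-up value, per-action expectations) for one state.

    outcomes_by_action maps each action to (probability, next_state) pairs.
    """
    expectations = {
        action: sum(prob * V[next_state] for prob, next_state in outcomes)
        for action, outcomes in outcomes_by_action.items()
    }
    return reward + beta * max(expectations.values()), expectations


# Probabilities as in the slide's diagram; everything else is made up.
outcomes_by_action = {
    "a1": [(0.7, "s1"), (0.3, "s2")],
    "a2": [(0.4, "s3"), (0.6, "s4")],
}
V_k = {"s1": 0.0, "s2": 10.0, "s3": 4.0, "s4": 5.0}

value, expectations = bellman_backup(1.0, outcomes_by_action, V_k, beta=0.9)
print(expectations)  # a1 is about 3.0, a2 is about 4.6
print(value)         # about 1.0 + 0.9 * 4.6 = 5.14
```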

  12. Value Iteration. We can compute an optimal policy using value iteration based on Bellman backups, just like finite-horizon problems (but including the discount term):

     $$V^0(s) = 0 \quad \text{(could also initialize to } R(s)\text{)}$$
     $$V^k(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s')\, V^{k-1}(s')$$

     Will it converge to the optimal value function as k gets large? Yes: $\lim_{k \to \infty} V^k = V^*$. Why?

  13. Convergence of Value Iteration. Bellman Backup Operator: define B to be an operator that takes a value function V as input and returns a new value function after a Bellman backup:

     $$B[V](s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s')\, V(s')$$

     Value iteration is just the iterative application of B:

     $$V^0 = 0, \qquad V^k = B[V^{k-1}]$$
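
Viewing value iteration as repeated application of B translates almost directly into code. The sketch below is an illustration under assumptions: the two-state, two-action MDP (`R`, `T`), `beta = 0.9`, the fixed count of 200 iterations, and the name `backup_operator` are all invented for the example.

```python
def backup_operator(V, R, T, beta):
    """B[V](s) = R(s) + beta * max_a sum_s2 T(s, a, s2) * V(s2)."""
    n = len(R)
    return [
        R[s] + beta * max(
            sum(T[s][a][s2] * V[s2] for s2 in range(n))
            for a in range(len(T[s]))
        )
        for s in range(n)
    ]


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
R = [0.0, 1.0]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
beta = 0.9

V = [0.0, 0.0]              # V^0 = 0
for k in range(200):        # V^k = B[V^(k-1)]
    V = backup_operator(V, R, T, beta)

print(V)  # close to the fixed point of B for this MDP
```

A fixed iteration count is used here only to keep the sketch short; a principled stopping rule is a separate question.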

  14. Convergence: Fixed Point Property. Bellman equation for the optimal value function:

     $$V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s')\, V^*(s')$$

     Fixed Point Property: the optimal value function is a fixed point of the Bellman backup operator B. That is, $B[V^*] = V^*$, where

     $$B[V](s) = R(s) + \beta \max_a \sum_{s'} T(s, a, s')\, V(s')$$

  15. Convergence: Contraction Property. Let ||V|| denote the max-norm of V, which returns the maximum absolute value of the vector. E.g., ||(0.1, -100, 5, 6)|| = 100. B is a contraction operator with respect to the max-norm: for any V and V',

     $$\| B[V] - B[V'] \| \le \beta\, \| V - V' \|$$

     You will prove this. That is, applying B to any two value functions causes them to get closer together in the max-norm sense!
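
A numerical spot check is no substitute for the proof, but it can make the contraction property tangible. The sketch below draws random pairs of value functions and checks the inequality for each pair. The MDP (`R`, `T`), `beta = 0.9`, the sampling range, and the helper names `backup_operator` and `max_norm_distance` are assumptions made for this example.

```python
import random


def backup_operator(V, R, T, beta):
    """B[V](s) = R(s) + beta * max_a sum_s2 T(s, a, s2) * V(s2)."""
    n = len(R)
    return [
        R[s] + beta * max(
            sum(T[s][a][s2] * V[s2] for s2 in range(n))
            for a in range(len(T[s]))
        )
        for s in range(n)
    ]


def max_norm_distance(u, v):
    """Max-norm of the difference u - v."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
R = [0.0, 1.0]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
beta = 0.9

random.seed(0)
all_hold = True
for _ in range(1000):
    V1 = [random.uniform(-100.0, 100.0) for _ in R]
    V2 = [random.uniform(-100.0, 100.0) for _ in R]
    lhs = max_norm_distance(backup_operator(V1, R, T, beta),
                            backup_operator(V2, R, T, beta))
    rhs = beta * max_norm_distance(V1, V2)
    # Small slack for floating-point rounding.
    all_hold = all_hold and (lhs <= rhs + 1e-9)

print(all_hold)
```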

  16. Convergence. Using the properties of B we can prove convergence of value iteration. Proof:

     1. For any V: $\|V^* - B[V]\| = \|B[V^*] - B[V]\| \le \beta\, \|V^* - V\|$.
     2. So applying a Bellman backup to any value function V brings us closer to V* by a constant factor $\beta$: $\|V^* - V^{k+1}\| = \|V^* - B[V^k]\| \le \beta\, \|V^* - V^k\|$.
     3. This means that $\|V^* - V^k\| \le \beta^k\, \|V^* - V^0\|$.
     4. Thus $\lim_{k \to \infty} \|V^* - V^k\| = 0$.
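
Step 3 of the proof also gives a rough iteration count: to push the error bound below a target, it is enough that beta^k times the initial error is at most that target. The sketch below solves this for k. The function name `iterations_needed` and the example numbers (discount 0.9, initial error 10, target 0.01) are assumptions chosen for illustration.

```python
import math


def iterations_needed(beta, initial_error, target_error):
    """Smallest k with beta**k * initial_error <= target_error (0 < beta < 1)."""
    if initial_error <= target_error:
        return 0
    return math.ceil(math.log(target_error / initial_error) / math.log(beta))


k = iterations_needed(0.9, 10.0, 0.01)
print(k)  # 66 for these example numbers
```

This is a worst-case count from the bound; value iteration may get close to V* sooner on a given problem.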

  17. Value Iteration: Stopping Condition. We want to stop when we can guarantee the value function is near optimal. Key property (not hard to prove): if $\|V^k - V^{k-1}\| \le \varepsilon$ then $\|V^k - V^*\| \le \varepsilon \beta / (1 - \beta)$. Continue iterating until $\|V^k - V^{k-1}\| \le \varepsilon$, selecting $\varepsilon$ small enough for the desired error guarantee.
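
The stopping rule can be sketched as a loop that tracks the max-norm change between successive iterates. As before, the MDP (`R`, `T`), `beta = 0.9`, `epsilon = 1e-3`, and the function name `value_iteration` are assumptions for illustration only.

```python
def value_iteration(R, T, beta, epsilon):
    """Iterate Bellman backups until successive iterates differ by <= epsilon."""
    n = len(R)
    V = [0.0] * n
    iterations = 0
    while True:
        V_new = [
            R[s] + beta * max(
                sum(T[s][a][s2] * V[s2] for s2 in range(n))
                for a in range(len(T[s]))
            )
            for s in range(n)
        ]
        change = max(abs(a - b) for a, b in zip(V_new, V))
        V = V_new
        iterations += 1
        if change <= epsilon:
            return V, iterations


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
R = [0.0, 1.0]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
beta = 0.9
epsilon = 1e-3

V, iterations = value_iteration(R, T, beta, epsilon)
guarantee = epsilon * beta / (1.0 - beta)  # bound on ||V - V*|| at termination
print(V, iterations, guarantee)
```

For this example the guarantee works out to 0.009, so the returned values are within that distance of V* in every state.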

  18. How to Act. Given a $V^k$ from value iteration that closely approximates V*, what should we use as our policy? Use the greedy policy (one-step lookahead):

     $$\mathrm{greedy}[V^k](s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^k(s')$$

     Note that the value of the greedy policy may not be exactly equal to $V^k$. Why?
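
A one-step lookahead is short to write down. The sketch below picks, in each state, the action with the largest expected next-state value. The MDP transition table `T`, the two example value functions, and the name `greedy_policy` are assumptions made for illustration; ties are broken in favour of the lowest action index, which is a choice of this sketch.

```python
def greedy_policy(V, T):
    """For each state, pick argmax_a sum_s2 T(s, a, s2) * V(s2)."""
    n = len(V)
    policy = []
    for s in range(n):
        q = [sum(T[s][a][s2] * V[s2] for s2 in range(n))
             for a in range(len(T[s]))]
        policy.append(max(range(len(q)), key=lambda a: q[a]))
    return policy


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]

print(greedy_policy([7.9, 9.0], T))  # state 1 looks better, so action 1 in both states
print(greedy_policy([1.0, 0.0], T))  # state 0 looks better, so action 0 in both states
```

The second call shows that the greedy policy is only as good as the value function it looks ahead into.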

  19. How to Act. Use the greedy policy (one-step lookahead):

     $$\mathrm{greedy}[V^k](s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^k(s')$$

     We care about the value of the greedy policy, which we denote by $V_g$. This is how good the greedy policy will be in practice. How close is $V_g$ to V*?

  20. Value of Greedy Policy.

     $$\mathrm{greedy}[V^k](s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^k(s')$$

     Define $V_g$ to be the value of this greedy policy. This is likely not the same as $V^k$. Property: if $\|V^k - V^*\| \le \lambda$ then $\|V_g - V^*\| \le 2 \lambda \beta / (1 - \beta)$. Thus, $V_g$ is not too far from optimal if $V^k$ is close to optimal. Our previous stopping condition allows us to bound $\lambda$ based on $\|V^{k+1} - V^k\|$. Set the stopping condition so that $\|V_g - V^*\| \le \Delta$. How?

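  21. Goal: $\|V_g - V^*\| \le \Delta$. Property: if $\|V^k - V^*\| \le \lambda$ then $\|V_g - V^*\| \le 2 \lambda \beta / (1 - \beta)$. Property: if $\|V^k - V^{k-1}\| \le \varepsilon$ then $\|V^k - V^*\| \le \varepsilon \beta / (1 - \beta)$. Answer: if $\|V^k - V^{k-1}\| \le (1 - \beta)^2 \Delta / (2 \beta^2)$ then $\|V_g - V^*\| \le \Delta$.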
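
The answer on this slide follows by chaining the two properties. The derivation below is a worked sketch of that step; the closing numerical example (discount 0.9, target 0.1) is an assumption added purely for illustration.

```latex
\begin{align*}
\|V^k - V^{k-1}\| \le \varepsilon
  &\;\Rightarrow\; \|V^k - V^*\| \le \lambda := \frac{\varepsilon \beta}{1 - \beta} \\
  &\;\Rightarrow\; \|V_g - V^*\| \le \frac{2 \lambda \beta}{1 - \beta}
     = \frac{2 \varepsilon \beta^2}{(1 - \beta)^2} \\[4pt]
\frac{2 \varepsilon \beta^2}{(1 - \beta)^2} \le \Delta
  &\;\Longleftrightarrow\; \varepsilon \le \frac{(1 - \beta)^2 \Delta}{2 \beta^2} \\[4pt]
\text{Example: } \beta = 0.9,\ \Delta = 0.1
  &\;\Rightarrow\; \varepsilon \le \frac{0.01 \times 0.1}{2 \times 0.81} \approx 6.2 \times 10^{-4}
\end{align*}
```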

  22. Policy Evaluation Revisited. Sometimes policy evaluation is expensive due to matrix operations. Can we have an iterative algorithm like value iteration for policy evaluation? Idea: given a policy $\pi$ and MDP M, create a new MDP M[$\pi$] that is identical to M, except that in each state s we only allow a single action $\pi(s)$. What is the optimal value function V* for M[$\pi$]? Since the only valid policy for M[$\pi$] is $\pi$, $V^* = V^\pi$.

  23. Policy Evaluation Revisited. Running VI on M[$\pi$] will converge to $V^* = V^\pi$. What does the Bellman backup look like here? The Bellman backup now only considers one action in each state, so there is no max. We are effectively applying a backup restricted by $\pi$. Restricted Bellman Backup:

     $$B_\pi[V](s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s')\, V(s')$$

  24. Iterative Policy Evaluation. Running VI on M[$\pi$] is equivalent to iteratively applying the restricted Bellman backup. Iterative Policy Evaluation:

     $$V^0 = 0, \qquad V^{k+1} = B_\pi[V^k]$$

     Convergence: $\lim_{k \to \infty} V^k = V^\pi$. The iterates often become close to $V^\pi$ for small k.
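
Iterative policy evaluation is value iteration with the max removed. The sketch below applies the restricted backup repeatedly for a fixed policy. The MDP (`R`, `T`), the policy `[0, 0]`, `beta = 0.9`, the count of 300 iterations, and the name `restricted_backup` are assumptions for illustration.

```python
def restricted_backup(V, R, T, policy, beta):
    """B_pi[V](s) = R(s) + beta * sum_s2 T(s, pi(s), s2) * V(s2)."""
    n = len(R)
    return [
        R[s] + beta * sum(T[s][policy[s]][s2] * V[s2] for s2 in range(n))
        for s in range(n)
    ]


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
R = [0.0, 1.0]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
beta = 0.9
policy = [0, 0]             # always take action 0

V = [0.0, 0.0]              # V^0 = 0
for k in range(300):        # V^(k+1) = B_pi[V^k]
    V = restricted_backup(V, R, T, policy, beta)

print(V)  # approximates the value of the fixed policy
```

Each sweep costs one pass over states and successors, with no linear system to solve.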

  25. Optimization via Policy Iteration. Policy iteration uses policy evaluation as a subroutine for optimization. It iterates steps of policy evaluation and policy improvement:

     1. Choose a random policy $\pi$
     2. Loop:
        (a) Evaluate $V^\pi$
        (b) $\pi'$ = ImprovePolicy($V^\pi$)
        (c) Replace $\pi$ with $\pi'$
        Until no improving action is possible at any state

     ImprovePolicy: given $V^\pi$, returns a strictly better policy if $\pi$ isn't optimal.

  26. Policy Improvement. Given $V^\pi$, how can we compute a policy $\pi'$ that is strictly better than a sub-optimal $\pi$? Idea: given a state s, take the action that looks the best assuming that we follow policy $\pi$ thereafter. That is, assume the next state s' has value $V^\pi(s')$. For each s in S, set

     $$\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^\pi(s')$$

     Proposition: $V^{\pi'} \ge V^\pi$, with strict inequality for sub-optimal $\pi$.

  27. For any two value functions $V_1$ and $V_2$, we write $V_1 \ge V_2$ to indicate that for all states s, $V_1(s) \ge V_2(s)$.

     $$\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^\pi(s')$$

     Proposition: $V^{\pi'} \ge V^\pi$, with strict inequality for sub-optimal $\pi$. Useful properties for the proof:

     1) $V^\pi = B_\pi[V^\pi]$
     2) $B[V^\pi] = B_{\pi'}[V^\pi]$ (by the definition of $\pi'$)
     3) For any $V_1$, $V_2$ and $\pi$, if $V_1 \ge V_2$ then $B_\pi[V_1] \ge B_\pi[V_2]$

  28. With $\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^\pi(s')$. Proposition: $V^{\pi'} \ge V^\pi$, with strict inequality for sub-optimal $\pi$. Proof (first part, non-strict inequality): We know that $V^\pi = B_\pi[V^\pi] \le B[V^\pi] = B_{\pi'}[V^\pi]$. So we have that $V^\pi \le B_{\pi'}[V^\pi]$. Now by monotonicity we get $B_{\pi'}[V^\pi] \le B_{\pi'}^2[V^\pi]$, where $B_{\pi'}^k$ denotes k applications of $B_{\pi'}$. We can continue and derive that in general, for any k, $B_{\pi'}^k[V^\pi] \le B_{\pi'}^{k+1}[V^\pi]$, which also implies that $V^\pi \le B_{\pi'}^k[V^\pi]$ for any k. Thus

     $$V^\pi \le \lim_{k \to \infty} B_{\pi'}^k[V^\pi] = V^{\pi'}$$

  29. With $\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^\pi(s')$. Proposition: $V^{\pi'} \ge V^\pi$, with strict inequality for sub-optimal $\pi$. Proof (part two, strict inequality): We want to show that if $\pi$ is sub-optimal then $V^{\pi'} > V^\pi$. We prove the contrapositive: if it is not the case that $V^{\pi'} > V^\pi$, then $\pi$ is optimal. Since we already showed that $V^{\pi'} \ge V^\pi$, the condition of the contrapositive is equivalent to $V^{\pi'} = V^\pi$. Now assume that $V^{\pi'} = V^\pi$. Combining this with $V^{\pi'} = B_{\pi'}[V^{\pi'}]$ yields $V^\pi = B_{\pi'}[V^\pi] = B[V^\pi]$. Thus $V^\pi$ satisfies the Bellman equation and must be optimal.

  30. Optimization via Policy Iteration.

     1. Choose a random policy $\pi$
     2. Loop:
        (a) Evaluate $V^\pi$
        (b) For each s in S, set $\pi'(s) = \arg\max_a \sum_{s'} T(s, a, s')\, V^\pi(s')$
        (c) Replace $\pi$ with $\pi'$
        Until no improving action is possible at any state

     Proposition: $V^{\pi'} \ge V^\pi$, with strict inequality for sub-optimal $\pi$. Policy iteration goes through a sequence of improving policies.
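
The loop above can be sketched in a few lines. Several things in this sketch are assumptions rather than part of the slides: the MDP (`R`, `T`), `beta = 0.9`, the fixed starting policy `[0, 0]` (instead of a random one, for reproducibility), the function names, and in particular the evaluation step, which here runs restricted backups to a tight tolerance in place of an exact linear solve. An action only replaces the current one when it is better by a small margin, so the loop stops once no improving action remains.

```python
def evaluate(policy, R, T, beta, tol=1e-12):
    """Approximate V^pi by repeated restricted backups (stand-in for an exact solve)."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[s] + beta * sum(T[s][policy[s]][s2] * V[s2] for s2 in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) <= tol:
            return V_new
        V = V_new


def policy_iteration(R, T, beta, policy):
    """Alternate evaluation and greedy improvement until no action improves."""
    n = len(R)
    while True:
        V = evaluate(policy, R, T, beta)
        new_policy = list(policy)
        improved = False
        for s in range(n):
            q = [sum(T[s][a][s2] * V[s2] for s2 in range(n))
                 for a in range(len(T[s]))]
            best = max(range(len(q)), key=lambda a: q[a])
            if q[best] > q[policy[s]] + 1e-9:   # strictly better action found
                new_policy[s] = best
                improved = True
        if not improved:
            return policy, V
        policy = new_policy


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
R = [0.0, 1.0]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]

best_policy, best_V = policy_iteration(R, T, beta=0.9, policy=[0, 0])
print(best_policy, best_V)
```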

  31. Policy Iteration: Convergence. Convergence is assured in a finite number of iterations: since there is a finite number of policies and each step improves value, the algorithm must converge to an optimal policy. It gives the exact value of the optimal policy.

  32. Policy Iteration Complexity. Each iteration runs in polynomial time in the number of states and actions. There are at most $|A|^n$ policies and PI never repeats a policy, so there are at most an exponential number of iterations. This is not a very good complexity bound. Empirically, O(n) iterations are required; often it seems like O(1). Challenge: try to generate an MDP that requires more than n iterations. There are recent polynomial bounds.

  33. Value Iteration vs. Policy Iteration. Which is faster, VI or PI? It depends on the problem. VI takes more iterations than PI, but PI requires more time on each iteration: PI must perform policy evaluation on each iteration, which involves solving a linear system. VI is easier to implement since it does not require the policy evaluation step (but see next slide). We will see that both algorithms will serve as inspiration for more advanced algorithms.

  34. Modified Policy Iteration. Modified Policy Iteration replaces the exact policy evaluation step with inexact iterative evaluation, using a small number of restricted Bellman backups for evaluation. It avoids the expensive policy evaluation step and is perhaps easier to implement. It is often faster than PI and VI. It is still guaranteed to converge under mild assumptions on starting points.

  35. Modified Policy Iteration.

     1. Choose an initial value function V
     2. Loop:
        (a) For each s in S, set $\pi(s) = \arg\max_a \sum_{s'} T(s, a, s')\, V(s')$ (the improvement step, as in policy iteration)
        (b) Partial policy evaluation (approximate evaluation). Repeat K times: $V \leftarrow B_\pi[V]$
        Until the change in V is minimal
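
A minimal sketch of this loop follows. The MDP (`R`, `T`), `beta = 0.9`, `K = 5`, the tolerance, the zero initial value function, and the name `modified_policy_iteration` are assumptions for illustration; "change in V is minimal" is interpreted here as the max-norm change over one pass of the outer loop falling below `tol`.

```python
def modified_policy_iteration(R, T, beta, K, tol):
    """Greedy improvement followed by K restricted backups, repeated (K >= 1)."""
    n = len(R)
    V = [0.0] * n   # initial value function (an assumption of this sketch)
    while True:
        V_old = V
        # (a) Greedy policy with respect to the current V.
        policy = []
        for s in range(n):
            q = [sum(T[s][a][s2] * V[s2] for s2 in range(n))
                 for a in range(len(T[s]))]
            policy.append(max(range(len(q)), key=lambda a: q[a]))
        # (b) Partial policy evaluation: K restricted backups V <- B_pi[V].
        for _ in range(K):
            V = [R[s] + beta * sum(T[s][policy[s]][s2] * V[s2] for s2 in range(n))
                 for s in range(n)]
        # Stop once a full pass barely changes V.
        if max(abs(a - b) for a, b in zip(V, V_old)) <= tol:
            return policy, V


# Hypothetical MDP: T[s][a][s2] is the probability of moving from s to s2 under a.
R = [0.0, 1.0]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]

mpi_policy, mpi_V = modified_policy_iteration(R, T, beta=0.9, K=5, tol=1e-10)
print(mpi_policy, mpi_V)
```

In this sketch, `K = 1` makes each pass a single backup under the greedy policy, which matches a value iteration step, while large `K` approaches the full evaluation used by policy iteration.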

  36. Recap: things you should know. What is an MDP? What is a policy (stationary and non-stationary)? What is a value function (finite-horizon and infinite-horizon)? How to evaluate policies (finite-horizon and infinite-horizon; time/space complexity)? How to optimize policies (finite-horizon and infinite-horizon; time/space complexity; why are they correct)?
