
Solutions Vol. II, Chapter 1

1.5
(a) We have
\[
\sum_{j=1}^n \tilde p_{ij}(u) = \sum_{j=1}^n \frac{p_{ij}(u) - m_j}{1 - \sum_{k=1}^n m_k} = \frac{\sum_{j=1}^n p_{ij}(u) - \sum_{j=1}^n m_j}{1 - \sum_{k=1}^n m_k} = 1.
\]
Therefore, the $\tilde p_{ij}(u)$ are transition probabilities.
(b) We have for the modified problem
\[
\tilde J(i) = \min_{u\in U(i)}\Bigl[g(i,u) + \alpha\Bigl(1-\sum_{j=1}^n m_j\Bigr)\sum_{j=1}^n \frac{p_{ij}(u)-m_j}{1-\sum_{k=1}^n m_k}\,\tilde J(j)\Bigr]
= \min_{u\in U(i)}\Bigl[g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\tilde J(j) - \alpha\sum_{k=1}^n m_k\tilde J(k)\Bigr].
\]
So, adding $\alpha\sum_{k=1}^n m_k\tilde J(k)/(1-\alpha)$ to both sides,
\[
\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}
= \min_{u\in U(i)}\Bigl[g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\tilde J(j) - \alpha\sum_{k=1}^n m_k\Bigl(1-\frac{1}{1-\alpha}\Bigr)\tilde J(k)\Bigr],
\]
and since $-\bigl(1-\frac{1}{1-\alpha}\bigr) = \frac{\alpha}{1-\alpha}$ and $\sum_{j=1}^n p_{ij}(u) = 1$,
\[
\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}
= \min_{u\in U(i)}\Bigl[g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\Bigl(\tilde J(j) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}\Bigr)\Bigr].
\]
Thus $\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}$ satisfies Bellman's equation for the original problem, so
\[
\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha} = J^*(i), \qquad \forall\, i. \qquad \text{Q.E.D.}
\]
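A quick numerical check of both parts is straightforward. The sketch below uses hypothetical data (the matrix $P$, costs $g$, vector $m$, and discount $\alpha$ are made up, with a single control per state so that the Bellman equations are linear); it verifies that the modified probabilities sum to one and that $\tilde J(i) + \alpha\sum_k m_k\tilde J(k)/(1-\alpha) = J^*(i)$:

```python
import numpy as np

np.random.seed(0)
n, alpha = 4, 0.9
# Hypothetical data: uniform transition matrix, random costs, and a
# vector m with m_j < p_ij so the modified probabilities stay nonnegative.
P = np.full((n, n), 1.0 / n)
g = np.random.rand(n)
m = np.full(n, 0.1)

P_mod = (P - m) / (1.0 - m.sum())        # modified transition probabilities
alpha_mod = alpha * (1.0 - m.sum())      # modified discount factor
assert np.allclose(P_mod.sum(axis=1), 1.0)   # part (a)

def evaluate(P, a):
    # Solve J = g + a P J (one control per state, so this is Bellman's eq.).
    return np.linalg.solve(np.eye(n) - a * P, g)

J_star = evaluate(P, alpha)
J_tilde = evaluate(P_mod, alpha_mod)
shift = alpha * (m @ J_tilde) / (1.0 - alpha)
assert np.allclose(J_tilde + shift, J_star)  # part (b)
print("relation verified")
```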
1.7
We show that for any bounded function $J : S \to \Re$, we have
\[
J \le T(J) \;\Rightarrow\; T(J) \le F(J), \tag{1}
\]
\[
J \ge T(J) \;\Rightarrow\; T(J) \ge F(J). \tag{2}
\]
For any $\mu$, define
\[
F_\mu(J)(i) = \frac{g(i,\mu(i)) + \alpha\sum_{j\ne i}p_{ij}(\mu(i))J(j)}{1-\alpha p_{ii}(\mu(i))}
\]
and note that
\[
F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i))J(i)}{1-\alpha p_{ii}(\mu(i))}. \tag{3}
\]
Fix $\epsilon > 0$. If $J \le T(J)$, let $\mu$ be such that $F_\mu(J) \le F(J) + \epsilon e$. Then, using Eq. (3),
\[
F(J)(i) + \epsilon \ge F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i))J(i)}{1-\alpha p_{ii}(\mu(i))} \ge \frac{T(J)(i) - \alpha p_{ii}(\mu(i))T(J)(i)}{1-\alpha p_{ii}(\mu(i))} = T(J)(i).
\]
Since $\epsilon > 0$ is arbitrary, we obtain $F(J)(i) \ge T(J)(i)$. Similarly, if $J \ge T(J)$, let $\mu$ be such that $T_\mu(J) \le T(J) + \epsilon e$. Then, using Eq. (3),
\[
F(J)(i) \le F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i))J(i)}{1-\alpha p_{ii}(\mu(i))} \le \frac{T(J)(i) + \epsilon - \alpha p_{ii}(\mu(i))T(J)(i)}{1-\alpha p_{ii}(\mu(i))} \le T(J)(i) + \frac{\epsilon}{1-\alpha}.
\]
Since $\epsilon > 0$ is arbitrary, we obtain $F(J)(i) \le T(J)(i)$.
From (1) and (2) we see that $F$ and $T$ have the same fixed points, so $J^*$ is the unique fixed point of $F$. Using the definition of $F$, it can be seen that for any scalar $r > 0$ we have
\[
F(J + re) \le F(J) + \alpha re, \qquad F(J) - \alpha re \le F(J - re). \tag{4}
\]
Furthermore, $F$ is monotone, that is,
\[
J \le J' \;\Rightarrow\; F(J) \le F(J'). \tag{5}
\]
For any bounded function $J$, let $r > 0$ be such that
\[
J - re \le J^* \le J + re.
\]
Applying $F$ repeatedly to this relation and using Eqs. (4) and (5), we obtain
\[
F^k(J) - \alpha^k re \le J^* \le F^k(J) + \alpha^k re.
\]
Therefore $F^k(J)$ converges to $J^*$. From Eqs. (1), (2), and (5) we see that
\[
J \le T(J) \;\Rightarrow\; T^k(J) \le F^k(J) \le J^*,
\]
\[
J \ge T(J) \;\Rightarrow\; T^k(J) \ge F^k(J) \ge J^*.
\]
These relations demonstrate the faster convergence of $F$ over $T$.
As a final result (not explicitly required in the problem statement), we show that for any two bounded functions $J : S \to \Re$, $J' : S \to \Re$, we have
\[
\max_j |F(J)(j) - F(J')(j)| \le \alpha\max_j |J(j) - J'(j)|, \tag{6}
\]
so $F$ is a contraction mapping with modulus $\alpha$. Indeed, we have
\[
F(J)(i) = \min_{u\in U(i)}\Bigl[\frac{g(i,u) + \alpha\sum_{j\ne i}p_{ij}(u)J(j)}{1-\alpha p_{ii}(u)}\Bigr]
= \min_{u\in U(i)}\Bigl[\frac{g(i,u) + \alpha\sum_{j\ne i}p_{ij}(u)J'(j)}{1-\alpha p_{ii}(u)} + \frac{\alpha\sum_{j\ne i}p_{ij}(u)\bigl(J(j)-J'(j)\bigr)}{1-\alpha p_{ii}(u)}\Bigr]
\le F(J')(i) + \alpha\max_j |J(j)-J'(j)|, \qquad \forall\, i,
\]
where we have used the fact
\[
\alpha\sum_{j\ne i}p_{ij}(u) = \alpha\bigl(1 - p_{ii}(u)\bigr) \le \alpha\bigl(1 - \alpha p_{ii}(u)\bigr).
\]
Thus, we have
\[
F(J)(i) - F(J')(i) \le \alpha\max_j |J(j)-J'(j)|, \qquad \forall\, i.
\]
The roles of $J$ and $J'$ may be reversed, so we can also obtain
\[
F(J')(i) - F(J)(i) \le \alpha\max_j |J(j)-J'(j)|, \qquad \forall\, i.
\]
Combining the last two inequalities, we see that
\[
|F(J)(i) - F(J')(i)| \le \alpha\max_j |J(j)-J'(j)|, \qquad \forall\, i.
\]
By taking the maximum over $i$, Eq. (6) follows.
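The faster convergence of $F$ over $T$ is easy to observe numerically. The following sketch uses hypothetical single-control data (all of it made up), so the minimization is trivial, and iterates both mappings from $J = 0$, which satisfies $J \le T(J)$ since $g \ge 0$:

```python
import numpy as np

np.random.seed(1)
n, alpha = 5, 0.95
P = np.random.rand(n, n); P /= P.sum(axis=1, keepdims=True)
g = np.random.rand(n)

def T(J):
    return g + alpha * P @ J

def F(J):
    # Jacobi-like variant: the self-transition term is solved for exactly,
    # as in the definition of F above.
    return (g + alpha * (P @ J - np.diag(P) * J)) / (1 - alpha * np.diag(P))

J_star = np.linalg.solve(np.eye(n) - alpha * P, g)
JT = JF = np.zeros(n)
for k in range(200):
    JT, JF = T(JT), F(JF)
print(np.max(np.abs(JT - J_star)), np.max(np.abs(JF - J_star)))
# The F-iterates are closer to J* at every k, consistent with
# T^k(J) <= F^k(J) <= J* shown above.
```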
1.9
(a) Since $J, J' \in B(S)$, i.e., they are real-valued, bounded functions on $S$, we know that the infimum and the supremum of their difference are finite. We shall denote
\[
m = \min_{x\in S}\bigl[J(x) - J'(x)\bigr]
\qquad\text{and}\qquad
M = \max_{x\in S}\bigl[J(x) - J'(x)\bigr].
\]
Thus
\[
m \le J(x) - J'(x) \le M, \qquad \forall\, x\in S,
\]
or
\[
J'(x) + m \le J(x) \le J'(x) + M, \qquad \forall\, x\in S.
\]
Now we apply the mapping $T$ to the above inequalities. By property (1) we know that $T$ preserves the inequalities. Thus
\[
T(J' + me)(x) \le T(J)(x) \le T(J' + Me)(x), \qquad \forall\, x\in S.
\]
By property (2) we know that
\[
T(J)(x) + \min[a_1 r, a_2 r] \le T(J + re)(x) \le T(J)(x) + \max[a_1 r, a_2 r].
\]
If we replace $r$ by $m$ or $M$, we get the inequalities
\[
T(J')(x) + \min[a_1 m, a_2 m] \le T(J' + me)(x) \le T(J')(x) + \max[a_1 m, a_2 m]
\]
and
\[
T(J')(x) + \min[a_1 M, a_2 M] \le T(J' + Me)(x) \le T(J')(x) + \max[a_1 M, a_2 M].
\]
Thus
\[
T(J')(x) + \min[a_1 m, a_2 m] \le T(J)(x) \le T(J')(x) + \max[a_1 M, a_2 M],
\]
so that
\[
|T(J)(x) - T(J')(x)| \le \max\bigl[a_1|M|,\, a_2|M|,\, a_1|m|,\, a_2|m|\bigr].
\]
We also have
\[
\max\bigl[a_1|M|,\, a_2|M|,\, a_1|m|,\, a_2|m|\bigr] \le a_2\max\bigl[|M|, |m|\bigr] \le a_2\sup_{x\in S}|J(x) - J'(x)|.
\]
Thus
\[
|T(J)(x) - T(J')(x)| \le a_2\max_{x\in S}|J(x) - J'(x)|,
\]
from which
\[
\max_{x\in S}|T(J)(x) - T(J')(x)| \le a_2\max_{x\in S}|J(x) - J'(x)|.
\]
Thus $T$ is a contraction mapping, since we know by the statement of the problem that $0 \le a_1 \le a_2 < 1$. Since the set $B(S)$ of bounded real-valued functions is a complete linear space, we conclude that the contraction mapping $T$ has a unique fixed point $J^*$, and $\lim_{k\to\infty} T^k(J)(x) = J^*(x)$.
(b) We shall first prove the lower bounds on $J^*(x)$. The upper bounds follow by a similar argument. Since $J, T(J) \in B(S)$, there exists a scalar $c$ $(|c| < \infty)$ such that
\[
J(x) + c \le T(J)(x). \tag{1}
\]
We apply $T$ to both sides of (1), and since $T$ preserves inequalities (by assumption (1)), we have, by applying the relation of assumption (2),
\[
J(x) + \min[c + a_1 c,\, c + a_2 c] \le T(J)(x) + \min[a_1 c, a_2 c] \le T(J + ce)(x) \le T^2(J)(x). \tag{2}
\]
Similarly, if we apply $T$ again we get
\[
J(x) + \min_{i\in\{1,2\}}\bigl[c + a_i c + a_i^2 c\bigr] \le T(J)(x) + \min[a_1 c + a_1^2 c,\, a_2 c + a_2^2 c]
\le T^2(J)(x) + \min[a_1^2 c, a_2^2 c] \le T\bigl(T(J) + \min[a_1 c, a_2 c]e\bigr)(x) \le T^3(J)(x).
\]
Thus by induction we conclude
\[
J(x) + \min\Bigl[\sum_{m=0}^k a_1^m c,\, \sum_{m=0}^k a_2^m c\Bigr] \le T(J)(x) + \min\Bigl[\sum_{m=1}^k a_1^m c,\, \sum_{m=1}^k a_2^m c\Bigr] \le \cdots
\le T^k(J)(x) + \min[a_1^k c, a_2^k c] \le T^{k+1}(J)(x). \tag{3}
\]
By taking the limit as $k\to\infty$ and noting that the quantities in the minimization are monotone, and either nonnegative or nonpositive, we conclude that
\[
J(x) + \min\Bigl[\frac{c}{1-a_1}, \frac{c}{1-a_2}\Bigr] \le T(J)(x) + \min\Bigl[\frac{a_1 c}{1-a_1}, \frac{a_2 c}{1-a_2}\Bigr] \le \cdots
\le T^k(J)(x) + \min\Bigl[\frac{a_1^k c}{1-a_1}, \frac{a_2^k c}{1-a_2}\Bigr]
\le T^{k+1}(J)(x) + \min\Bigl[\frac{a_1^{k+1} c}{1-a_1}, \frac{a_2^{k+1} c}{1-a_2}\Bigr] \le \cdots \le J^*(x). \tag{4}
\]
Finally, we note that
\[
\min[a_1^k c, a_2^k c] \le T^{k+1}(J)(x) - T^k(J)(x).
\]
Thus
\[
\min[a_1^k c, a_2^k c] \le \inf_{x\in S}\bigl(T^{k+1}(J)(x) - T^k(J)(x)\bigr).
\]
Let $b_{k+1} = \inf_{x\in S}\bigl(T^{k+1}(J)(x) - T^k(J)(x)\bigr)$, so that $\min[a_1^k c, a_2^k c] \le b_{k+1}$. From the above relation we infer that
\[
\min\Bigl[\frac{a_1^{k+1} c}{1-a_1}, \frac{a_2^{k+1} c}{1-a_2}\Bigr] \le \min\Bigl[\frac{a_1}{1-a_1}b_{k+1}, \frac{a_2}{1-a_2}b_{k+1}\Bigr] = c_{k+1}.
\]
Therefore
\[
T^k(J)(x) + \min\Bigl[\frac{a_1^k c}{1-a_1}, \frac{a_2^k c}{1-a_2}\Bigr] \le T^{k+1}(J)(x) + c_{k+1}.
\]
This relationship gives for $k = 1$
\[
T(J)(x) + \min\Bigl[\frac{a_1 c}{1-a_1}, \frac{a_2 c}{1-a_2}\Bigr] \le T^2(J)(x) + c_2.
\]
Let
\[
c = \inf_{x\in S}\bigl(T(J)(x) - J(x)\bigr).
\]
Then the above inequality still holds. From the definition of $c_1$ we have
\[
c_1 = \min\Bigl[\frac{a_1 c}{1-a_1}, \frac{a_2 c}{1-a_2}\Bigr].
\]
Therefore
\[
T(J)(x) + c_1 \le T^2(J)(x) + c_2,
\]
and $T(J)(x) + c_1 \le J^*(x)$ from Eq. (4). Similarly, let $J_1(x) = T(J)(x)$, and let
\[
b_2 = \min_{x\in S}\bigl(T^2(J)(x) - T(J)(x)\bigr) = \min_{x\in S}\bigl(T(J_1)(x) - J_1(x)\bigr).
\]
If we proceed as before, we get
\[
J_1(x) + \min\Bigl[\frac{b_2}{1-a_1}, \frac{b_2}{1-a_2}\Bigr] \le T(J_1)(x) + \min\Bigl[\frac{a_1 b_2}{1-a_1}, \frac{a_2 b_2}{1-a_2}\Bigr]
\le T^2(J_1)(x) + \min\Bigl[\frac{a_1^2 b_2}{1-a_1}, \frac{a_2^2 b_2}{1-a_2}\Bigr] \le \cdots \le J^*(x).
\]
Then
\[
\min[a_1 b_2, a_2 b_2] \le \min_{x\in S}\bigl[T^2(J_1)(x) - T(J_1)(x)\bigr] = \min_{x\in S}\bigl[T^3(J)(x) - T^2(J)(x)\bigr] = b_3.
\]
Thus
\[
\min\Bigl[\frac{a_1^2 b_2}{1-a_1}, \frac{a_2^2 b_2}{1-a_2}\Bigr] \le \min\Bigl[\frac{a_1 b_3}{1-a_1}, \frac{a_2 b_3}{1-a_2}\Bigr].
\]
Thus
\[
T(J_1)(x) + \min\Bigl[\frac{a_1 b_2}{1-a_1}, \frac{a_2 b_2}{1-a_2}\Bigr] \le T^2(J_1)(x) + \min\Bigl[\frac{a_1 b_3}{1-a_1}, \frac{a_2 b_3}{1-a_2}\Bigr],
\]
or
\[
T^2(J)(x) + c_2 \le T^3(J)(x) + c_3,
\]
and
\[
T^2(J)(x) + c_2 \le J^*(x).
\]
Proceeding similarly, the result is proved. The reverse inequalities can be proved by a similar argument.
(c) Let us first consider the state $x = 1$:
\[
F(J)(1) = \min_{u\in U(1)}\Bigl[g(1,u) + \alpha\sum_{j=1}^n p_{1j}(u)J(j)\Bigr].
\]
Thus
\[
F(J + re)(1) = \min_{u\in U(1)}\Bigl[g(1,u) + \alpha\sum_{j=1}^n p_{1j}(u)(J + re)(j)\Bigr]
= \min_{u\in U(1)}\Bigl[g(1,u) + \alpha\sum_{j=1}^n p_{1j}(u)J(j) + \alpha r\Bigr] = F(J)(1) + \alpha r.
\]
Thus
\[
\frac{F(J+re)(1) - F(J)(1)}{r} = \alpha. \tag{1}
\]
Since $0 \le \alpha \le 1$ we conclude that $\alpha^n \le \alpha$. Thus
\[
\alpha^n \le \frac{F(J+re)(1) - F(J)(1)}{r} = \alpha.
\]
For the state $x = 2$ we proceed similarly and we get
\[
F(J)(2) = \min_{u\in U(2)}\Bigl[g(2,u) + \alpha p_{21}(u)F(J)(1) + \alpha\sum_{j=2}^n p_{2j}(u)J(j)\Bigr]
\]
and
\[
F(J+re)(2) = \min_{u\in U(2)}\Bigl[g(2,u) + \alpha p_{21}(u)F(J+re)(1) + \alpha\sum_{j=2}^n p_{2j}(u)(J+re)(j)\Bigr]
= \min_{u\in U(2)}\Bigl[g(2,u) + \alpha p_{21}(u)F(J)(1) + \alpha^2 rp_{21}(u) + \alpha\sum_{j=2}^n p_{2j}(u)J(j) + \alpha r\sum_{j=2}^n p_{2j}(u)\Bigr],
\]
where, for the last equality, we used relation (1). Thus we conclude
\[
F(J+re)(2) = F(J)(2) + \alpha^2 rp_{21} + \alpha r\sum_{j=2}^n p_{2j} = F(J)(2) + \alpha^2 rp_{21} + \alpha r(1 - p_{21}),
\]
which yields
\[
\frac{F(J+re)(2) - F(J)(2)}{r} = \alpha^2 p_{21} + \alpha(1 - p_{21}). \tag{2}
\]
Now let us study the behavior of the right-hand side of Eq. (2). We have $0 < \alpha < 1$ and $0 \le p_{21} \le 1$, so since $\alpha^2 \le \alpha$, and $\alpha^2 p_{21} + \alpha(1-p_{21})$ is a convex combination of $\alpha^2$ and $\alpha$, it is easy to see that
\[
\alpha^2 \le \alpha^2 p_{21} + \alpha(1-p_{21}) \le \alpha. \tag{3}
\]
If we combine Eq. (2) with Eq. (3) we get
\[
\alpha^n \le \alpha^2 \le \frac{F(J+re)(2) - F(J)(2)}{r} \le \alpha,
\]
which is the pursued result.

Claim: for every state $x$,
\[
\alpha^x \le \frac{F(J+re)(x) - F(J)(x)}{r} \le \alpha.
\]
Proof: We shall employ an inductive argument. Obviously the result holds for $x = 1, 2$. Let us assume that it holds for all $x \le i$. We shall prove it for $x = i + 1$:
\[
F(J)(i+1) = \min_{u\in U(i+1)}\Bigl[g(i+1,u) + \alpha\sum_{j=1}^i p_{i+1,j}(u)F(J)(j) + \alpha\sum_{j=i+1}^n p_{i+1,j}(u)J(j)\Bigr],
\]
\[
F(J+re)(i+1) = \min_{u\in U(i+1)}\Bigl[g(i+1,u) + \alpha\sum_{j=1}^i p_{i+1,j}(u)F(J+re)(j) + \alpha\sum_{j=i+1}^n p_{i+1,j}(u)(J+re)(j)\Bigr].
\]
We know that $\alpha^j r \le F(J+re)(j) - F(J)(j) \le \alpha r$ for all $j \le i$; thus
\[
F(J)(i+1) + \alpha r\sum_{j=1}^i \alpha^j p_{i+1,j} + \alpha r(1-p) \;\le\; F(J+re)(i+1) \;\le\; F(J)(i+1) + \alpha^2 rp + \alpha r(1-p),
\]
where
\[
p = \sum_{j=1}^i p_{i+1,j}.
\]
Obviously
\[
\sum_{j=1}^i \alpha^j p_{i+1,j} \ge \alpha^i\sum_{j=1}^i p_{i+1,j} = \alpha^i p.
\]
Thus
\[
\alpha^{i+1}p + \alpha(1-p) \le \frac{F(J+re)(i+1) - F(J)(i+1)}{r} \le \alpha^2 p + \alpha(1-p).
\]
Since $0 < \alpha^{i+1} \le \alpha^2 < 1$ and $0 \le p \le 1$, we conclude that $\alpha^{i+1} \le \alpha^{i+1}p + \alpha(1-p)$ and $\alpha^2 p + \alpha(1-p) \le \alpha$. Thus
\[
\alpha^{i+1} \le \frac{F(J+re)(i+1) - F(J)(i+1)}{r} \le \alpha,
\]
which completes the inductive proof.

Since $0 \le \alpha^n \le \alpha^i \le \alpha \le 1$ for $1 \le i \le n$, property (2) of part (a) holds with $a_1 = \alpha^n$ and $a_2 = \alpha$, and the result follows.
(d) Let $J(x) \le J'(x)$, i.e., $J'(x) - J(x) \ge 0$. Since all the elements $m_{ij}$ of $M$ are non-negative, we conclude that
\[
M\bigl(J' - J\bigr)(x) \ge 0 \;\iff\; (MJ')(x) \ge (MJ)(x)
\;\Rightarrow\; g(x) + (MJ')(x) \ge g(x) + (MJ)(x)
\;\Rightarrow\; T(J')(x) \ge T(J)(x);
\]
thus property (1) holds.

For property (2) we note that
\[
T(J + re)(x) = g(x) + M(J + re)(x) = g(x) + (MJ)(x) + r(Me)(x) = T(J)(x) + r(Me)(x).
\]
We have
\[
a_1 \le (Me)(x) \le a_2,
\]
so that
\[
\frac{T(J+re)(x) - T(J)(x)}{r} = (Me)(x)
\]
and
\[
a_1 \le \frac{T(J+re)(x) - T(J)(x)}{r} \le a_2.
\]
Thus property (2) also holds if $a_2 < 1$.
1.10
(a) If there is a unique $\mu$ such that $T_\mu(J) = T(J)$, then there exists an $\epsilon > 0$ such that for all $\delta \in \Re^n$ with $\max_i|\delta(i)| \le \epsilon$ we have
\[
F(J + \delta) = T(J + \delta) - (J + \delta) = g_\mu + \alpha P_\mu(J + \delta) - (J + \delta) = g_\mu + (\alpha P_\mu - I)(J + \delta).
\]
It follows that $F$ is linear around $J$ and its Jacobian is $\alpha P_\mu - I$.

(b) We first note that the equation defining Newton's method is the first order Taylor series expansion of $F$ around $J^k$. If $\mu^k$ is the unique $\mu$ such that $T_\mu(J^k) = T(J^k)$, then $F$ is linear near $J^k$ and coincides with its first order Taylor series expansion around $J^k$. Therefore the vector $J^{k+1}$ obtained by the Newton iteration satisfies
\[
F(J^{k+1}) = 0,
\]
or
\[
T_{\mu^k}(J^{k+1}) = J^{k+1}.
\]
This equation yields $J^{k+1} = J_{\mu^k}$, so the next policy $\mu^{k+1}$ is obtained as
\[
\mu^{k+1} = \arg\min_\mu T_\mu(J_{\mu^k}).
\]
This is precisely the policy iteration algorithm.
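The identification of Newton's method with policy iteration can be illustrated numerically. In the sketch below (a made-up discounted MDP; all names and data are hypothetical), each Newton step on $F(J) = T(J) - J$ amounts to solving $T_\mu(J^{k+1}) = J^{k+1}$ for the greedy policy $\mu$, i.e., to one policy iteration step:

```python
import numpy as np

np.random.seed(2)
n, m, alpha = 4, 3, 0.9
# P[u] is the transition matrix and g[u] the cost vector under control u.
P = np.random.rand(m, n, n); P /= P.sum(axis=2, keepdims=True)
g = np.random.rand(m, n)

def greedy(J):
    return np.argmin(g + alpha * (P @ J), axis=0)   # mu attaining T(J)

J = np.zeros(n)
for _ in range(50):
    mu = greedy(J)
    Pmu = P[mu, np.arange(n), :]
    gmu = g[mu, np.arange(n)]
    # Newton step for F(J) = T(J) - J with Jacobian alpha*Pmu - I:
    # the new iterate is exactly J_mu, the cost of the greedy policy.
    J = np.linalg.solve(np.eye(n) - alpha * Pmu, gmu)
print(J)   # converges to J* after finitely many (policy iteration) steps
```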
1.12
For simplicity, we consider the case where $U(i)$ consists of a single control. The calculations are very similar for the more general case. We first show that $\sum_{j=1}^n \tilde M_{ij} = \alpha$. We apply the definition of the quantities $\tilde M_{ij}$:
\[
\sum_{j=1}^n \tilde M_{ij} = \sum_{j=1}^n\Bigl[\delta_{ij} + \frac{(1-\alpha)(M_{ij} - \delta_{ij})}{1 - m_i}\Bigr]
= \sum_{j=1}^n \delta_{ij} + \sum_{j=1}^n \frac{(1-\alpha)(M_{ij} - \delta_{ij})}{1 - m_i}
\]
\[
= 1 + (1-\alpha)\frac{\sum_{j=1}^n M_{ij}}{1 - m_i} - \frac{1-\alpha}{1 - m_i}\sum_{j=1}^n \delta_{ij}
= 1 + (1-\alpha)\frac{m_i}{1 - m_i} - \frac{1-\alpha}{1 - m_i} = 1 - (1-\alpha) = \alpha.
\]
Let $J^*_1, \ldots, J^*_n$ satisfy
\[
J^*_i = g_i + \sum_{j=1}^n M_{ij}J^*_j. \tag{1}
\]
We substitute $J^*$ into the new equation
\[
J^*_i = \tilde g_i + \sum_{j=1}^n \tilde M_{ij}J^*_j
\]
and manipulate the equation until we reach a relation that holds trivially:
\[
J^*_i = \frac{(1-\alpha)g_i}{1-m_i} + \sum_{j=1}^n \delta_{ij}J^*_j + \frac{1-\alpha}{1-m_i}\sum_{j=1}^n (M_{ij} - \delta_{ij})J^*_j
= \frac{(1-\alpha)g_i}{1-m_i} + J^*_i + \frac{1-\alpha}{1-m_i}\sum_{j=1}^n M_{ij}J^*_j - \frac{1-\alpha}{1-m_i}J^*_i
\]
\[
\iff\quad J^*_i = J^*_i + \frac{1-\alpha}{1-m_i}\Bigl[g_i + \sum_{j=1}^n M_{ij}J^*_j - J^*_i\Bigr].
\]
This relation follows trivially from Eq. (1) above. Thus $J^*$ is a solution of
\[
J_i = \tilde g_i + \sum_{j=1}^n \tilde M_{ij}J_j.
\]
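A numerical check of this calculation is easy: the sketch below uses a made-up substochastic matrix $M$ (single control per state) and verifies both that the rows of $\tilde M$ sum to $\alpha$ and that $J^*$ solves the transformed equation:

```python
import numpy as np

np.random.seed(3)
n, alpha = 4, 0.9
# Hypothetical substochastic matrix with constant row sums m_i = 0.8 < 1.
M = np.random.rand(n, n); M *= 0.8 / M.sum(axis=1, keepdims=True)
g = np.random.rand(n)
m = M.sum(axis=1)

J_star = np.linalg.solve(np.eye(n) - M, g)          # J_i = g_i + (M J)_i

I = np.eye(n)
M_tilde = I + (1 - alpha) * (M - I) / (1 - m)[:, None]
g_tilde = (1 - alpha) * g / (1 - m)
assert np.allclose(M_tilde.sum(axis=1), alpha)       # row sums equal alpha
assert np.allclose(J_star, g_tilde + M_tilde @ J_star)
print("J* solves the transformed equation")
```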
1.17
The form of Bellman's equation for the tax problem is
\[
J(x) = \min_i\Bigl[\sum_{j\ne i} c_j(x_j) + \alpha E_{w_i}\bigl\{J\bigl(x_1,\ldots,x_{i-1},f_i(x_i,w_i),x_{i+1},\ldots,x_n\bigr)\bigr\}\Bigr].
\]
Let $\tilde J(x) = -J(x)$:
\[
\tilde J(x) = \max_i\Bigl[-\sum_{j=1}^n c_j(x_j) + c_i(x_i) + \alpha E_{w_i}\bigl\{\tilde J(\,\cdot\,)\bigr\}\Bigr].
\]
Let $\hat J(x) = (1-\alpha)\tilde J(x) + \sum_{j=1}^n c_j(x_j)$. By substitution we obtain
\[
\hat J(x) = \max_i\Bigl[-(1-\alpha)\sum_{j=1}^n c_j(x_j) + (1-\alpha)c_i(x_i) + \sum_{j=1}^n c_j(x_j) + \alpha E_{w_i}\bigl\{(1-\alpha)\tilde J(\,\cdot\,)\bigr\}\Bigr]
\]
\[
= \max_i\Bigl[c_i(x_i) - \alpha E_{w_i}\bigl\{c_i\bigl(f_i(x_i,w_i)\bigr)\bigr\} + \alpha E_{w_i}\bigl\{\hat J(\,\cdot\,)\bigr\}\Bigr],
\]
where we used $(1-\alpha)\tilde J = \hat J - \sum_j c_j$ and the fact that only the $i$th component of the state changes at a transition. Thus $\hat J$ satisfies Bellman's equation of a multi-armed bandit problem with reward
\[
R_i(x_i) = c_i(x_i) - \alpha E_{w_i}\bigl\{c_i\bigl(f_i(x_i,w_i)\bigr)\bigr\}.
\]
1.18
Bellman's equation for the restart problem is
\[
J(x) = \max\bigl[R(x_0) + \alpha E\{J(f(x_0,w))\},\; R(x) + \alpha E\{J(f(x,w))\}\bigr]. \tag{A}
\]
Now, consider the one-armed bandit problem with reward $R(x)$:
\[
J(x,M) = \max\bigl\{M,\; R(x) + \alpha E[J(f(x,w),M)]\bigr\}. \tag{B}
\]
We have
\[
J(x_0,M) = R(x_0) + \alpha E[J(f(x_0,w),M)] > M
\]
if $M < m(x_0)$, and $J(x_0,M) = M$ otherwise. This implies that
\[
R(x_0) + \alpha E\bigl[J\bigl(f(x_0,w),m(x_0)\bigr)\bigr] = m(x_0).
\]
Therefore the forms of both Bellman equations (A) and (B) are the same when $M = m(x_0)$.
Solutions Vol. II, Chapter 2
2.1
(a) (i) First, we need to define a state space for the problem. The obvious choice for a state variable is our location. However, this does not encapsulate all of the necessary information. We also need to include the value of $c$ if it is known. Thus, let the state space consist of the following $2m + 2$ states: $\{S, S_1, \ldots, S_m, I_1, \ldots, I_m, D\}$, where $S$ is associated with being at the starting point with no information, $S_i$ and $I_i$ are associated with being at $S$ and $I$, respectively, and knowing that $c = c_i$, and $D$ is the termination state.
At state $S$, there are two possible controls: go directly to $D$ (direct) or go to an intermediate point (indirect). If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S, \text{direct}, D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S, \text{indirect}, I_i) = b$.

At state $S_i$, for $i \in \{1, \ldots, m\}$, we have the same controls as at state $S$. Again, if control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S_i, \text{direct}, D) = a$. If, on the other hand, control indirect is selected, we go to state $I_i$ with probability 1, and the cost is $g(S_i, \text{indirect}, I_i) = b$.

At state $I_i$, for $i \in \{1, \ldots, m\}$, there are also two possible controls: go back to the start (start) or go to the destination (dest). If control start is selected, we go to state $S_i$ with probability 1, and the cost is $g(I_i, \text{start}, S_i) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i, \text{dest}, D) = c_i$.
We have thus formulated the problem as a stochastic shortest path problem. Bellman's equation for this problem is
\[
J^*(S) = \min\Bigl[a,\; b + \sum_{i=1}^m p_i J^*(I_i)\Bigr],
\]
\[
J^*(S_i) = \min\bigl[a,\; b + J^*(I_i)\bigr],
\]
\[
J^*(I_i) = \min\bigl[c_i,\; b + J^*(S_i)\bigr].
\]
We assume that $b > 0$. Then, Assumptions 5.1 and 5.2 hold, since all improper policies have infinite cost. As a result, if $\mu^*(I_i) = \text{start}$, then $\mu^*(S_i) = \text{direct}$. If $\mu^*(I_i) \ne \text{start}$, then we never reach state $S_i$, and so it doesn't matter what the control is in this case. Thus, $J^*(S_i) = a$ and $\mu^*(S_i) = \text{direct}$. From this, it is easy to derive the optimal costs and controls for the other states:
\[
J^*(I_i) = \min[c_i,\; b + a],
\qquad
\mu^*(I_i) = \begin{cases} \text{dest}, & \text{if } c_i < b + a,\\ \text{start}, & \text{otherwise,}\end{cases}
\]
\[
J^*(S) = \min\Bigl[a,\; b + \sum_{i=1}^m p_i\min(c_i, b+a)\Bigr],
\qquad
\mu^*(S) = \begin{cases} \text{direct}, & \text{if } a < b + \sum_{i=1}^m p_i\min(c_i, b+a),\\ \text{indirect}, & \text{otherwise.}\end{cases}
\]
For the numerical case given, we see that $a < b + \sum_{i=1}^m p_i\min(c_i, b+a)$, since $a = 2$ and $b + \sum_{i=1}^m p_i\min(c_i, b+a) = 2.5$. Hence $\mu^*(S) = \text{direct}$. We need not consider the other states since they will never be reached.
(ii) In this case, every time we are at the starting location, our available information is the same. We thus no longer need the states $S_i$ from part (i). Our state space for this part is then $\{S, I_1, \ldots, I_m, D\}$.

At state $S$, the possible controls are $\{\text{direct}, \text{indirect}\}$. If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S, \text{direct}, D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S, \text{indirect}, I_i) = b$ [same as in part (i)].

At state $I_i$, for $i \in \{1, \ldots, m\}$, the possible controls are $\{\text{start}, \text{dest}\}$. If control start is selected, we go to state $S$ with probability 1, and the cost is $g(I_i, \text{start}, S) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i, \text{dest}, D) = c_i$.
Bellman's equation for this stochastic shortest path problem is
\[
J^*(S) = \min\Bigl[a,\; b + \sum_{i=1}^m p_i J^*(I_i)\Bigr],
\qquad
J^*(I_i) = \min\bigl[c_i,\; b + J^*(S)\bigr].
\]
The optimal policy can be described by
\[
\mu^*(S) = \begin{cases} \text{direct}, & \text{if } a < b + \sum_{i=1}^m p_i J^*(I_i),\\ \text{indirect}, & \text{otherwise,}\end{cases}
\qquad
\mu^*(I_i) = \begin{cases} \text{dest}, & \text{if } c_i < b + J^*(S),\\ \text{start}, & \text{otherwise.}\end{cases}
\]
We will solve the problem for the numerical case by guessing an optimal policy and then showing that its cost $J$ satisfies $J = TJ$. Since $J^*$ is the unique solution to this equation, our policy is then optimal. So let's guess the policy to be
\[
\mu(S) = \text{direct}, \qquad \mu(I_1) = \text{dest}, \qquad \mu(I_2) = \text{start}.
\]
Then
\[
J(S) = a = 2, \qquad J(I_1) = c_1 = 0, \qquad J(I_2) = b + J(S) = 1 + 2 = 3.
\]
From Bellman's equation, we have
\[
J(S) = \min\bigl(2,\; 1 + 0.5(3 + 0)\bigr) = 2,
\qquad
J(I_1) = \min(0,\; 1 + 2) = 0,
\qquad
J(I_2) = \min(5,\; 1 + 2) = 3.
\]
Thus, our policy is optimal.
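The guess-and-verify step can also be confirmed by value iteration on the numerical case used above ($a = 2$, $b = 1$, $p_1 = p_2 = 0.5$, $c_1 = 0$, $c_2 = 5$); a minimal sketch:

```python
# Value iteration for the numerical case of part (a)(ii).
a, b, p, c = 2.0, 1.0, [0.5, 0.5], [0.0, 5.0]

J_S, J_I = 0.0, [0.0, 0.0]
for _ in range(100):
    J_S = min(a, b + sum(pi * Ji for pi, Ji in zip(p, J_I)))
    J_I = [min(ci, b + J_S) for ci in c]
print(J_S, J_I)   # -> 2.0 [0.0, 3.0], matching the guessed policy's costs
```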
(b) The state space for this problem is the same as for part (a)(ii): $\{S, I_1, \ldots, I_m, D\}$.

At state $S$, the possible controls are $\{\text{direct}, \text{indirect}\}$. If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S, \text{direct}, D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S, \text{indirect}, I_i) = b$ [same as in parts (a)(i) and (ii)].

At state $I_i$, for $i \in \{1, \ldots, m\}$, we have an additional option of waiting. So the possible controls are $\{\text{start}, \text{dest}, \text{wait}\}$. If control start is selected, we go to state $S$ with probability 1, and the cost is $g(I_i, \text{start}, S) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i, \text{dest}, D) = c_i$. If control wait is selected, we go to state $I_j$ with probability $p_j$, and the cost is $g(I_i, \text{wait}, I_j) = d$.

Bellman's equation is
\[
J^*(S) = \min\Bigl[a,\; b + \sum_{i=1}^m p_i J^*(I_i)\Bigr],
\qquad
J^*(I_i) = \min\Bigl[c_i,\; b + J^*(S),\; d + \sum_{j=1}^m p_j J^*(I_j)\Bigr].
\]
We can describe the optimal policy as follows:
\[
\mu^*(S) = \begin{cases} \text{direct}, & \text{if } a < b + \sum_{i=1}^m p_i J^*(I_i),\\ \text{indirect}, & \text{otherwise.}\end{cases}
\]
If direct was selected, we do not need to consider the other states (other than $D$) since they will never be reached. If indirect was selected, then defining $k = \min(2b, d)$, we see that
\[
\mu^*(I_i) = \begin{cases}
\text{dest}, & \text{if } c_i < k + \sum_{j=1}^m p_j J^*(I_j),\\
\text{start}, & \text{if } c_i > k + \sum_{j=1}^m p_j J^*(I_j) \text{ and } 2b < d,\\
\text{wait}, & \text{if } c_i > k + \sum_{j=1}^m p_j J^*(I_j) \text{ and } 2b > d.
\end{cases}
\]
50
2.2
Lets dene the following states:
H: Last ip outcome was heads
T: Last ip outcome was tails
C: Caught (this is the termination state)
(a) We can formulate this problem as a stochastic shortest path problem with state C being the termina-
tion state. There are four possible policies:
1
= {always ip fair coin},
2
= {always ip two-headed coin},

3
= {ip fair coin if last outcome was heads / ip two-headed coin if last outcome was tails}, and
4
=
{ip fair coin if last outcome was tails / ip two-headed coin if last outcome was heads}. The only way
to reach the termination state is to be caught cheating. Under all policies except
1
, this is inevitable.
Thus
1
is an improper policy, and
2
,
3
, and
4
are proper policies.
(b) Let $J_{\pi_1}(H)$ and $J_{\pi_1}(T)$ be the costs of policy $\pi_1$ where the starting state is H and T, respectively. The expected benefit starting from state T up to the first return to T (and always using the fair coin) is
\[
\frac12\Bigl(1 + \frac12 + \frac1{2^2} + \cdots\Bigr) - \frac m2 = \frac12(2 - m).
\]
Therefore
\[
J_{\pi_1}(T) = \begin{cases} +\infty & \text{if } m < 2,\\ 0 & \text{if } m = 2,\\ -\infty & \text{if } m > 2.\end{cases}
\]
Also we have
\[
J_{\pi_1}(H) = \frac12\bigl(1 + J_{\pi_1}(H)\bigr) + \frac12 J_{\pi_1}(T),
\]
so
\[
J_{\pi_1}(H) = 1 + J_{\pi_1}(T).
\]
It follows that if $m > 2$, then $\pi_1$ results in infinite cost for any initial state.
(c,d) The expected one-stage rewards at each stage are:

Play fair in state H: $\frac12$
Cheat in state H: $1 - p$
Play fair in state T: $\frac{1-m}2$
Cheat in state T: $0$

We show that any policy that cheats at H at some stage cannot be optimal. As a result we can eliminate cheating from the control constraint set of state H.
Indeed, suppose we are at state H at some stage and consider a policy $\pi$ which cheats at the first stage and then follows the optimal policy $\pi^*$ from the second stage on. Consider a policy $\pi'$ which plays fair at the first stage, and then follows $\pi^*$ from the second stage on if the outcome of the first stage is H, or cheats at the second stage and follows $\pi^*$ from the third stage on if the outcome of the first stage is T. We have
\[
J_\pi(H) = (1-p)\bigl[1 + J_{\pi^*}(H)\bigr],
\]
\[
J_{\pi'}(H) = \frac12\bigl(1 + J_{\pi^*}(H)\bigr) + \frac12\Bigl((1-p)\bigl[1 + J_{\pi^*}(H)\bigr]\Bigr)
= \frac12 + \frac12\bigl[J_{\pi^*}(H) + J_\pi(H)\bigr] \ge \frac12 + J_\pi(H),
\]
where the inequality follows from the fact that $J_{\pi^*}(H) \ge J_\pi(H)$, since $\pi^*$ is optimal. Therefore the reward of policy $\pi$ can be improved by at least $\frac12$ by switching to policy $\pi'$, and therefore $\pi$ cannot be optimal.
We now need only consider policies in which the gambler can only play fair at state H: $\pi_1$ and $\pi_3$. Under $\pi_1$, we saw from part (b) that the expected benefits are
\[
J_{\pi_1}(T) = \begin{cases} +\infty & \text{if } m < 2,\\ 0 & \text{if } m = 2,\\ -\infty & \text{if } m > 2,\end{cases}
\qquad
J_{\pi_1}(H) = \begin{cases} +\infty & \text{if } m < 2,\\ 1 & \text{if } m = 2,\\ -\infty & \text{if } m > 2.\end{cases}
\]
Under $\pi_3$, we have
\[
J_{\pi_3}(T) = (1-p)J_{\pi_3}(H),
\qquad
J_{\pi_3}(H) = \frac12\bigl[1 + J_{\pi_3}(H)\bigr] + \frac12 J_{\pi_3}(T).
\]
Solving these two equations yields
\[
J_{\pi_3}(T) = \frac{1-p}{p}, \qquad J_{\pi_3}(H) = \frac1p.
\]
Thus if $m > 2$, it is optimal to cheat if the last flip was tails and play fair otherwise, and if $m < 2$, it is optimal to always play fair.
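The closed-form expressions for $\pi_3$ can be checked by simple policy evaluation; the sketch below uses the made-up value $p = 0.25$:

```python
# Policy evaluation for pi_3 (fair at H, cheat at T), with p = 0.25:
# J(T) = (1 - p) J(H),  J(H) = 0.5 (1 + J(H)) + 0.5 J(T).
p = 0.25
J_H = J_T = 0.0
for _ in range(2000):
    J_H = 0.5 * (1 + J_H) + 0.5 * J_T
    J_T = (1 - p) * J_H
print(J_H, J_T, 1 / p, (1 - p) / p)   # J_H -> 1/p, J_T -> (1-p)/p
```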
2.7
(a) Let $i$ be any state in $S_m$. Then
\[
J(i) = \min_{u\in U(i)} E\bigl\{g(i,u,j) + J(j)\bigr\}
= \min_{u\in U(i)}\Bigl[\sum_{j\in S_m} p_{ij}(u)\bigl[g(i,u,j) + J(j)\bigr] + \sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}} p_{ij}(u)\bigl[g(i,u,j) + J(j)\bigr]\Bigr]
\]
\[
= \min_{u\in U(i)}\Biggl[\sum_{j\in S_m} p_{ij}(u)\bigl[g(i,u,j) + J(j)\bigr]
+ \Bigl(1 - \sum_{j\in S_m} p_{ij}(u)\Bigr)\,\frac{\sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}} p_{ij}(u)\bigl[g(i,u,j) + J(j)\bigr]}{1 - \sum_{j\in S_m} p_{ij}(u)}\Biggr].
\]
In the above equation, we can think of the union of $S_{m-1}, \ldots, S_1$, and $t$ as an aggregate termination state $t_m$ associated with $S_m$. The probability of a transition from $i \in S_m$ to $t_m$ (under $u$) is given by
\[
p_{it_m}(u) = 1 - \sum_{j\in S_m} p_{ij}(u).
\]
The corresponding cost of a transition from $i \in S_m$ to $t_m$ (under $u$) is given by
\[
\hat g(i,u,t_m) = \frac{\sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}} p_{ij}(u)\bigl[g(i,u,j) + J(j)\bigr]}{p_{it_m}(u)}.
\]
Thus, for $i \in S_m$, Bellman's equation can be written as
\[
J(i) = \min_{u\in U(i)}\Bigl[\sum_{j\in S_m} p_{ij}(u)\bigl[g(i,u,j) + J(j)\bigr] + p_{it_m}(u)\bigl[\hat g(i,u,t_m) + 0\bigr]\Bigr].
\]
Note that with respect to $S_m$, the termination state $t_m$ is both absorbing and of zero cost. Let $t_m$ and $\hat g(i,u,t_m)$ be similarly constructed for $m = 1, \ldots, M$.

The original stochastic shortest path problem can be solved as $M$ stochastic shortest path subproblems. To see how, start with evaluating $J(i)$ for $i \in S_1$ (where $t_1 = \{t\}$). With the values of $J(i)$, for $i \in S_1$, in hand, the $\hat g$ cost-terms for the $S_2$ problem can be computed. The solution of the original problem continues in this manner as the solution of $M$ stochastic shortest path problems in succession.
(b) Suppose that in the finite horizon problem there are $n$ states. Define a new state space $S_{\text{new}}$ and sets $S_m$ as follows:
\[
S_{\text{new}} = \bigl\{(k,i) \mid k \in \{0, 1, \ldots, M-1\} \text{ and } i \in \{1, 2, \ldots, n\}\bigr\},
\]
\[
S_m = \bigl\{(k,i) \mid k = M - m \text{ and } i \in \{1, 2, \ldots, n\}\bigr\}
\]
for $m = 1, 2, \ldots, M$. (Note that the $S_m$'s do not overlap.) By associating $S_m$ with the state space of the original finite-horizon problem at stage $k = M - m$, we see that if $i_k \in S_m$, then $i_{k+1} \in S_{m-1}$ under all policies. By augmenting a termination state $t$ which is absorbing and of zero cost, we see that the original finite-horizon problem can be cast as a stochastic shortest path problem with the special structure indicated in the problem statement.
2.8
Let $J^*$ be the optimal cost of the original problem and $\hat J$ be the optimal cost of the modified problem. Then we have
\[
J^*(i) = \min_u \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + J^*(j)\bigr),
\]
and
\[
\hat J(i) = \min_u \sum_{j=1,\,j\ne i}^n \frac{p_{ij}(u)}{1 - p_{ii}(u)}\Bigl(g(i,u,j) + \frac{g(i,u,i)\,p_{ii}(u)}{1 - p_{ii}(u)} + \hat J(j)\Bigr).
\]
For each $i$, let $\mu^*(i)$ be a control such that
\[
J^*(i) = \sum_{j=1}^n p_{ij}(\mu^*(i))\bigl(g(i,\mu^*(i),j) + J^*(j)\bigr).
\]
Then
\[
J^*(i) = \Bigl[\sum_{j=1,\,j\ne i}^n p_{ij}(\mu^*(i))\bigl(g(i,\mu^*(i),j) + J^*(j)\bigr)\Bigr] + p_{ii}(\mu^*(i))\bigl(g(i,\mu^*(i),i) + J^*(i)\bigr).
\]
By collecting the terms involving $J^*(i)$ and then dividing by $1 - p_{ii}(\mu^*(i))$,
\[
J^*(i) = \frac{1}{1 - p_{ii}(\mu^*(i))}\Biggl[\Bigl[\sum_{j=1,\,j\ne i}^n p_{ij}(\mu^*(i))\bigl(g(i,\mu^*(i),j) + J^*(j)\bigr)\Bigr] + p_{ii}(\mu^*(i))\,g(i,\mu^*(i),i)\Biggr].
\]
Since $\sum_{j=1,\,j\ne i}^n \frac{p_{ij}(\mu^*(i))}{1 - p_{ii}(\mu^*(i))} = 1$, we have
\[
J^*(i) = \sum_{j=1,\,j\ne i}^n \frac{p_{ij}(\mu^*(i))}{1 - p_{ii}(\mu^*(i))}\Bigl(g(i,\mu^*(i),j) + J^*(j) + \frac{p_{ii}(\mu^*(i))\,g(i,\mu^*(i),i)}{1 - p_{ii}(\mu^*(i))}\Bigr).
\]
Therefore $J^*(i)$ is the cost of the stationary policy $\{\mu^*, \mu^*, \ldots\}$ in the modified problem. Thus
\[
J^*(i) \ge \hat J(i), \qquad \forall\, i.
\]
Similarly, for each $i$, let $\hat\mu(i)$ be a control such that
\[
\hat J(i) = \sum_{j=1,\,j\ne i}^n \frac{p_{ij}(\hat\mu(i))}{1 - p_{ii}(\hat\mu(i))}\Bigl(g(i,\hat\mu(i),j) + \frac{g(i,\hat\mu(i),i)\,p_{ii}(\hat\mu(i))}{1 - p_{ii}(\hat\mu(i))} + \hat J(j)\Bigr).
\]
Then, using a reverse argument from before, we see that $\hat J(i)$ is the cost of the stationary policy $\{\hat\mu, \hat\mu, \ldots\}$ in the original problem. Thus
\[
\hat J(i) \ge J^*(i), \qquad \forall\, i.
\]
Combining the two results, we have $\hat J(i) = J^*(i)$, and thus the two problems have the same optimal costs.

If $p_{ii}(u) = 1$ for some $i \ne t$, we can eliminate $u$ from $U(i)$ without increasing $J^*(i)$ or any other optimal cost $J^*(j)$, $j \ne i$. If that were not so, every optimal stationary policy would have to use $u$ at state $i$ and therefore would be improper, which is a contradiction.
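The equality of the two optimal costs is easy to check numerically in the special case where the transition cost depends only on the current state, $g(i,u,j) = g(i)$, so that the modified cost term reduces to $g(i)/(1 - p_{ii})$. A sketch with made-up single-control data (substochastic rows, with the missing probability mass going to the termination state):

```python
import numpy as np

np.random.seed(4)
n = 4
P = np.random.rand(n, n); P /= (P.sum(axis=1, keepdims=True) * 1.5)
g = np.random.rand(n)                      # per-stage cost, independent of j

J = np.linalg.solve(np.eye(n) - P, g)      # original costs

d = np.diag(P)
P_hat = P / (1 - d)[:, None]
np.fill_diagonal(P_hat, 0.0)               # self-transitions removed
g_hat = g / (1 - d)                        # g(i) + p_ii g(i)/(1 - p_ii)
J_hat = np.linalg.solve(np.eye(n) - P_hat, g_hat)
assert np.allclose(J, J_hat)
print("same costs after eliminating self-transitions")
```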
2.17
Consider a modified stochastic shortest path problem where the state space is denoted by $S'$, the control space by $U'$, the transition costs by $g'$, and the transition probabilities by $p'$. Let the state space be $S' = S'_S \cup S'_{SU}$, where
\[
S'_S = \{1, \ldots, n, t\}, \text{ where each } i \in S'_S \text{ corresponds to } i \in S,
\]
\[
S'_{SU} = \bigl\{(i,u) \mid i \in S,\ u \in U(i)\bigr\}, \text{ where each } (i,u) \in S'_{SU} \text{ corresponds to } i \in S \text{ and } u \in U(i).
\]
For $i, j \in S'_S$ and $u \in U'(i)$, we define $U'(i) = U(i)$, $g'(i,u,j) = g(i,u,j)$, and $p'_{ij}(u) = p_{ij}(u)$. For $(i,u) \in S'_{SU}$ and $j \in S'_S$, the only possible control is $\bar u = u$ (i.e., $U'(i,u) = \{u\}$), and we have $g'((i,u), \bar u, j) = g(i,u,j)$ and $p'_{(i,u)j}(\bar u) = p_{ij}(u)$.
Since trajectories originating from a state $i \in S'_S$ are equivalent to trajectories in the original problem, the optimal cost-to-go value for state $i$ in the modified problem is $J^*(i)$, the optimal cost-to-go value from the original problem. Let us denote the optimal cost-to-go value for $(i,u) \in S'_{SU}$ by $J^*(i,u)$. Then $J^*(i)$ and $J^*(i,u)$ solve uniquely Bellman's equation of the modified problem, which is
\[
J^*(i) = \min_{u\in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + J^*(j)\bigr), \tag{1}
\]
\[
J^*(i,u) = \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + J^*(j)\bigr). \tag{2}
\]
The Q-factors for the original problem are defined as
\[
Q(i,u) = \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + J^*(j)\bigr),
\]
so from Eq. (2), we have
\[
Q(i,u) = J^*(i,u), \qquad \forall\,(i,u). \tag{3}
\]
Also from Eqs. (1) and (2), we have
\[
J^*(i) = \min_{u\in U(i)} J^*(i,u), \qquad \forall\, i. \tag{4}
\]
Thus from Eqs. (1)-(4), we obtain
\[
Q(i,u) = \sum_{j=1}^n p_{ij}(u)\Bigl(g(i,u,j) + \min_{u'\in U(j)} Q(j,u')\Bigr). \tag{5}
\]
There remains to show that there is no other solution to Eq. (5). Indeed, if $\tilde Q(i,u)$ were such that
\[
\tilde Q(i,u) = \sum_{j=1}^n p_{ij}(u)\Bigl(g(i,u,j) + \min_{u'\in U(j)} \tilde Q(j,u')\Bigr), \qquad \forall\,(i,u), \tag{6}
\]
then by defining
\[
\tilde J(i) = \min_{u\in U(i)} \tilde Q(i,u) \tag{7}
\]
we obtain from Eq. (6)
\[
\tilde Q(i,u) = \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \tilde J(j)\bigr), \qquad \forall\,(i,u). \tag{8}
\]
By combining Eqs. (7) and (8), we have
\[
\tilde J(i) = \min_{u\in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \tilde J(j)\bigr), \qquad \forall\, i. \tag{9}
\]
Thus $\tilde J(i)$ and $\tilde Q(i,u)$ satisfy Bellman's Eqs. (1)-(2) for the modified problem. Since this Bellman equation is solved uniquely by $J^*(i)$ and $J^*(i,u)$, we see that
\[
\tilde Q(i,u) = J^*(i,u) = Q(i,u), \qquad \forall\,(i,u).
\]
Thus the Q-factors $Q(i,u)$ solve uniquely Eq. (5).
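Equation (5) also suggests the natural fixed-point iteration on Q-factors. The sketch below (made-up proper SSP data: strictly positive costs and substochastic rows, with the missing probability mass going to $t$) iterates Eq. (5) and checks that $\min_u Q(i,u)$ solves Bellman's equation of the original problem:

```python
import numpy as np

np.random.seed(5)
n, m = 4, 3
P = np.random.rand(m, n, n); P /= (P.sum(axis=2, keepdims=True) * 1.25)
g = np.random.rand(m, n) + 0.1

Q = np.zeros((m, n))               # Q[u, i] stands for Q(i, u)
for _ in range(500):
    Q = g + P @ Q.min(axis=0)      # fixed-point iteration on Eq. (5)
J = Q.min(axis=0)
assert np.allclose(J, (g + P @ J).min(axis=0))   # Bellman's equation (1)
print(J)
```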
Solutions Vol. II, Chapter 3
3.4
By using the relation $T_\mu(J^*) \le T(J^*) + \epsilon e = J^* + \epsilon e$ and the monotonicity of $T_\mu$, we obtain
\[
T_\mu^2(J^*) \le T_\mu(J^*) + \alpha\epsilon e \le J^* + \epsilon e + \alpha\epsilon e.
\]
Proceeding similarly, we obtain
\[
T_\mu^k(J^*) \le T_\mu(J^*) + \Bigl(\sum_{i=1}^{k-1}\alpha^i\Bigr)\epsilon e \le J^* + \sum_{i=0}^{k-1}\alpha^i\epsilon e,
\]
and by taking the limit as $k\to\infty$, the desired result $J_\mu \le J^* + \bigl(\epsilon/(1-\alpha)\bigr)e$ follows.
3.5
Under assumption P, we have by Prop. 1.2(a) that $J_\infty \le J^*$, where $J_\infty = \lim_{k\to\infty} T^k(J_0)$ and $J_0$ is the zero function. Let $r > 0$ be such that
\[
J^* \le J_0 + re.
\]
Then, applying $T^k$ to this inequality, we have
\[
J^* = T^k(J^*) \le T^k(J_0) + \alpha^k re.
\]
Taking the limit as $k\to\infty$, we obtain $J^* \le J_\infty$, which combined with the earlier shown relation $J_\infty \le J^*$, yields $J_\infty = J^*$. Under assumption N, the proof is analogous, using Prop. 1.2(b).
3.8
From the proof of Proposition 1.1, we know that given scalars $\epsilon_i > 0$, there exists a policy $\pi$ such that
\[
J_\pi(x) \le J^*(x) + \sum_{i=0}^\infty \alpha^i\epsilon_i.
\]
Let
\[
\epsilon_i = \frac{\epsilon}{2^{i+1}\alpha^i} > 0.
\]
Thus,
\[
J_\pi(x) \le J^*(x) + \epsilon\sum_{i=0}^\infty \frac1{2^{i+1}} = J^*(x) + \epsilon, \qquad \forall\, x\in S.
\]
If $\alpha < 1$, we can choose
\[
\epsilon_i = \frac{\epsilon}{\sum_{i=0}^\infty \alpha^i} = \epsilon(1-\alpha),
\]
which is independent of $i$. In this case, $\pi$ is stationary. If $\alpha = 1$, we may not have a stationary policy. In particular, let us consider a system with only one state, i.e., $S = \{0\}$, $U = (0,\infty)$, $J_0(0) = 0$, and $g(0,u) = u$. Then $J^*(0) = \inf_\pi J_\pi(0) = 0$, but for every stationary policy $\mu$, $J_\mu(0) = \sum_{k=0}^\infty u = \infty$.
3.9
Let $\pi^* = \{\mu_0^*, \mu_1^*, \ldots\}$ be an optimal policy. Then we know that
\[
J^*(x) = J_{\pi^*}(x) = \lim_{k\to\infty}\bigl(T_{\mu_0^*}T_{\mu_1^*}\cdots T_{\mu_k^*}\bigr)(J_0)(x) = \lim_{k\to\infty} T_{\mu_0^*}\bigl(T_{\mu_1^*}\cdots T_{\mu_k^*}\bigr)(J_0)(x).
\]
From monotone convergence we know that
\[
J^*(x) = \lim_{k\to\infty} T_{\mu_0^*}\bigl(T_{\mu_1^*}\cdots T_{\mu_k^*}\bigr)(J_0)(x) = T_{\mu_0^*}\Bigl(\lim_{k\to\infty}\bigl(T_{\mu_1^*}\cdots T_{\mu_k^*}\bigr)(J_0)\Bigr)(x)
\ge T_{\mu_0^*}(J^*)(x) \ge T(J^*)(x) = J^*(x).
\]
Thus $T_{\mu_0^*}(J^*)(x) = J^*(x)$. Hence by Prop. 1.3, the stationary policy $\{\mu_0^*, \mu_0^*, \ldots\}$ is optimal.
3.12
We shall make an analysis similar to the one of Section 3.1. In particular, let
\[
J_0(x) = 0,
\]
\[
T(J_0)(x) = \min_u\bigl[x'Qx + u'Ru\bigr] = x'Qx = x'K_0x,
\]
\[
T^2(J_0)(x) = \min_u\bigl[x'Qx + u'Ru + (Ax + Bu)'Q(Ax + Bu)\bigr] = x'K_1x,
\]
where $K_1 = Q + L_1'RL_1 + D_1'K_0D_1$, with $D_1 = A + BL_1$ and $L_1 = -(R + B'K_0B)^{-1}B'K_0A$. Thus
\[
T^k(J_0)(x) = x'K_kx,
\]
where $K_k = Q + L_k'RL_k + D_k'K_{k-1}D_k$, with $D_k = A + BL_k$ and $L_k = -(R + B'K_{k-1}B)^{-1}B'K_{k-1}A$. By the analysis of Chapter 4 we conclude that $K_k \to K$, with $K$ being the solution of the algebraic Riccati equation. Thus $J_\infty(x) = x'Kx = \lim_{N\to\infty} T^N(J_0)(x)$. Then it is easy to verify that $J_\infty(x) = T(J_\infty)(x)$, and by Prop. 1.5 in Chapter 1, we have that $J_\infty(x) = J^*(x)$.

For the periodic problem the controllability assumption is that there exists a finite sequence of controls $\{u_0, \ldots, u_r\}$ such that $x_{r+1} = 0$. Then the optimal control sequence is periodic:
\[
\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{p-1}^*, \mu_0^*, \mu_1^*, \ldots, \mu_{p-1}^*, \ldots\},
\]
where
\[
\mu_i^*(x) = -(R_i + B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}A_i x, \qquad i = 0, \ldots, p-2,
\]
\[
\mu_{p-1}^*(x) = -(R_{p-1} + B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0A_{p-1}x,
\]
and $K_0, \ldots, K_{p-1}$ satisfy the coupled set of $p$ algebraic Riccati equations
\[
K_i = A_i'\bigl[K_{i+1} - K_{i+1}B_i(R_i + B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\bigr]A_i + Q_i, \qquad i = 0, \ldots, p-2,
\]
\[
K_{p-1} = A_{p-1}'\bigl[K_0 - K_0B_{p-1}(R_{p-1} + B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0\bigr]A_{p-1} + Q_{p-1}.
\]
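The successive approximation scheme above is directly implementable. The sketch below (made-up $A$, $B$, $Q$, $R$ for a controllable pair; the stationary, non-periodic case) iterates $K_k$ and checks the algebraic Riccati equation at the limit:

```python
import numpy as np

A = np.array([[1.0, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = Q.copy()                                   # K_0 = Q
for _ in range(200):
    L = -np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A)   # L_{k+1}
    D = A + B @ L
    K = Q + L.T @ R @ L + D.T @ K @ D          # K_{k+1}
# Fixed-point check: K satisfies the algebraic Riccati equation.
K_check = A.T @ (K - K @ B @ np.linalg.solve(R + B.T @ K @ B, B.T @ K)) @ A + Q
print(np.max(np.abs(K - K_check)))             # ~ 0
```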
3.14
The formulation of the problem falls under assumption P for periodic policies. All the more, the problem is discounted. Since the $w_k$ are independent with zero mean, the optimality equation for the equivalent stationary problem reduces to the following system of equations:
\[
\tilde J^*(x_0, 0) = \min_{u_0\in U(x_0)} E_{w_0}\bigl\{x_0'Q_0x_0 + u_0(x_0)'R_0u_0(x_0) + \alpha\tilde J^*(A_0x_0 + B_0u_0 + w_0,\, 1)\bigr\},
\]
\[
\tilde J^*(x_1, 1) = \min_{u_1\in U(x_1)} E_{w_1}\bigl\{x_1'Q_1x_1 + u_1(x_1)'R_1u_1(x_1) + \alpha\tilde J^*(A_1x_1 + B_1u_1 + w_1,\, 2)\bigr\},
\]
\[
\cdots
\]
\[
\tilde J^*(x_{p-1}, p-1) = \min_{u_{p-1}\in U(x_{p-1})} E_{w_{p-1}}\bigl\{x_{p-1}'Q_{p-1}x_{p-1} + u_{p-1}(x_{p-1})'R_{p-1}u_{p-1}(x_{p-1}) + \alpha\tilde J^*(A_{p-1}x_{p-1} + B_{p-1}u_{p-1} + w_{p-1},\, 0)\bigr\}. \tag{1}
\]
From the analysis in Section 7.8 of Ch. 7 on periodic problems we see that there exists a periodic policy
\[
\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{p-1}^*, \mu_0^*, \mu_1^*, \ldots, \mu_{p-1}^*, \ldots\}
\]
which is optimal. In order to obtain the solution we argue as follows. Let us assume that the solution is of the same form as the one for the general quadratic problem. In particular, assume that
\[
\tilde J^*(x, i) = x'K_ix + c_i,
\]
where $c_i$ is a constant and $K_i$ is positive definite. This is justified by applying the successive approximation method and observing that the sets
\[
U_k(x_i, \lambda, i) = \bigl\{u_i \in \Re^m \mid x'Qx + u_i'Ru_i + \alpha(Ax + Bu_i)'K^k_{i+1}(Ax + Bu_i) \le \lambda\bigr\}
\]
are compact. The latter claim can be seen from the fact that $R > 0$ and $K^k_{i+1} \ge 0$. Then by Proposition 7.7, $\lim_{k\to\infty}\tilde J_k(x_i, i) = \tilde J^*(x_i, i)$, and the form of the solution obtained from successive approximation is as described above.

In particular, we have for $0 \le i \le p-1$:
\[
\tilde J^*(x, i) = \min_{u_i\in U(x_i)} E_{w_i}\bigl\{x'Q_ix + u_i(x)'R_iu_i(x) + \alpha\tilde J^*(A_ix + B_iu_i + w_i,\, i+1)\bigr\}
\]
\[
= \min_{u_i\in U(x_i)} E_{w_i}\bigl\{x'Q_ix + u_i(x)'R_iu_i(x) + \alpha\bigl[(A_ix + B_iu_i + w_i)'K_{i+1}(A_ix + B_iu_i + w_i) + c_{i+1}\bigr]\bigr\}
\]
\[
= \min_{u_i\in U(x_i)} E_{w_i}\bigl\{x'(Q_i + \alpha A_i'K_{i+1}A_i)x + u_i'(R_i + \alpha B_i'K_{i+1}B_i)u_i + 2\alpha x'A_i'K_{i+1}B_iu_i + 2\alpha w_i'K_{i+1}B_iu_i + 2\alpha x'A_i'K_{i+1}w_i + \alpha w_i'K_{i+1}w_i + \alpha c_{i+1}\bigr\}
\]
\[
= \min_{u_i\in U(x_i)}\bigl\{x'(Q_i + \alpha A_i'K_{i+1}A_i)x + u_i'(R_i + \alpha B_i'K_{i+1}B_i)u_i + 2\alpha x'A_i'K_{i+1}B_iu_i\bigr\} + \alpha E_{w_i}\{w_i'K_{i+1}w_i\} + \alpha c_{i+1},
\]
where we have taken into consideration the fact that $E(w_i) = 0$. Minimizing the above quantity will give us
\[
u_i = -\alpha(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}A_ix. \tag{2}
\]
Thus
\[
\tilde J^*(x, i) = x'\bigl[Q_i + A_i'\bigl(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\bigr)A_i\bigr]x + c_i = x'K_ix + c_i,
\]
where $c_i = \alpha E_{w_i}\{w_i'K_{i+1}w_i\} + \alpha c_{i+1}$ and
\[
K_i = Q_i + A_i'\bigl(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\bigr)A_i.
\]
Now for this solution to be consistent we must have $K_p = K_0$. This leads to the following system of equations:
\[
K_0 = Q_0 + A_0'\bigl(\alpha K_1 - \alpha^2K_1B_0(R_0 + \alpha B_0'K_1B_0)^{-1}B_0'K_1\bigr)A_0,
\]
\[
\cdots
\]
\[
K_i = Q_i + A_i'\bigl(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\bigr)A_i,
\]
\[
\cdots
\]
\[
K_{p-1} = Q_{p-1} + A_{p-1}'\bigl(\alpha K_0 - \alpha^2K_0B_{p-1}(R_{p-1} + \alpha B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0\bigr)A_{p-1}. \tag{3}
\]
This system of equations has a positive definite solution since (from the description of the problem) the system is controllable, i.e., there exists a sequence of controls $\{u_0, \ldots, u_r\}$ such that $x_{r+1} = 0$. Thus the result follows.
3.16
(a) Consider the stationary policy $\{\mu_0, \mu_0, \ldots\}$, where $\mu_0(x) = L_0x$. We have
\[
J_0(x) = 0,
\]
\[
T_{\mu_0}(J_0)(x) = x'Qx + x'L_0'RL_0x,
\]
\[
T^2_{\mu_0}(J_0)(x) = x'Qx + x'L_0'RL_0x + \alpha(Ax + BL_0x + w)'Q(Ax + BL_0x + w) = x'M_1x + \text{constant},
\]
where $M_1 = Q + L_0'RL_0 + \alpha(A + BL_0)'Q(A + BL_0)$, and
\[
T^3_{\mu_0}(J_0)(x) = x'Qx + x'L_0'RL_0x + \alpha(Ax + BL_0x + w)'M_1(Ax + BL_0x + w) + \text{constant} = x'M_2x + \text{constant}.
\]
Continuing similarly, we get
\[
M_{k+1} = Q + L_0'RL_0 + \alpha(A + BL_0)'M_k(A + BL_0).
\]
Using a very similar analysis as in Section 8.2, we get $M_k \to K_0$, where
\[
K_0 = Q + L_0'RL_0 + \alpha(A + BL_0)'K_0(A + BL_0).
\]
(b)
\[
J_{\mu_1}(x) = \lim_{N\to\infty} E_{\substack{w_k\\ k=0,\ldots,N-1}}\Bigl\{\sum_{k=0}^{N-1}\alpha^k\bigl(x_k'Qx_k + \mu_1(x_k)'R\mu_1(x_k)\bigr)\Bigr\} = \lim_{N\to\infty} T^N_{\mu_1}(J_{\mu_0})(x).
\]
Proceeding as in the proof of the validity of policy iteration (Section 7.3, Chapter 7), we have
\[
T_{\mu_1}(J_{\mu_0}) = T(J_{\mu_0}),
\]
\[
J_{\mu_0}(x) = x'K_0x + \text{constant} = T_{\mu_0}(J_{\mu_0})(x) \ge T_{\mu_1}(J_{\mu_0})(x).
\]
Hence, we obtain
\[
J_{\mu_0}(x) \ge T_{\mu_1}(J_{\mu_0})(x) \ge \cdots \ge T^k_{\mu_1}(J_{\mu_0})(x) \ge \cdots,
\]
implying
\[
J_{\mu_0}(x) \ge \lim_{k\to\infty} T^k_{\mu_1}(J_{\mu_0})(x) = J_{\mu_1}(x).
\]
(c) As in part (b), we show that
\[
J_{\mu_k}(x) = x'K_kx + \text{constant} \le J_{\mu_{k-1}}(x).
\]
Now since
\[
0 \le x'K_kx \le x'K_{k-1}x, \qquad \forall\, x,
\]
we have $K_k \to K$. The form of $K$ is
\[
K = \alpha(A + BL)'K(A + BL) + Q + L'RL, \qquad L = -\alpha(R + \alpha B'KB)^{-1}B'KA.
\]
To show that $K$ is indeed the optimal cost matrix, we have to show that it satisfies
\[
K = A'\bigl[\alpha K - \alpha^2KB(R + \alpha B'KB)^{-1}B'K\bigr]A + Q = \alpha A'KA + \alpha A'KBL + Q.
\]
Let us expand the formula for $K$, using the formula for $L$:
\[
K = \alpha\bigl(A'KA + A'KBL + L'B'KA + L'B'KBL\bigr) + Q + L'RL.
\]
Substituting, and using $L'(R + \alpha B'KB)L = -\alpha L'B'KA$, we get
\[
K = \alpha\bigl(A'KA + A'KBL + L'B'KA\bigr) + Q - \alpha L'B'KA = \alpha A'KA + \alpha A'KBL + Q.
\]
Thus $K$ is the optimal cost matrix.
A second approach: (a) We know that
\[
J_{\mu_0}(x) = \lim_{n\to\infty} T^n_{\mu_0}(J_0)(x).
\]
Following the analysis of Section 8.1 we have
\[
J_0(x) = 0,
\]
\[
T_{\mu_0}(J)(x) = E\bigl\{x'Qx + \mu_0(x)'R\mu_0(x)\bigr\} = x'Qx + \mu_0(x)'R\mu_0(x) = x'(Q + L_0'RL_0)x,
\]
\[
T^2_{\mu_0}(J)(x) = E\bigl\{x'Qx + \mu_0(x)'R\mu_0(x) + \alpha(Ax + B\mu_0(x) + w)'Q(Ax + B\mu_0(x) + w)\bigr\}
= x'\bigl(Q + L_0'RL_0 + \alpha(A + BL_0)'Q(A + BL_0)\bigr)x + \alpha E\{w'Qw\}.
\]
Define
\[
K^0_0 = Q,
\qquad
K^{k+1}_0 = Q + L_0'RL_0 + \alpha(A + BL_0)'K^k_0(A + BL_0).
\]
Then
\[
T^{k+1}_{\mu_0}(J)(x) = x'K^{k+1}_0x + \sum_{m=0}^{k-1}\alpha^{k-m}E\{w'K^m_0w\}.
\]
The convergence of $K^{k+1}_0$ follows from the analysis of Section 4.1. Thus
\[
J_{\mu_0}(x) = x'K_0x + \frac{\alpha}{1-\alpha}E\{w'K_0w\}
\]
(as in Section 8.1), which proves the required relation.
(b) Let $\mu_1(x)$ be the solution of
\[
\min_u\bigl\{u'Ru + \alpha(Ax + Bu)'K_0(Ax + Bu)\bigr\},
\]
which yields
\[
u_1 = -\alpha(R + \alpha B'K_0B)^{-1}B'K_0Ax = L_1x.
\]
Thus
\[
L_1 = -\alpha(R + \alpha B'K_0B)^{-1}B'K_0A = -M^{-1}\Lambda,
\]
where $M = R + \alpha B'K_0B$ and $\Lambda = \alpha B'K_0A$. Let us consider the cost associated with $\mu_1$ if we ignore $w$:
\[
J_{\mu_1}(x_0) = \sum_{k=0}^\infty \alpha^k\bigl(x_k'Qx_k + \mu_1(x_k)'R\mu_1(x_k)\bigr) = \sum_{k=0}^\infty \alpha^kx_k'(Q + L_1'RL_1)x_k.
\]
However, we know the following:
\[
x_{k+1} = (A + BL_1)^{k+1}x_0 + \sum_{m=1}^{k+1}(A + BL_1)^{k+1-m}w_m.
\]
Thus, if we ignore the disturbance $w$, we get
\[
J_{\mu_1}(x_0) = x_0'\Bigl[\sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k(Q + L_1'RL_1)(A + BL_1)^k\Bigr]x_0.
\]
Let us call
\[
K_1 = \sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k(Q + L_1'RL_1)(A + BL_1)^k. \tag{1}
\]
We know that
\[
Q = K_0 - \alpha(A + BL_0)'K_0(A + BL_0) - L_0'RL_0.
\]
Substituting in (1), we have
\[
K_1 = \sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k\bigl(K_0 - \alpha(A + BL_1)'K_0(A + BL_1)\bigr)(A + BL_1)^k
\]
\[
\qquad + \sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k\bigl[\alpha(A + BL_1)'K_0(A + BL_1) - \alpha(A + BL_0)'K_0(A + BL_0) + L_1'RL_1 - L_0'RL_0\bigr](A + BL_1)^k.
\]
However, we know that
\[
K_0 = \sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k\bigl(K_0 - \alpha(A + BL_1)'K_0(A + BL_1)\bigr)(A + BL_1)^k.
\]
Thus we conclude that
\[
K_1 - K_0 = \sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k\,\Delta\,(A + BL_1)^k,
\]
where
\[
\Delta = \alpha(A + BL_1)'K_0(A + BL_1) - \alpha(A + BL_0)'K_0(A + BL_0) + L_1'RL_1 - L_0'RL_0.
\]
We manipulate this expression further and obtain
\[
\Delta = L_1'(R + \alpha B'K_0B)L_1 - L_0'(R + \alpha B'K_0B)L_0 + \alpha L_1'B'K_0A + \alpha A'K_0BL_1 - \alpha L_0'B'K_0A - \alpha A'K_0BL_0
\]
\[
= L_1'ML_1 - L_0'ML_0 + L_1'\Lambda + \Lambda'L_1 - L_0'\Lambda - \Lambda'L_0
\]
\[
= -(L_0 - L_1)'M(L_0 - L_1) - (\Lambda + ML_1)'(L_0 - L_1) - (L_0 - L_1)'(\Lambda + ML_1).
\]
However, it is seen that
\[
\Lambda + ML_1 = 0.
\]
Thus
\[
\Delta = -(L_0 - L_1)'M(L_0 - L_1).
\]
Since $M > 0$ we conclude that
\[
K_0 - K_1 = \sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k(L_0 - L_1)'M(L_0 - L_1)(A + BL_1)^k \ge 0.
\]
Similarly, the optimal solution for the case where there are no disturbances satisfies the equation
\[
K = Q + L'RL + \alpha(A + BL)'K(A + BL),
\]
with $L = -\alpha(R + \alpha B'KB)^{-1}B'KA$. If we follow the same steps as above we will obtain
\[
K_1 - K = \sum_{k=0}^\infty \alpha^k\bigl((A + BL_1)'\bigr)^k(L_1 - L)'M(L_1 - L)(A + BL_1)^k \ge 0.
\]
Thus $K \le K_1 \le K_0$. Since $K_1$ is bounded, we conclude that $A + BL_1$ is stable (otherwise $K_1 \to \infty$). Thus the sum converges, and $K_1$ is the solution of
\[
K_1 = \alpha(A + BL_1)'K_1(A + BL_1) + Q + L_1'RL_1.
\]
Now, returning to the case with the disturbances $w$, we conclude as in case (a) that
\[
J_{\mu_1}(x) = x'K_1x + \frac{\alpha}{1-\alpha}E\{w'K_1w\}.
\]
Since $K_1 \le K_0$, we conclude that $J_{\mu_1}(x) \le J_{\mu_0}(x)$, which proves the result.
(c) The policy iteration is defined as follows. Let
\[
L_k = -\alpha(R + \alpha B'K_{k-1}B)^{-1}B'K_{k-1}A.
\]
Then $\mu_k(x) = L_kx$ and
\[
J_{\mu_k}(x) = x'K_kx + \frac{\alpha}{1-\alpha}E\{w'K_kw\},
\]
where $K_k$ is obtained as the solution of
\[
K_k = \alpha(A + BL_k)'K_k(A + BL_k) + Q + L_k'RL_k. \tag{2}
\]
If we follow the steps of (b), we can prove that
\[
K \le K_k \le \cdots \le K_1 \le K_0.
\]
Thus, by the theorem of monotone convergence of positive operators (Kantorovich and Akilov, Functional Analysis in Normed Spaces, p. 189), we conclude that
\[
K_\infty = \lim_{k\to\infty} K_k
\]
exists. Then if we take the limit of both sides of Eq. (2), we have
\[
K_\infty = \alpha(A + BL_\infty)'K_\infty(A + BL_\infty) + Q + L_\infty'RL_\infty,
\]
with
\[
L_\infty = -\alpha(R + \alpha B'K_\infty B)^{-1}B'K_\infty A.
\]
However, according to Section 4.1, $K$ is the unique solution of the above equation. Thus $K_\infty = K$ and the result follows.
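The policy iteration of part (c) can be sketched directly (made-up data; the initial gain $L_0$ is a hypothetical choice making $A + BL_0$ stable for the discounted problem, and the Lyapunov equation in the evaluation step is solved by simple iteration rather than a library call):

```python
import numpy as np

alpha = 0.9
A = np.array([[1.0, 0.3], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[-0.5, -1.0]])               # L0: sqrt(alpha)*(A+B L0) stable

for k in range(30):
    D = np.sqrt(alpha) * (A + B @ L)
    # Policy evaluation: K solves K = Q + L'RL + alpha (A+BL)' K (A+BL).
    K = np.zeros((2, 2))
    for _ in range(2000):
        K = Q + L.T @ R @ L + D.T @ K @ D
    # Policy improvement: L_{k+1} = -alpha (R + alpha B'KB)^{-1} B'KA.
    L = -alpha * np.linalg.solve(R + alpha * B.T @ K @ B, B.T @ K @ A)
print(K)   # K_k decreases monotonically to the optimal cost matrix K
```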
Solutions Vol. II, Chapter 4
4.4
(a) We have
\[
T^{k+1}h^0 = T\bigl(T^kh^0\bigr) = T\Bigl(h^k_i + \bigl(T^kh^0\bigr)(i)\,e\Bigr) = Th^k_i + \bigl(T^kh^0\bigr)(i)\,e.
\]
The $i$th component of this equation yields
\[
\bigl(T^{k+1}h^0\bigr)(i) = \bigl(Th^k_i\bigr)(i) + \bigl(T^kh^0\bigr)(i).
\]
Subtracting these two relations, we obtain
\[
T^{k+1}h^0 - \bigl(T^{k+1}h^0\bigr)(i)\,e = Th^k_i - \bigl(Th^k_i\bigr)(i)\,e,
\]
from which
\[
h^{k+1}_i = Th^k_i - \bigl(Th^k_i\bigr)(i)\,e.
\]
Similarly, we have
\[
T^{k+1}h^0 = T\bigl(T^kh^0\bigr) = T\Bigl(\bar h^k + \frac1n\sum_i\bigl(T^kh^0\bigr)(i)\,e\Bigr) = T\bar h^k + \frac1n\sum_i\bigl(T^kh^0\bigr)(i)\,e.
\]
From this equation, we obtain
\[
\frac1n\sum_i\bigl(T^{k+1}h^0\bigr)(i) = \frac1n\sum_i\bigl(T\bar h^k\bigr)(i) + \frac1n\sum_i\bigl(T^kh^0\bigr)(i).
\]
By subtracting these two relations, we obtain
\[
\bar h^{k+1} = T\bar h^k - \Bigl(\frac1n\sum_i\bigl(T\bar h^k\bigr)(i)\Bigr)e.
\]
The proof for $\hat h^k$ is similar.

(b) We have
\[
\bar h^k = T^kh^0 - \Bigl(\frac1n\sum_i\bigl(T^kh^0\bigr)(i)\Bigr)e = \frac1n\sum_{i=1}^n h^k_i.
\]
So since $h^k_i$ converges, the same is true for $\bar h^k$. Also,
\[
\hat h^k = T^kh^0 - \min_i\bigl(T^kh^0\bigr)(i)\,e
\]
and
\[
\hat h^k(j) = \bigl(T^kh^0\bigr)(j) - \min_i\bigl(T^kh^0\bigr)(i) = \max_i\Bigl[\bigl(T^kh^0\bigr)(j) - \bigl(T^kh^0\bigr)(i)\Bigr] = \max_i h^k_i(j).
\]
Since $h^k_i$ converges, the same is true for $\hat h^k$.
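The recursion $h^{k+1}_i = Th^k_i - (Th^k_i)(i)e$ is exactly relative value iteration with reference state $i$; a minimal sketch on a made-up unichain MDP (all data hypothetical):

```python
import numpy as np

np.random.seed(6)
n, m = 4, 3
P = np.random.rand(m, n, n); P /= P.sum(axis=2, keepdims=True)
g = np.random.rand(m, n)

def T(h):
    return (g + P @ h).min(axis=0)

i_ref = 0
h = np.zeros(n)
for _ in range(2000):
    Th = T(h)
    lam = Th[i_ref]          # estimate of the optimal average cost
    h = Th - lam             # h^{k+1} = T h^k - (T h^k)(i_ref) e
print(lam, h)                # lam -> optimal average cost per stage
```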
4.8
Bellman's equation for the auxiliary $(1-\beta)$-discounted problem is as follows:
\[
\tilde J(i) = \min_{u\in U(i)}\Bigl[g(i,u) + (1-\beta)\sum_j \tilde p_{ij}(u)\tilde J(j)\Bigr]. \tag{1}
\]
Using the definition of $\tilde p_{ij}(u)$, we obtain
\[
\sum_j \tilde p_{ij}(u)\tilde J(j) = \sum_{j\ne t}(1-\beta)^{-1}p_{ij}(u)\tilde J(j) + (1-\beta)^{-1}\bigl(p_{it}(u) - \beta\bigr)\tilde J(t),
\]
or
\[
\sum_j \tilde p_{ij}(u)\tilde J(j) = \sum_j(1-\beta)^{-1}p_{ij}(u)\tilde J(j) - (1-\beta)^{-1}\beta\tilde J(t).
\]
This together with (1) leads to
\[
\tilde J(i) = \min_{u\in U(i)}\Bigl[g(i,u) + \sum_j p_{ij}(u)\tilde J(j) - \beta\tilde J(t)\Bigr],
\]
or, equivalently,
\[
\beta\tilde J(t) + \tilde J(i) = \min_{u\in U(i)}\Bigl[g(i,u) + \sum_j p_{ij}(u)\tilde J(j)\Bigr]. \tag{2}
\]
Returning to the problem of minimizing the average cost per stage, we notice that we have to solve the equation
\[
\lambda + h(i) = \min_{u\in U(i)}\Bigl[g(i,u) + \sum_j p_{ij}(u)h(j)\Bigr]. \tag{3}
\]
Using (2), it follows that (3) is satisfied for $\lambda = \beta\tilde J(t)$ and $h(i) = \tilde J(i)$ for all $i$. Thus, by Proposition 2.1, we conclude that $\beta\tilde J(t)$ is the optimal average cost and $\tilde J(i)$ is a corresponding differential cost at state $i$.
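This reduction can be checked numerically: run value iteration on the auxiliary $(1-\beta)$-discounted problem and verify that $\bigl(\beta\tilde J(t), \tilde J\bigr)$ satisfies the average cost optimality equation (3). A sketch with made-up data, chosen so that $p_{it}(u) > \beta$ for all $i, u$ (which keeps the auxiliary probabilities nonnegative):

```python
import numpy as np

np.random.seed(7)
n, m, beta = 4, 3, 0.15
t = n - 1                                 # the fixed state t of the exercise
P = np.random.rand(m, n, n) + 0.01
P[:, :, t] += 1.0                         # boost p_it so p_it(u) > beta
P /= P.sum(axis=2, keepdims=True)
g = np.random.rand(m, n)

P_aux = P / (1 - beta)                    # p~_ij(u) = p_ij(u)/(1-beta), j != t
P_aux[:, :, t] = (P[:, :, t] - beta) / (1 - beta)

J = np.zeros(n)                           # value iteration, discount 1-beta
for _ in range(3000):
    J = (g + (1 - beta) * (P_aux @ J)).min(axis=0)

lam, h = beta * J[t], J
# (lam, h) satisfies the average cost optimality equation (3):
assert np.allclose(lam + h, (g + P @ h).min(axis=0))
print("optimal average cost:", lam)
```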