
SC 643: Stochastic Networked Control Autumn 2019

Lecture 10: September 3 2019


Instructor: Ankur A. Kulkarni Scribes: Ankit Pal

Note: LaTeX template courtesy of UC Berkeley EECS dept.


Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.

10.1 Unichain

Definition 10.1.1. An MDP is said to be unichain if every deterministic stationary policy yields a Markov chain with a single communicating class of recurrent states and a set of transient states, which may be empty.

• For a general MDP, checking the unichain property is NP-hard.


• If $P(j \mid s, a) > 0$ for all $j \in S$, $s \in S$, and $a \in A_s$, then the unichain property trivially holds.

The value function for discounted reward is given by,


$$V_\lambda(s) = \max_{a \in A_s} \Big\{\, r(s,a) + \lambda \sum_{j \in S} P(j \mid s, a)\, V_\lambda(j) \Big\} \tag{10.1}$$

for all $s \in S$ and $\lambda \in (0,1)$.
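
As a quick illustration of equation (10.1), here is a minimal value-iteration sketch in Python. The arrays `P` (states × actions × states) and `r` (states × actions) and the function name `value_iteration` are illustrative placeholders, not objects defined in the lecture.

```python
# A minimal value-iteration sketch for equation (10.1).  Hypothetical inputs:
# P[s, a, j] = P(j | s, a) and r[s, a] = r(s, a).
import numpy as np

def value_iteration(P, r, lam, tol=1e-8, max_iter=10_000):
    """Iterate V <- max_a { r(s,a) + lam * sum_j P(j|s,a) V(j) } to a fixed point."""
    V = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Q = r + lam * (P @ V)          # Q[s, a]; P @ V contracts over the next state j
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)     # V_lambda and a greedy (optimal) policy
```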


For average reward,
$$g^\pi(s) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^\pi_s\!\left[ r(X_t, Y_t) \right] \tag{10.2}$$

Now, rewriting equation (10.1) by replacing the value function (heuristically, $V_\lambda(s) \approx \frac{g}{1-\lambda} + h(s)$) and letting $\lambda \to 1$, we obtain


$$g + h(s) = \max_{a \in A_s} \Big\{\, r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j) \Big\} \tag{10.3}$$

where $g$ is a constant and $h$ is a function of the state. We need to find $g$ and $h$ that satisfy equation (10.3).
Equation (10.3) is referred to as the Optimality Equation.
Theorem 10.1. Consider a finite state-action unichain MDP with any reward. Then any deterministic stationary Markov optimal policy $d^\infty$ satisfies the following equations

$$g + h(s) = \max_{a \in A_s} \Big\{\, r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j) \Big\} \tag{10.4}$$

and

$$d(s) \in \operatorname*{argmax}_{a \in A_s} \Big\{\, r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j) \Big\}. \tag{10.5}$$

Equation (10.4) is called the Average Cost Optimality Equation (ACOE) or Bellman's equation. Moreover, $g = g^*(s)$ for all $s$, i.e., the optimal average reward does not depend on the starting state, while $h$ is not unique; $h(s)$ is called the relative value function.
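
A standard way to solve the ACOE numerically (not derived in the lecture) is relative value iteration, which pins $h$ at a reference state so the iterates stay bounded. The sketch below assumes the same hypothetical arrays `P` and `r` as above; convergence is guaranteed for unichain, aperiodic chains, and an aperiodicity transformation may be needed otherwise.

```python
# A sketch of relative value iteration for the ACOE (10.4), assuming a unichain
# MDP given by hypothetical arrays P (S x A x S) and r (S x A).
import numpy as np

def relative_value_iteration(P, r, ref_state=0, tol=1e-10, max_iter=100_000):
    """Solve g + h(s) = max_a { r(s,a) + sum_j P(j|s,a) h(j) } for (g, h)."""
    h = np.zeros(r.shape[0])
    for _ in range(max_iter):
        w = (r + P @ h).max(axis=1)    # one undiscounted Bellman backup
        g = w[ref_state]               # running estimate of the average reward
        h_new = w - g                  # pin h(ref_state) = 0 to keep iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return g, h_new
```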


Theorem 10.2. For a finite state-action unichain MDP, there exists a solution to the ACOE and the
resulting d∞ is optimal.
The optimal average reward can be computed via the following linear program:

$$\begin{aligned}
\min_{g,\,h} \quad & g \\
\text{s.t.} \quad & g + h(s) \ge r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j), \quad \forall a \in A_s,\ \forall s \in S
\end{aligned} \tag{10.6}$$

The above inequality constraint is called Poisson’s inequality.
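
As a sketch of how the primal LP (10.6) could be set up in practice, the code below uses scipy.optimize.linprog (assuming SciPy is available); the inputs `P`, `r` and the helper name `solve_acoe_lp` are hypothetical, not part of the lecture.

```python
# Sketch: solve the primal LP (10.6) with decision variables z = [g, h(0..S-1)].
import numpy as np
from scipy.optimize import linprog

def solve_acoe_lp(P, r):
    S, A = r.shape
    c = np.zeros(1 + S)
    c[0] = 1.0                                     # objective: minimize g
    A_ub, b_ub = [], []
    for s in range(S):
        for a in range(A):
            row = np.zeros(1 + S)
            row[0] = -1.0                          # -g
            row[1:] = P[s, a].astype(float)        # + sum_j P(j|s,a) h(j)
            row[1 + s] -= 1.0                      # - h(s)
            A_ub.append(row)
            b_ub.append(-r[s, a])                  # <= -r(s,a)  (Poisson's inequality, flipped)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (1 + S))  # g and h are free (sign-unconstrained)
    return res.x[0], res.x[1:]                      # optimal g and the accompanying h
```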


Now, writing the dual of the above problem, we obtain:

$$\begin{aligned}
\max_{x} \quad & \sum_{s \in S} \sum_{a \in A_s} r(s,a)\, x(s,a) \\
\text{s.t.} \quad & \sum_{a \in A_j} x(j,a) = \sum_{s \in S} \sum_{a \in A_s} x(s,a)\, P(j \mid s, a), \quad \forall j \in S, \\
& \sum_{s \in S} \sum_{a \in A_s} x(s,a) = 1, \\
& x(s,a) \ge 0, \quad \forall s \in S,\ \forall a \in A_s
\end{aligned} \tag{10.7}$$

where $x(s,a)$ is known as the occupation measure.

The left-hand side of the first constraint in equation (10.7) is the probability of being in state $j$, and the right-hand side is the probability of transitioning into state $j$ from state $s$ under action $a$, with $(s,a)$ distributed according to the same $x$.
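
The dual LP (10.7) can be set up in the same way; the sketch below again assumes hypothetical inputs `P` and `r` and the availability of SciPy. A stationary randomized policy can be read off from the optimal occupation measure as $d(a \mid s) = x(s,a) / \sum_a x(s,a)$ on states where the row sum is positive.

```python
# Sketch: solve the dual LP (10.7) over the occupation measure x(s, a).
import numpy as np
from scipy.optimize import linprog

def solve_occupation_lp(P, r):
    S, A = r.shape
    c = -r.reshape(-1)                            # linprog minimizes, so negate the reward
    A_eq = np.zeros((S + 1, S * A))
    b_eq = np.zeros(S + 1)
    for j in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[j, s * A + a] += P[s, a, j]  # inflow into state j
        for a in range(A):
            A_eq[j, j * A + a] -= 1.0             # minus outflow from state j
    A_eq[S, :] = 1.0                              # x must be a probability distribution
    b_eq[S] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
    return res.x.reshape(S, A)                    # x[s, a] = occupation measure
```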

10.1.1 Constrained MDP (CMDP)

The CMDP takes the following form:

$$\begin{aligned}
\max_{\pi} \quad & \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\!\left[ r(X_t, Y_t) \right] \\
\text{such that} \quad & \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\!\left[ C(X_t, Y_t) \right] \le c
\end{aligned} \tag{10.8}$$

where the maximization is over policies $\pi$, $C$ is a per-stage cost, and $c$ is a prescribed bound on the long-run average cost.
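
Although the lecture does not work this out, one common way to view (10.8), under the same occupation-measure formulation as in (10.7), is as the dual LP with one extra linear constraint (a sketch; $c$ is the assumed cost bound from (10.8)):

$$\max_{x \ge 0} \ \sum_{s \in S} \sum_{a \in A_s} r(s,a)\, x(s,a)
\quad \text{s.t.} \quad \sum_{s \in S} \sum_{a \in A_s} C(s,a)\, x(s,a) \le c,
\ \text{and } x \text{ satisfies the constraints in (10.7)}.$$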

10.1.2 Relation between Discounted Reward and Average Reward

The value function for the discounted reward is denoted $V_\lambda(s)$ and is governed by equation (10.1), whereas in the average-reward case the value function is described by $g$ and $h$ and governed by equation (10.3). The two are related as follows:

$$\begin{aligned}
g &= \lim_{\lambda \to 1} (1 - \lambda)\, V_\lambda(s), \quad \forall s \in S \quad \text{(under the unichain assumption)}, \\
h(s) &= \lim_{\lambda \to 1} \big( V_\lambda(s) - V_\lambda(j) \big), \quad \text{for any fixed reference state } j
\end{aligned} \tag{10.9}$$
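
Relation (10.9) can be checked numerically. The sketch below reuses the hypothetical helpers `value_iteration` and `relative_value_iteration` from the earlier sketches in these notes; as $\lambda \to 1$, the printed gaps should shrink toward zero.

```python
# Sketch: numerical check of (10.9) using the hypothetical helpers defined above.
import numpy as np

def check_vanishing_discount(P, r, ref_state=0):
    """Compare (1 - lam) V_lam with g, and V_lam(s) - V_lam(ref) with h."""
    g, h = relative_value_iteration(P, r, ref_state)
    for lam in (0.9, 0.99, 0.999):
        V, _ = value_iteration(P, r, lam)
        gap_g = abs((1 - lam) * V[ref_state] - g)
        gap_h = np.max(np.abs((V - V[ref_state]) - h))
        print(f"lambda={lam}: |(1-lam)V - g| = {gap_g:.2e}, max|dV - h| = {gap_h:.2e}")
```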

10.2 Partially Observed MDP

The key ingredients required to address a partially observed MDP are as follows (a small code sketch collecting these ingredients appears after the list):

1. State space, $S$

2. Action space, $A$

3. Transition probability, $P(x_{t+1} = j \mid x_t = s, u_t = a)$

4. Observation channel, $P(y_{t+1} = y \mid x_t = s, u_t = a)$, where $y_t$ is the observation at time $t$, $x_t$ is the state at time $t$, and $u_t$ is the action at time $t$.

5. Information: $y_1, \dots, y_t$ (perfect observation $\Leftrightarrow$ the observation space equals the state space and $y_t = x_t$ with probability 1)

6. Reward, $r(x_t, u_t)$
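
As a compact reference, the sketch below collects these ingredients into a single container. The class name, field names, and array layouts are illustrative choices, not notation fixed by the lecture.

```python
# A minimal container for the POMDP ingredients listed above (illustrative only).
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    P: np.ndarray   # P[s, a, j] = P(x_{t+1} = j | x_t = s, u_t = a), transition kernel
    O: np.ndarray   # O[s, a, y] = P(y_{t+1} = y | x_t = s, u_t = a), observation channel
    r: np.ndarray   # r[s, a]    = reward for taking action a in state s

    @property
    def n_states(self) -> int:
        return self.r.shape[0]

    @property
    def n_actions(self) -> int:
        return self.r.shape[1]
```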

