
SC 643: Stochastic Networked Control Autumn 2019

Lecture 10: September 3 2019


Instructor: Ankur A. Kulkarni Scribes: Ankit Pal

Note: LaTeX template courtesy of UC Berkeley EECS dept.


Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.

10.1 Unichain

Definition 10.1.1. An MDP is said to be unichain if every deterministic stationary policy yields a Markov chain with a single communicating class of recurrent states and a set of transient states, which may be empty.

• For a general MDP, checking the unichain property is NP-hard.


• If $P(j \mid s, a) > 0$ for all $j \in S$, $s \in S$, and $a \in A_s$, then the unichain property trivially holds.

The value function for discounted reward is given by,


$$V_\lambda(s) = \max_{a \in A_s} \Big\{\, r(s,a) + \lambda \sum_{j \in S} P(j \mid s, a)\, V_\lambda(j) \Big\} \tag{10.1}$$

for all $s \in S$ and $\lambda \in (0,1)$.
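
As a quick illustration of equation (10.1), here is a minimal value-iteration sketch in Python. The arrays `P` (states × actions × states) and `r` (states × actions) and the function name `value_iteration` are illustrative placeholders, not objects defined in the lecture.

```python
# A minimal value-iteration sketch for equation (10.1).  Hypothetical inputs:
# P[s, a, j] = P(j | s, a) and r[s, a] = r(s, a).
import numpy as np

def value_iteration(P, r, lam, tol=1e-8, max_iter=10_000):
    """Iterate V <- max_a { r(s,a) + lam * sum_j P(j|s,a) V(j) } to a fixed point."""
    V = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Q = r + lam * (P @ V)          # Q[s, a]; P @ V contracts over the next state j
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)     # V_lambda and a greedy (optimal) policy
```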


For average reward,
$$g^\pi(s) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^\pi_s\!\left[ r(X_t, Y_t) \right] \tag{10.2}$$

Now, rewriting equation (10.1) by replacing the value function (heuristically, $V_\lambda(s) \approx \frac{g}{1-\lambda} + h(s)$) and letting $\lambda \to 1$, we obtain


$$g + h(s) = \max_{a \in A_s} \Big\{\, r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j) \Big\} \tag{10.3}$$

where $g$ is a constant and $h$ is a function of the state. We need to find $g$ and $h$ that satisfy equation (10.3).
Equation (10.3) is referred to as the Optimality Equation.
Theorem 10.1. Consider a finite state-action unichain MDP with any reward. Then any deterministic stationary Markov optimal policy $d^\infty$ satisfies the following equations

$$g + h(s) = \max_{a \in A_s} \Big\{\, r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j) \Big\} \tag{10.4}$$

and

$$d(s) \in \operatorname*{argmax}_{a \in A_s} \Big\{\, r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j) \Big\}. \tag{10.5}$$

Equation (10.4) is called the Average Cost Optimality Equation (ACOE) or Bellman's equation. Moreover, $g = g^*(s)$ for all $s$, i.e., the optimal average reward does not depend on the starting state, while $h$ is not unique; $h(s)$ is called the relative value function.
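
A standard way to solve the ACOE numerically (not derived in the lecture) is relative value iteration, which pins $h$ at a reference state so the iterates stay bounded. The sketch below assumes the same hypothetical arrays `P` and `r` as above; convergence is guaranteed for unichain, aperiodic chains, and an aperiodicity transformation may be needed otherwise.

```python
# A sketch of relative value iteration for the ACOE (10.4), assuming a unichain
# MDP given by hypothetical arrays P (S x A x S) and r (S x A).
import numpy as np

def relative_value_iteration(P, r, ref_state=0, tol=1e-10, max_iter=100_000):
    """Solve g + h(s) = max_a { r(s,a) + sum_j P(j|s,a) h(j) } for (g, h)."""
    h = np.zeros(r.shape[0])
    for _ in range(max_iter):
        w = (r + P @ h).max(axis=1)    # one undiscounted Bellman backup
        g = w[ref_state]               # running estimate of the average reward
        h_new = w - g                  # pin h(ref_state) = 0 to keep iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return g, h_new
```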


Theorem 10.2. For a finite state-action unichain MDP, there exists a solution to the ACOE and the
resulting d∞ is optimal.
The optimal average reward can be computed via the following linear program:

$$\begin{aligned}
\min_{g,\,h} \quad & g \\
\text{s.t.} \quad & g + h(s) \ge r(s,a) + \sum_{j \in S} P(j \mid s, a)\, h(j), \quad \forall a \in A_s,\ \forall s \in S
\end{aligned} \tag{10.6}$$

The above inequality constraint is called Poisson’s inequality.
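
As a sketch of how the primal LP (10.6) could be set up in practice, the code below uses scipy.optimize.linprog (assuming SciPy is available); the inputs `P`, `r` and the helper name `solve_acoe_lp` are hypothetical, not part of the lecture.

```python
# Sketch: solve the primal LP (10.6) with decision variables z = [g, h(0..S-1)].
import numpy as np
from scipy.optimize import linprog

def solve_acoe_lp(P, r):
    S, A = r.shape
    c = np.zeros(1 + S)
    c[0] = 1.0                                     # objective: minimize g
    A_ub, b_ub = [], []
    for s in range(S):
        for a in range(A):
            row = np.zeros(1 + S)
            row[0] = -1.0                          # -g
            row[1:] = P[s, a].astype(float)        # + sum_j P(j|s,a) h(j)
            row[1 + s] -= 1.0                      # - h(s)
            A_ub.append(row)
            b_ub.append(-r[s, a])                  # <= -r(s,a)  (Poisson's inequality, flipped)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (1 + S))  # g and h are free (sign-unconstrained)
    return res.x[0], res.x[1:]                      # optimal g and the accompanying h
```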


Now, writing the dual of the above problem, we obtain:

$$\begin{aligned}
\max_{x} \quad & \sum_{s \in S} \sum_{a \in A_s} r(s,a)\, x(s,a) \\
\text{s.t.} \quad & \sum_{a \in A_j} x(j,a) = \sum_{s \in S} \sum_{a \in A_s} x(s,a)\, P(j \mid s, a), \quad \forall j \in S, \\
& \sum_{s \in S} \sum_{a \in A_s} x(s,a) = 1, \\
& x(s,a) \ge 0, \quad \forall s \in S,\ \forall a \in A_s
\end{aligned} \tag{10.7}$$

where $x(s,a)$ is known as the occupation measure.

The left-hand side of the first constraint in equation (10.7) is the probability of being in state $j$, and the right-hand side is the probability of transitioning into state $j$ from state $s$ under action $a$, with $(s,a)$ distributed according to the same $x$.
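
The dual LP (10.7) can be set up in the same way; the sketch below again assumes hypothetical inputs `P` and `r` and the availability of SciPy. A stationary randomized policy can be read off from the optimal occupation measure as $d(a \mid s) = x(s,a) / \sum_a x(s,a)$ on states where the row sum is positive.

```python
# Sketch: solve the dual LP (10.7) over the occupation measure x(s, a).
import numpy as np
from scipy.optimize import linprog

def solve_occupation_lp(P, r):
    S, A = r.shape
    c = -r.reshape(-1)                            # linprog minimizes, so negate the reward
    A_eq = np.zeros((S + 1, S * A))
    b_eq = np.zeros(S + 1)
    for j in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[j, s * A + a] += P[s, a, j]  # inflow into state j
        for a in range(A):
            A_eq[j, j * A + a] -= 1.0             # minus outflow from state j
    A_eq[S, :] = 1.0                              # x must be a probability distribution
    b_eq[S] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
    return res.x.reshape(S, A)                    # x[s, a] = occupation measure
```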

10.1.1 Constrained MDP (CMDP)

The CMDP takes the following form:

$$\begin{aligned}
\max_{\pi} \quad & \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\!\left[ r(X_t, Y_t) \right] \\
\text{such that} \quad & \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\!\left[ C(X_t, Y_t) \right] \le c
\end{aligned} \tag{10.8}$$

where the maximization is over policies $\pi$, $C$ is a per-stage cost, and $c$ is a prescribed bound on the long-run average cost.
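
Although the lecture does not work this out, one common way to view (10.8), under the same occupation-measure formulation as in (10.7), is as the dual LP with one extra linear constraint (a sketch; $c$ is the assumed cost bound from (10.8)):

$$\max_{x \ge 0} \ \sum_{s \in S} \sum_{a \in A_s} r(s,a)\, x(s,a)
\quad \text{s.t.} \quad \sum_{s \in S} \sum_{a \in A_s} C(s,a)\, x(s,a) \le c,
\ \text{and } x \text{ satisfies the constraints in (10.7)}.$$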

10.1.2 Relation between Discounted Reward and Average Reward

The value function for the discounted reward is denoted $V_\lambda(s)$ and is governed by equation (10.1), whereas in the average-reward case the value function is described by $g$ and $h$ and governed by equation (10.3). The two are related as follows:

$$\begin{aligned}
g &= \lim_{\lambda \to 1} (1 - \lambda)\, V_\lambda(s), \quad \forall s \in S \quad \text{(under the unichain assumption)}, \\
h(s) &= \lim_{\lambda \to 1} \big( V_\lambda(s) - V_\lambda(j) \big), \quad \text{for any fixed reference state } j
\end{aligned} \tag{10.9}$$
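
Relation (10.9) can be checked numerically. The sketch below reuses the hypothetical helpers `value_iteration` and `relative_value_iteration` from the earlier sketches in these notes; as $\lambda \to 1$, the printed gaps should shrink toward zero.

```python
# Sketch: numerical check of (10.9) using the hypothetical helpers defined above.
import numpy as np

def check_vanishing_discount(P, r, ref_state=0):
    """Compare (1 - lam) V_lam with g, and V_lam(s) - V_lam(ref) with h."""
    g, h = relative_value_iteration(P, r, ref_state)
    for lam in (0.9, 0.99, 0.999):
        V, _ = value_iteration(P, r, lam)
        gap_g = abs((1 - lam) * V[ref_state] - g)
        gap_h = np.max(np.abs((V - V[ref_state]) - h))
        print(f"lambda={lam}: |(1-lam)V - g| = {gap_g:.2e}, max|dV - h| = {gap_h:.2e}")
```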

10.2 Partially Observed MDP

The key ingredients required to address a partially observed MDP are as follows (a small code sketch collecting these ingredients appears after the list):

1. State space, $S$

2. Action space, $A$

3. Transition probability, $P(x_{t+1} = j \mid x_t = s, u_t = a)$

4. Observation channel, $P(y_{t+1} = y \mid x_t = s, u_t = a)$, where $y_t$ is the observation at time $t$, $x_t$ is the state at time $t$, and $u_t$ is the action at time $t$.

5. Information: $y_1, \dots, y_t$ (perfect observation $\Leftrightarrow$ the observation space equals the state space and $y_t = x_t$ with probability 1)

6. Reward, $r(x_t, u_t)$
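
As a compact reference, the sketch below collects these ingredients into a single container. The class name, field names, and array layouts are illustrative choices, not notation fixed by the lecture.

```python
# A minimal container for the POMDP ingredients listed above (illustrative only).
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    P: np.ndarray   # P[s, a, j] = P(x_{t+1} = j | x_t = s, u_t = a), transition kernel
    O: np.ndarray   # O[s, a, y] = P(y_{t+1} = y | x_t = s, u_t = a), observation channel
    r: np.ndarray   # r[s, a]    = reward for taking action a in state s

    @property
    def n_states(self) -> int:
        return self.r.shape[0]

    @property
    def n_actions(self) -> int:
        return self.r.shape[1]
```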

