10.1 Unichain
Definition 10.1.1. An MDP is said to be unichain if every deterministic stationary policy yields a Markov chain with a single communicating class of recurrent states and a set of transient states, possibly empty.
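In practice, the unichain condition can be checked policy by policy: fix a deterministic stationary policy, form the induced Markov chain, and verify that it has exactly one closed communicating class (the recurrent class). Below is a minimal sketch, not from the notes: the transition tensor layout `P[s, a, j]`, the integer policy array, and the helper name `is_unichain_under_policy` are all our assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def is_unichain_under_policy(P, policy):
    """Return True if the Markov chain induced by `policy` has exactly one
    closed communicating (i.e., recurrent) class."""
    S = P.shape[0]
    # Transition matrix of the induced chain: P_pi[s, j] = P(j | s, policy(s)).
    P_pi = P[np.arange(S), policy, :]
    # Strongly connected components of the graph with an edge where prob > 0.
    n_comp, labels = connected_components(P_pi > 0, directed=True,
                                          connection='strong')
    # A communicating class is recurrent iff it is closed: no mass leaves it.
    closed = 0
    for c in range(n_comp):
        members = labels == c
        if P_pi[members][:, ~members].sum() == 0:
            closed += 1
    return closed == 1

# Example: a 3-state, 2-action MDP where this policy funnels into state 0.
P = np.zeros((3, 2, 3))
P[:, 0, 0] = 1.0                      # action 0: jump to state 0
P[:, 1, :] = [[0.5, 0.5, 0.0]] * 3    # action 1: mix over states 0 and 1
print(is_unichain_under_policy(P, np.array([0, 1, 1])))  # True
```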
Here $g$ is a constant (the average reward, or gain) and $h$ is a function of the state (the bias). We need to find $g$ and $h$ that satisfy equation (10.3), which is referred to as the Optimality Equation (more fully, the Average Cost Optimality Equation, ACOE).
Theorem 10.1. Consider a finite state-action unichain MDP with any reward. Then any deterministic stationary Markov optimal policy $d^\infty$ satisfies

$$g + h(s) = \max_{a \in A_s} \Big\{ r(s,a) + \sum_{j \in S} P(j \mid s,a)\, h(j) \Big\} \tag{10.4}$$

and

$$d(s) \in \operatorname*{argmax}_{a \in A_s} \Big\{ r(s,a) + \sum_{j \in S} P(j \mid s,a)\, h(j) \Big\} \tag{10.5}$$
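A standard way to solve (10.4)-(10.5) numerically is relative value iteration, which applies the Bellman update and renormalizes against a reference state so that the offset converges to $g$ and the residual vector to $h$. A minimal sketch, assuming the same `P[s, a, j]` tensor and a reward table `r[s, a]`; the function name is ours, and convergence assumes the induced chains are aperiodic.

```python
import numpy as np

def relative_value_iteration(P, r, ref_state=0, tol=1e-8, max_iter=10_000):
    """Solve (10.4) by relative value iteration.
    P[s, a, j]: transition probabilities; r[s, a]: rewards.
    Returns (g, h, d): gain, bias vector, and the greedy policy of (10.5)."""
    S, A = r.shape
    h = np.zeros(S)
    for _ in range(max_iter):
        Q = r + P @ h                  # Q[s, a] = r(s,a) + sum_j P(j|s,a) h(j)
        h_new = Q.max(axis=1)          # Bellman update: the RHS of (10.4)
        g = h_new[ref_state]           # offset at the reference state estimates g
        h_new = h_new - g              # keep h bounded (h is defined up to a constant)
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    d = Q.argmax(axis=1)               # greedy policy, as in (10.5)
    return g, h, d
```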
Theorem 10.2. For a finite state-action unichain MDP, there exists a solution to the ACOE, and the resulting $d^\infty$ is optimal.

The pair $(g, h)$ can be computed by the following linear program:

$$\begin{aligned} \min_{g,\,h} \quad & g \\ \text{s.t.} \quad & g + h(s) \ge r(s,a) + \sum_{j \in S} P(j \mid s,a)\, h(j), \quad \forall a \in A_s,\ \forall s \in S \end{aligned} \tag{10.6}$$
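The program (10.6) can be handed directly to an off-the-shelf LP solver. A minimal sketch using `scipy.optimize.linprog`, stacking the variables as $z = (g, h(0), \dots, h(S-1))$; this encoding, and pinning $h(0) = 0$ (harmless, since $h$ is only determined up to an additive constant), are our choices, not from the notes.

```python
import numpy as np
from scipy.optimize import linprog

def solve_acoe_lp(P, r):
    """Solve the primal LP (10.6). Variables: z = [g, h(0), ..., h(S-1)]."""
    S, A = r.shape
    c = np.zeros(1 + S)
    c[0] = 1.0                                  # objective: min g
    # Rewrite each constraint as -g - h(s) + sum_j P(j|s,a) h(j) <= -r(s,a).
    A_ub, b_ub = [], []
    for s in range(S):
        for a in range(A):
            row = np.zeros(1 + S)
            row[0] = -1.0                       # coefficient of g
            row[1:] = P[s, a, :]                # coefficients of h(j)
            row[1 + s] -= 1.0                   # subtract h(s)
            A_ub.append(row)
            b_ub.append(-r[s, a])
    # Pin h(0) = 0 so the solution is unique.
    A_eq = np.zeros((1, 1 + S)); A_eq[0, 1] = 1.0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[0.0],
                  bounds=[(None, None)] * (1 + S))   # g and h are free variables
    return res.x[0], res.x[1:]                  # (g, h)
```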
The dual of (10.6) is a linear program over occupation measures $x(s,a)$:

$$\begin{aligned} \max_{x} \quad & \sum_{s \in S}\sum_{a \in A_s} r(s,a)\, x(s,a) \\ \text{s.t.} \quad & \sum_{a \in A_j} x(j,a) = \sum_{s \in S}\sum_{a \in A_s} x(s,a)\, P(j \mid s,a), \quad \forall j \in S, \\ & \sum_{s \in S}\sum_{a \in A_s} x(s,a) = 1, \\ & x(s,a) \ge 0. \end{aligned} \tag{10.7}$$
The left-hand side of the first constraint in (10.7) is the stationary probability of being in state $j$, and the right-hand side is the probability of transitioning into state $j$ when the current state-action pair $(s, a)$ is drawn from the same distribution $x$.
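The dual (10.7) is equally easy to hand to a solver, and its solution is directly useful: normalizing $x(s,\cdot)$ row by row recovers a stationary policy $\pi(a \mid s) = x(s,a) / \sum_{a'} x(s,a')$. A minimal sketch, with $x$ flattened in row-major order (our encoding):

```python
import numpy as np
from scipy.optimize import linprog

def solve_dual_lp(P, r):
    """Solve the dual LP (10.7) over occupation measures x(s, a),
    then recover a stationary policy pi(a|s) = x(s,a) / sum_a x(s,a)."""
    S, A = r.shape
    c = -r.reshape(-1)                 # linprog minimizes; negate to maximize
    # Flow balance: sum_{s,a} P(j|s,a) x(s,a) - sum_a x(j,a) = 0 for all j.
    A_eq = np.zeros((S + 1, S * A))
    for j in range(S):
        A_eq[j, :] = P[:, :, j].reshape(-1)        # inflow into state j
        A_eq[j, j * A:(j + 1) * A] -= 1.0          # minus outflow from state j
    A_eq[S, :] = 1.0                               # normalization: sum x(s,a) = 1
    b_eq = np.zeros(S + 1); b_eq[S] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
    x = res.x.reshape(S, A)
    g = float(x.ravel() @ r.ravel())               # optimal average reward
    # On transient states x(s, .) is zero; fall back to a uniform policy there.
    mass = x.sum(axis=1, keepdims=True)
    pi = np.where(mass > 1e-12, x / np.maximum(mass, 1e-12), 1.0 / A)
    return g, pi
```

On transient states the occupation measure carries no information, so any policy choice there is harmless; the uniform fallback above is one arbitrary convention.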
In the constrained average-reward problem, the objective is the long-run average reward

$$\max \ \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[r(X_t, Y_t)] \tag{10.8}$$

such that

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[C(X_t, Y_t)] \le \alpha,$$

where $C$ is a cost function and $\alpha$ is a given cost budget.
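In the occupation-measure view, the constrained problem (10.8) remains a linear program: it is (10.7) plus one extra linear constraint $\sum_{s,a} C(s,a)\, x(s,a) \le \alpha$. A minimal sketch (the cost table `C` and budget `alpha` are assumed inputs; note that the optimal constrained policy may genuinely need to randomize):

```python
import numpy as np
from scipy.optimize import linprog

def solve_constrained_mdp(P, r, C, alpha):
    """LP for (10.8): maximize average reward subject to average cost <= alpha,
    over occupation measures x(s, a) as in (10.7)."""
    S, A = r.shape
    c = -r.reshape(-1)                             # negate to maximize
    A_eq = np.zeros((S + 1, S * A))
    for j in range(S):
        A_eq[j, :] = P[:, :, j].reshape(-1)        # inflow into state j
        A_eq[j, j * A:(j + 1) * A] -= 1.0          # minus outflow from state j
    A_eq[S, :] = 1.0                               # normalization: sum x(s,a) = 1
    b_eq = np.zeros(S + 1); b_eq[S] = 1.0
    # Extra constraint: sum_{s,a} C(s,a) x(s,a) <= alpha.
    res = linprog(c, A_ub=C.reshape(1, -1), b_ub=[alpha],
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
    x = res.x.reshape(S, A)
    return float(x.ravel() @ r.ravel()), x         # (optimal reward, occupation measure)
```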
The value function for the discounted reward is denoted $V_\lambda(s)$ and is governed by equation (10.1), whereas in the average-reward case the value function is described by $g$ and $h$ and is governed by equation (10.3); in fact, for unichain MDPs the two are linked by $V_\lambda(s) = \frac{g}{1-\lambda} + h(s) + o(1)$ as $\lambda \to 1$.
The key ingredients required to address a partially observed MDP are as follows:
1. State space, $S$
2. Action space, $A$
3. Transition probability, $P(x_{t+1} = j \mid x_t = s, u_t = a)$
4. Observation channel, $P(y_{t+1} = y \mid x_t = s, u_t = a)$, where $y_t$ is the observation, $x_t$ the state, and $u_t$ the action at time $t$ (these are used in the belief-update sketch below).
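These four ingredients are exactly what a Bayesian filter needs: the belief $b_t(s) = P(x_t = s \mid y_{1:t}, u_{0:t-1})$ is updated after each action-observation pair. A minimal sketch using the notes' channel convention $P(y_{t+1} = y \mid x_t = s, u_t = a)$, with dense arrays `P[s, a, j]` and `O[s, a, y]`; the array names and layout are our assumptions.

```python
import numpy as np

def belief_update(b, a, y, P, O):
    """One step of the Bayesian filter for the belief b(s) = P(x_t = s | history).
    With the channel convention O[s, a, y] = P(y_{t+1} = y | x_t = s, u_t = a),
    the update is b'(j) proportional to sum_s b(s) * O[s, a, y] * P(j | s, a)."""
    unnormalized = (b * O[:, a, y]) @ P[:, a, :]   # shape (S,)
    return unnormalized / unnormalized.sum()

# Two-state, one-action example: a noisy observation of the current state.
P = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])        # P[s, 0, j]
O = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])        # O[s, 0, y]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, y=1, P=P, O=O))       # posterior belief over states
```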