
Table 1: Summary of Temporal Difference Learning. Below, n = |S| is the number of states.

| Remarks \ Problem | 2-step | TD(0) | LLSR | TD+LFA (On Policy) |
|---|---|---|---|---|
| Data | (s_t, s'_t)_{t≥0}; s_t = s is fixed | (s_t, r_t, s'_t)_{t≥0}; s'_t = s_{t+1}; s'_t ∼ p_u(s_t, ·) iid | (X_i, y_i)_{i=1}^{n} | (X(s_t), r_t, X(s'_t))_{t≥0}; s'_t = s_{t+1}; s'_t ∼ p_u(s_t, ·) iid |
| Dist. | s'_t ∼ p(s, ·) iid | d_u^T = d_u^T P_u | 1/n | d_u^T = d_u^T P_u |
| Aim | V_* = E[Σ_{t=0}^{1} R(s_t) ∣ s_0 = s, s_1 ∼ p(s, ·)] | V_u(s) = E[Σ_{t=0}^{∞} γ^t r_t ∣ s_0 = s, s_{t+1} ∼ p_u(s_t, ·)] | min_θ ‖X^T θ − Y‖_2^2 | V_u ≈ X^T θ |
| Sol. | V_* = R(s) + Σ_{s'∈S} p(s, s') R(s') | V_u = (I − γP_u)^{-1} R | θ_* = (XX^T)^{-1} XY | A^{-1} b |
| Algo. | V_{t+1} = V_t + α_t (R(s_t) + γR(s'_t) − V_t) | V_{t+1}(s_t) = V_t(s_t) + α_t (R(s_t) + γV_t(s'_t) − V_t(s_t)) | θ_{t+1} = θ_t + α_t X_t (Y_t − X_t^T θ_t) | θ_{t+1} = θ_t + α_t X_t (R(s_t) + γX(s'_t)^T θ_t − X(s_t)^T θ_t) |
| θ ∈ | R^1 | R^n | R^d | R^d |
| A | 1 | D_u (I − γP_u) | (1/n) XX^T | X D_u (I − γP_u) X^T |
| b | V_* | D_u R | (1/n) XY | X D_u R |
| Stab. | Simple | Gershgorin (for any D) | Symmetric Positive Definite | To be proved |
Table 2: On Policy vs Off Policy Comparison. The first two columns are the same as in Table 1; in the off-policy columns, ρ_t denotes the importance-sampling ratio between the target policy π and the behaviour policy µ.

| Remarks \ Problem | TD(0) | TD+LFA (On Policy) | TD(0) (Off Policy) | TD+LFA (Off Policy) |
|---|---|---|---|---|
| Data | (s_t, r_t, s'_t)_{t≥0}; s'_t = s_{t+1}; s'_t ∼ p_u(s_t, ·) iid | (X(s_t), r_t, X(s'_t))_{t≥0}; s'_t = s_{t+1}; s'_t ∼ p_u(s_t, ·) iid | (s_t, r_t, a_t, s'_t)_{t≥0}; s'_t = s_{t+1}; a_t ∼ µ(s_t, ·), s'_t ∼ p_{a_t}(s_t, ·) iid | (X(s_t), r_t, a_t, X(s'_t))_{t≥0}; s'_t = s_{t+1}; a_t ∼ µ(s_t, ·), s'_t ∼ p_{a_t}(s_t, ·) iid |
| Dist. | d_u^T = d_u^T P_u | d_u^T = d_u^T P_u | d_µ^T = d_µ^T P_µ | d_µ^T = d_µ^T P_µ |
| Aim | V_u(s) = E[Σ_{t=0}^{∞} γ^t r_t ∣ s_0 = s, s_{t+1} ∼ p_u(s_t, ·)] | V_u ≈ X^T θ | V_π(s), ∀s ∈ S | V_π ≈ X^T θ |
| Sol. | V_u = (I − γP_u)^{-1} R | A^{-1} b | V_π = (I − γP_π)^{-1} R | |
| Algo. | V_{t+1}(s_t) = V_t(s_t) + α_t (R(s_t) + γV_t(s'_t) − V_t(s_t)) | θ_{t+1} = θ_t + α_t X_t (R(s_t) + γX(s'_t)^T θ_t − X(s_t)^T θ_t) | V_{t+1}(s_t) = V_t(s_t) + α_t ρ_t (R(s_t) + γV_t(s'_t) − V_t(s_t)) | θ_{t+1} = θ_t + α_t ρ_t X_t (R(s_t) + γX(s'_t)^T θ_t − X(s_t)^T θ_t) |
| θ ∈ | R^n | R^d | R^d | R^d |
| A | D_u (I − γP_u) | X D_u (I − γP_u) X^T | D_µ (I − γP_π) | X D_µ (I − γP_π) X^T |
| b | D_u R | X D_u R | D_µ R | X D_µ R |
| Stab. | Gershgorin | To be proved | Gershgorin | Does not converge |
Remarks:

• The first level of understanding is to notice the dimensions and symbols. Sometimes (for TD(0)) we use V and sometimes (LLSR, TD+LFA) we use θ. The difference is that when V is used we are computing a value for each and every state, V(s), ∀s ∈ S; this is known as the tabular (full-state) representation. When we use θ we are approximating the value by V(s) ≈ X(s)^T θ (function approximation). Collecting the V(s) for all the states gives a vector V ∈ R^n. Collecting all the X(s)^T as the rows of a matrix, X^T = [X(s_1)^T; X(s_2)^T; … ; X(s_n)^T], we write V ≈ X^T θ.
• Note that X is a d × n matrix, and X^T is an n × d matrix. Further, we can recover the tabular case from the function-approximation case by choosing d = n and X = I_{n×n}, where I_{n×n} is the n × n identity matrix; a small numerical sketch of this special case is given below.
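To make the last remark concrete, here is a minimal sketch in Python/NumPy (the MDP, the discount factor γ and the constant step size are illustrative assumptions, not objects from the notes): it runs the TD+LFA update of Table 1 with d = n and X = I_{n×n} and checks that the iterate approaches the tabular solution V_u = (I − γP_u)^{-1} R from the same table.

```python
import numpy as np

# Tabular TD(0) recovered as the special case of TD+LFA with d = n and X = I.
# The MDP (P_u, R), gamma and the constant step size are illustrative choices.

rng = np.random.default_rng(0)
n, gamma = 4, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # P_u: row-stochastic transition matrix
R = rng.random(n)                                           # reward vector R(s)
X = np.eye(n)                                               # X = I, so V(s) = X(s)^T theta = theta[s]

theta = np.zeros(n)                                         # one entry per state: the tabular values
for _ in range(100_000):
    s = rng.integers(n)         # states sampled uniformly for simplicity; the tabular fixed point is still V_u
    s_next = rng.choice(n, p=P[s])                          # s' ~ p_u(s, .)
    td_error = R[s] + gamma * X[:, s_next] @ theta - X[:, s] @ theta
    theta += 0.02 * X[:, s] * td_error                      # TD+LFA update from Table 1

V_exact = np.linalg.solve(np.eye(n) - gamma * P, R)         # V_u = (I - gamma P_u)^{-1} R
print(np.max(np.abs(theta - V_exact)))                      # small: within the constant-step noise floor
```

With a constant step size the iterate only settles in a neighbourhood of V_u; this residual noise is exactly what the error analysis below quantifies.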

1 Error Analysis
Our aim is to analyse the algorithms in Tables 1 and 2.

Steps in the Analysis:

• First we have to write all the algorithms in Tables 1 and 2 in a common format. Luckily, we can actually do this; see Equation (1).
• Then we analyse the error e_t, which is the difference between the current iterate and the final solution.
• Put down the conditions under which the algorithms will actually converge to the solution.
• Study the behaviour of the error term et . It will tell us the mode of convergence.

1.1 Common format

All algorithms in Tables 1 and 2 can be written in the format below:


θ_{t+1} = θ_t + α_t (b − A θ_t + N_t)                                        (1)

How to obtain the common format in Equation (1): If we look carefully, every algorithm has the form θ_{t+1} = θ_t + α_t (·). Denote the term inside the bracket by (·). We then write (·) = E[(·)] + ((·) − E[(·)]). In all cases, E[(·)] = b − A θ_t (the exact b and A change from algorithm to algorithm, as shown in Tables 1 and 2), and N_t = (·) − E[(·)] is a zero-mean noise term. If the noise term were not there, the algorithm would be deterministic, i.e., it would be the same from run to run.
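As a quick numerical check of the claim E[(·)] = b − A θ_t, the sketch below builds a small random MDP (the transition matrix, rewards, features, γ and θ are illustrative assumptions, not objects from the notes), forms A = X D_u (I − γP_u) X^T and b = X D_u R as in Table 1, and verifies that the expected TD+LFA update is exactly b − Aθ.

```python
import numpy as np

# Check that the TD+LFA update from Table 1 has mean b - A*theta, with
# A = X D_u (I - gamma*P_u) X^T and b = X D_u R.

rng = np.random.default_rng(0)
n, d, gamma = 5, 2, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # P_u: row-stochastic transition matrix
R = rng.random(n)                                           # reward vector R(s)
X = rng.random((d, n))                                      # X is d x n; column s is the feature X(s)

# stationary distribution d_u of P_u:  d_u^T = d_u^T P_u
evals, evecs = np.linalg.eig(P.T)
d_u = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d_u /= d_u.sum()
D_u = np.diag(d_u)

A = X @ D_u @ (np.eye(n) - gamma * P) @ X.T
b = X @ D_u @ R

theta = rng.random(d)
# exact expectation of the sampled update X(s_t)(R(s_t) + gamma*X(s'_t)^T theta - X(s_t)^T theta),
# with s_t ~ d_u and s'_t ~ p_u(s_t, .)
mean_update = sum(d_u[s] * X[:, s] * (R[s] + gamma * (P[s] @ (X.T @ theta)) - X[:, s] @ theta)
                  for s in range(n))
print(np.allclose(mean_update, b - A @ theta))              # True
```

The same construction with X = I_{n×n} gives the A and b of the tabular TD(0) column.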

1.2 Dynamics of Error

We have to make some assumptions in order to carry out our analysis.


Assumption 1.1. A is invertible.

To make the analysis simpler, we let α_t = α > 0, i.e., a fixed constant step-size, and look at the error terms. The constant step-size is an important simplification: we can carry out the analysis for a constant step-size and later infer what happens when we change the step-size, and indeed how it should be changed (should we increase or decrease it, and if so, by how much, etc.).
Let θ_* = A^{-1} b and define e_t = θ_t − θ_*; we need to study how this error behaves.
θ_{t+1} − θ_* = θ_t − θ_* + α (b − A(θ_t − θ_* + θ_*) + N_t)                 (2)
(I)    e_{t+1} = e_t + α (b − A e_t − A θ_*) + α N_t                         (3)
(II)   e_{t+1} = (I − αA) e_t + α N_t                                        (4)
(III)  e_{t+1} = (I − αA)^{t+1} e_0 + α Σ_{s=0}^{t} (I − αA)^s N_{t−s}       (5)

In the above:
(I) follows from applying the definition of e_t = θ_t − θ_*.
(II) follows from b = A θ_*.
(III) follows from unrolling the recursion.
From Equation (5) we can infer that (I − αA)^{t+1} should not go to ∞ as t → ∞. Clearly, this does not hold for every matrix, so we need a further assumption to ensure convergence.
Assumption 1.2. All eigenvalues of A have positive real parts.

Note: We will talk about eigenvalues and why they are useful a little later.
Assumption 1.2 (together with a small enough step-size α) ensures that the term (I − αA)^{t+1} does not blow up. However, it is not yet clear how e_t actually behaves.
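The following small simulation (the matrix A, the noise level and the step sizes are illustrative assumptions) iterates the error recursion (4) and shows the two regimes: with a suitable step size the error decays down to a noise floor, while with too large a step size it blows up.

```python
import numpy as np

# Error recursion (4): e_{t+1} = (I - alpha*A) e_t + alpha*N_t, for a matrix A
# satisfying Assumption 1.2 (eigenvalues 1 and 2, both with positive real parts).

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.3],
              [0.0, 2.0]])
sigma = 0.1                               # scale of the zero-mean noise N_t

for alpha in (0.05, 1.2):                 # a stable and an unstable step size
    e = np.array([5.0, -3.0])             # e_0
    for _ in range(500):
        e = (np.eye(2) - alpha * A) @ e + alpha * sigma * rng.standard_normal(2)
    print(f"alpha={alpha}: ||e_500|| = {np.linalg.norm(e):.3e}")
```

The size of the noise floor in the stable regime, and how it depends on α, is what the scalar analysis of the next subsection makes precise.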

1.3 Learning the mean of a random variable

Let us say the mean of a random variable is θ_*, and each sample of the random variable is nothing but θ_* + N_t. Now we have b = θ_* and A = 1, so:

θ_{t+1} = θ_t + α (θ_* − θ_t + N_t)                                          (6)


(IV)   e_{t+1} = (1 − α)^{t+1} e_0 + α Σ_{s=0}^{t} (1 − α)^s N_{t−s}         (7)

where (IV) follows from unrolling the recursion.


The mean behaviour of et :
E[e_{t+1}] = E[(1 − α)^{t+1} e_0 + α Σ_{s=0}^{t} (1 − α)^s N_{t−s}]          (8)
(V)    E[e_{t+1}] = (1 − α)^{t+1} e_0                                        (9)
where (V) follows from the fact that E[N_t] = 0.
The mean squared behaviour of et :
(VI)   E[e_{t+1}^2] = E[((1 − α)^{t+1} e_0)^2] + E[(α Σ_{s=0}^{t} (1 − α)^s N_{t−s})^2] + 2 E[(1 − α)^{t+1} e_0 · α Σ_{s=0}^{t} (1 − α)^s N_{t−s}]          (10)
       E[e_{t+1}^2] = (1 − α)^{2(t+1)} e_0^2 + α^2 σ^2 Σ_{s=0}^{t} (1 − α)^{2s}          (11)
(VII)  E[e_{t+1}^2] < ρ^t e_0^2 + α^2 σ^2 · 1/(1 − ρ)                                    (12)

where (VI) is just (a + b)^2 = a^2 + b^2 + 2ab with a = (1 − α)^{t+1} e_0 and b = α Σ_{s=0}^{t} (1 − α)^s N_{t−s}. The cross term 2ab has zero expectation because e_0 and the (1 − α) factors are constants and E[N_t] = 0 for every t. Equation (11) follows by taking the expectations term by term, using that the noise terms are zero mean and uncorrelated across time, and writing σ^2 = E[N_t^2] for the variance of the noise (the remaining sum is a geometric series). Note that for 0 < α < 2 we have ρ = (1 − α)^2 < 1, so ρ^t goes down to 0 as t increases.
(VII): the inequality holds because Σ_{s=0}^{t} ρ^s < Σ_{s=0}^{∞} ρ^s = 1/(1 − ρ). In (12), the first term corresponds to forgetting the initial condition and the second to the effect of the noise.
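A short Monte Carlo check of these two formulas (θ_*, σ, α and the horizon are illustrative choices): averaging over many independent runs of the constant step-size estimator, the mean error matches (1 − α)^t e_0 (both essentially zero at this horizon) and the mean squared error settles near α^2 σ^2 / (1 − ρ).

```python
import numpy as np

# Learning the mean theta* with a constant step size: empirical check of Equations (9) and (12).

rng = np.random.default_rng(2)
theta_star, sigma, alpha, T, runs = 3.0, 1.0, 0.1, 300, 5000

e = np.full(runs, 0.0 - theta_star)            # e_0 = theta_0 - theta*, starting from theta_0 = 0
for _ in range(T):
    noise = sigma * rng.standard_normal(runs)  # N_t, zero mean, variance sigma^2
    e = (1 - alpha) * e + alpha * noise        # error recursion behind Equation (7)

rho = (1 - alpha) ** 2
print("mean error     :", e.mean(), "vs", (1 - alpha) ** T * (0.0 - theta_star))   # both ~ 0
print("mean sq. error :", (e ** 2).mean(), "vs", alpha ** 2 * sigma ** 2 / (1 - rho))
```

Shrinking α lowers the noise floor but slows down the forgetting of the initial condition, which is the trade-off behind step-size schedules.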

1.4 Connection between eigenvalues, recursion, learning the mean

The only difference between the expressions in Equation (7) and Equation (5) is that in the case of learning the mean the quantities are scalars, while in the general case they are vectors and matrices. In the learning-the-mean case we have b = θ_* and A = 1, which is a very special case.
Convergence: Notice that the (1 − α) term plays the same role as (I − αA). We know that for 0 < α < 2, (1 − α)^t → 0. We expect the general case to behave like this special case, i.e., learning the mean, in which A = 1 and (I − αA) = (1 − α).
Growth of a general matrix M^t and eigenvalues: From linear algebra (the Jordan decomposition, to be specific), we know that any d × d matrix can be expressed as
M = U Λ U^{-1}                                                               (13)
M^t = (U Λ U^{-1})(U Λ U^{-1}) ⋯ (U Λ U^{-1})   (t times)                    (14)
M^t = U Λ^t U^{-1},                                                          (15)

where Λ is a Jordan matrix, which contains the eigenvalues on the main diagonal, with the entries immediately above the diagonal equal to either 0 or 1, and U contains the (generalised) eigenvectors of M. Further, M and Λ have the same eigenvalues. One can show (exercise) that the rate of growth of Λ^t depends on its largest eigenvalue in modulus. So, if the modulus of the largest eigenvalue is less than 1, both Λ^t and M^t eventually go to the 0 matrix.
Growth of a symmetric matrix M^t and eigenvalues: A symmetric matrix has special structure: it can be decomposed as

M = U Λ U^{-1},                                                              (16)

where U^{-1} = U^T, i.e., U^T U = U U^T = I. Also, Λ is a diagonal matrix.
Note: In our algorithms we have M^t = (I − αA)^t. This is why we need the eigenvalues of A to have positive real parts; if not, then I − αA has an eigenvalue whose real part is at least 1, its modulus is therefore at least 1, and the matrix M^t = (I − αA)^t will not decay (in general it will blow up).
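A small sketch of both points (the matrix M is an arbitrary illustrative choice; for simplicity it uses the eigendecomposition, i.e., the diagonalisable case of the Jordan form): it checks M^t = U Λ^t U^{-1} numerically and shows that the growth rate of ‖M^t‖ is governed by the largest eigenvalue modulus (the spectral radius).

```python
import numpy as np

# M^t via its eigendecomposition, and the growth rate of ||M^t||.

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))           # a generic (almost surely diagonalisable) matrix
evals, U = np.linalg.eig(M)               # M = U diag(evals) U^{-1}

t = 8
Mt_direct = np.linalg.matrix_power(M, t)
Mt_eig = U @ np.diag(evals ** t) @ np.linalg.inv(U)
print(np.allclose(Mt_direct, Mt_eig))     # True (up to round-off)

rho = max(abs(evals))                     # spectral radius: largest eigenvalue modulus
for t in (5, 20, 80):
    growth = np.linalg.norm(np.linalg.matrix_power(M, t)) ** (1 / t)
    print(t, round(float(growth), 3), "vs spectral radius", round(float(rho), 3))
```

Replacing M by I − αA: if all eigenvalues of A have positive real parts and α is small enough, the spectral radius of I − αA drops below 1 and (I − αA)^t decays to the zero matrix.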

1.5 Matrix decomposition and error recursion

Let A = U Λ U^{-1}, so that U^{-1} (I − αA) U = I − αΛ. Multiplying Equation (5) on the left by U^{-1} and inserting U U^{-1} where needed gives

U^{-1} e_{t+1} = U^{-1} (I − αA)^{t+1} U U^{-1} e_0 + α Σ_{s=0}^{t} U^{-1} (I − αA)^s U U^{-1} N_{t−s}          (18)
η_{t+1} = (I − αΛ)^{t+1} η_0 + α Σ_{s=0}^{t} (I − αΛ)^s ζ_{t−s},                                                (19)

where η_t = U^{-1} e_t and ζ_t = U^{-1} N_t are the error and the noise in the new basis (given by the matrix U). Here, we use the fact that Λ = U^{-1} A U.
The speciality of Equation (19) is that it is like a diagonal system, i.e., the behaviour of the error in the new basis depends on (I − αΛ)^t, which in turn depends on the entries on its diagonal.
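A minimal sketch of this change of basis (A, α and the noise level are illustrative assumptions, and A is taken diagonalisable so that Λ is genuinely diagonal): it runs the error recursion in the original basis and, coordinate-wise, in the basis given by U, and checks that the two agree at every step.

```python
import numpy as np

# Error recursion in the original basis vs. the basis given by U, where A = U Lambda U^{-1}.
# Because A is diagonalisable here, the recursion in the new basis decouples into scalar
# recursions, one per eigenvalue of A.

rng = np.random.default_rng(4)
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])                # eigenvalues 2.5 and 1.0, both with positive real parts
evals, U = np.linalg.eig(A)               # A = U diag(evals) U^{-1}
U_inv = np.linalg.inv(U)
alpha = 0.1

e = np.array([1.0, -2.0])                 # e_0
eta = U_inv @ e                           # eta_0 = U^{-1} e_0
for _ in range(50):
    N = 0.05 * rng.standard_normal(2)                        # noise N_t
    e = (np.eye(2) - alpha * A) @ e + alpha * N               # Equation (4), original basis
    eta = (1 - alpha * evals) * eta + alpha * (U_inv @ N)     # Equation (19), coordinate-wise
print(np.allclose(U_inv @ e, eta))        # True: eta_t = U^{-1} e_t throughout
```

Each coordinate of η_t therefore behaves like the scalar learning-the-mean recursion, with (1 − α) replaced by (1 − αλ_i) for the corresponding eigenvalue λ_i of A.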
