Thomas Furmston
For example, the Gibbs (soft-max) policy,
\[
\pi(a \mid s; w) = \frac{e^{w^\top \phi(a,s)}}{\sum_{a' \in \mathcal{A}} e^{w^\top \phi(a',s)}}, \qquad (2)
\]
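A minimal numpy sketch of this soft-max policy; the feature map `phi(a, s)` and the action set are hypothetical placeholders, not from the slides:

```python
import numpy as np

def softmax_policy(w, s, actions, phi):
    """Return pi(a|s; w) of Eq. (2) for every action in `actions`.

    `phi(a, s)` is an assumed feature map returning a vector of the
    same length as the parameter vector w.
    """
    logits = np.array([w @ phi(a, s) for a in actions])
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```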
Similarly, \(\forall w \in \mathcal{W}\), we have,
\[
\begin{aligned}
V(s; w) &:= V^{\pi(\cdot\mid\cdot;w)}(s), && \forall s \in \mathcal{S},\\
Q(s, a; w) &:= Q^{\pi(\cdot\mid\cdot;w)}(s, a), && \forall (s, a) \in \mathcal{S} \times \mathcal{A},\\
p_t(s, a; w) &:= p_t^{\pi(\cdot\mid\cdot;w)}(s, a), && \forall (s, a) \in \mathcal{S} \times \mathcal{A}.
\end{aligned}
\]
Policy Search Methods
Benefits include:
- General convergence guarantees.
- Good anytime performance.
- It is only necessary to approximate a low-dimensional projection of the value function.
- Easily extended to models for partially observable environments, such as finite state controllers.
Policy Gradient Theorem
\[
\nabla_w U(w) = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} p_\gamma(s, a; w)\, Q(s, a; w)\, \nabla_w \log \pi(a \mid s; w),
\]
in which,
\[
p_\gamma(s, a; w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(s, a; w).
\]
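To make the theorem concrete, here is a sketch of the standard Monte-Carlo (REINFORCE-style) estimator of this gradient, reusing the `softmax_policy` helper above; the episode format is an assumption:

```python
import numpy as np

def grad_log_softmax(w, s, a, actions, phi):
    """grad_w log pi(a|s; w) for the soft-max policy of Eq. (2):
    phi(a, s) minus the expected feature vector under pi."""
    probs = softmax_policy(w, s, actions, phi)
    expected_phi = sum(p * phi(b, s) for p, b in zip(probs, actions))
    return phi(a, s) - expected_phi

def policy_gradient_estimate(episode, w, actions, phi, gamma):
    """Monte-Carlo estimate of grad_w U(w) from one sampled episode.

    `episode` is a list of (s_t, a_t, r_t) triples; the discounted
    return from time t stands in for Q(s_t, a_t; w), and gamma**t
    implements the gamma**(t-1) weighting of p_gamma (0-indexed here).
    """
    G, returns = 0.0, []
    for (_, _, r) in reversed(episode):      # accumulate returns backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    grad = np.zeros_like(w)
    for t, ((s, a, _), G_t) in enumerate(zip(episode, returns)):
        grad += gamma**t * G_t * grad_log_softmax(w, s, a, actions, phi)
    return grad
```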
An On-line Policy Search Method - Version 1
Algorithm 2: pseudo-code for an on-line policy search method.
Compatible Function Approximation
with,
\[
T(v; w) = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} p_\gamma(s, a; w) \bigl(Q(s, a; w) - f(s, a; v)\bigr)^2, \qquad (5)
\]
then,
\[
\nabla_w U(w) = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} p_\gamma(s, a; w)\, f_w(s, a; v^*)\, \nabla_w \log \pi(a \mid s; w),
\]
where \(v^* := \operatorname*{argmin}_v T(v; w)\).
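The compatible approximator is linear in the score, \(f_w(s, a; v) = v^\top \nabla_w \log \pi(a \mid s; w)\), so minimising a sampled version of \(T(v; w)\) is ordinary least squares. A sketch, assuming (s, a, q) samples in which q estimates \(Q(s, a; w)\):

```python
import numpy as np

def fit_compatible_critic(samples, w, actions, phi):
    """Least-squares minimiser of a sampled version of T(v; w), Eq. (5),
    for the compatible critic f_w(s, a; v) = v.T @ grad_log_pi(s, a).

    `samples` is an assumed list of (s, a, q) triples, where q estimates
    Q(s, a; w) (e.g. a sampled discounted return).
    """
    X = np.array([grad_log_softmax(w, s, a, actions, phi)
                  for (s, a, _) in samples])
    q = np.array([q_hat for (_, _, q_hat) in samples])
    v, *_ = np.linalg.lstsq(X, q, rcond=None)
    return v
```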
An On-line Policy Search Method - Version 2
1: Sample initial state: \(s_1 \sim D(\cdot)\)
2: Sample action: \(a_1 \sim \pi(\cdot \mid s = s_1; w_1)\)
3: for \(t = 1\) to \(\infty\) do
4:   Obtain reward: \(r_t = R(s_t, a_t)\)
5:   Sample next state: \(s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)\)
6:   Sample next action: \(a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)\)
7:   Optimise function approximation: \(v_t = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w_t)\)
8:   Update policy: \(w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}\)
9: end for
Algorithm 3: pseudo-code for an on-line policy search method.
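A runnable sketch of this loop, reusing the helpers above. The environment interface (`env.reset()`, `env.step(s, a) -> (r, s_next)`), the sliding sample window, and the use of the one-step reward as a crude Q-estimate are all assumptions made to keep the example self-contained:

```python
import numpy as np

def online_policy_search(env, phi, actions, n_params,
                         alpha=0.05, n_steps=10_000):
    """Sketch of Algorithm 3 with a hypothetical environment object."""
    w = np.zeros(n_params)
    window = []                                   # recent (s, a, q) samples
    s = env.reset()                               # step 1: s1 ~ D(.)
    a = np.random.choice(len(actions), p=softmax_policy(w, s, actions, phi))
    for _ in range(n_steps):                      # step 3
        r, s_next = env.step(s, actions[a])       # steps 4-5
        a_next = np.random.choice(                # step 6
            len(actions), p=softmax_policy(w, s_next, actions, phi))
        window.append((s, actions[a], r))         # crude stand-in for Q
        window = window[-200:]
        v = fit_compatible_critic(window, w, actions, phi)   # step 7
        g = grad_log_softmax(w, s, actions[a], actions, phi)
        w = w + alpha * (v @ g) * g               # step 8: f = v.T @ g
        s, a = s_next, a_next
    return w
```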
Actor-Critic Methods
Rather than solving
\[
v^* = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w)
\]
to completion at every iteration, actor-critic methods update the critic parameters incrementally,
\[
v_{t+1} = v_t + \beta_t\, g(w_t),
\]
in which
- \(g(w_t)\) is a step direction in the function-approximation parameter space (algorithm dependent);
- \(\{\beta_t\}_{t=1}^{\infty}\) is a step-size sequence for the function-approximation parameters.
Actor-Critic Methods
In this case the critic update at the \(t\)-th iteration takes the form,
\[
v_{t+1} = v_t + \beta_t\, \delta_t\, \nabla_v f_{w_t}(s_t, a_t; v_t), \qquad \delta_t = r_t + \gamma f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t),
\]
while the policy update is unchanged:
8: Update policy: \(w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}\)
9: end for
Algorithm 4: pseudo-code for TD(0) actor-critic algorithm.
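A sketch of one TD(0) actor-critic iteration under these updates, with the compatible critic \(f_w(s, a; v) = v^\top \nabla_w \log \pi(a \mid s; w)\); constant step-sizes alpha and beta stand in for the slide's sequences:

```python
def td0_actor_critic_step(w, v, s, a, r, s_next, a_next,
                          actions, phi, alpha, beta, gamma):
    """One TD(0) actor-critic iteration (a sketch, not the exact slides)."""
    g = grad_log_softmax(w, s, a, actions, phi)           # critic features
    g_next = grad_log_softmax(w, s_next, a_next, actions, phi)
    delta = r + gamma * (v @ g_next) - (v @ g)            # TD(0) error
    v = v + beta * delta * g                              # critic update
    w = w + alpha * (v @ g) * g                           # actor (step 8)
    return w, v
```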
Actor-Critic Methods
\[
w_{\text{new}} = w + G^{-1}(w)\, \nabla_w U(w),
\]
in which \(G(w)\) is the Fisher information matrix of the policy distribution, averaged over the state distribution, i.e.,
\[
G(w) = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} p_\gamma(s, a; w)\, \nabla_w \log \pi(a \mid s; w)\, \nabla_w \log \pi(a \mid s; w)^\top.
\]
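In practice \(G(w)\) is estimated from sampled state-action pairs. A small sketch, with a ridge term `reg` (an addition here, not on the slides) to keep the solve well-conditioned:

```python
import numpy as np

def natural_gradient(samples, grad_U, w, actions, phi, reg=1e-6):
    """Estimate the Fisher matrix G(w) from (s, a) samples and return
    the natural-gradient direction G(w)^{-1} grad_U."""
    G = np.zeros((len(w), len(w)))
    for (s, a) in samples:
        g = grad_log_softmax(w, s, a, actions, phi)
        G += np.outer(g, g)          # grad log pi times its transpose
    G /= len(samples)
    return np.linalg.solve(G + reg * np.eye(len(w)), grad_U)
```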
Natural Actor-Critic
8: Update policy: \(w_{t+1} = w_t + \alpha_t v_t\)
9: end for
Algorithm 5: pseudo-code for TD(0) natural actor-critic algorithm.
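The policy step uses the critic weights directly because, for the compatible approximator, the natural gradient satisfies \(G^{-1}(w)\nabla_w U(w) = v^*\), so no Fisher matrix is ever formed or inverted. A sketch of one such iteration, again with constant step-sizes:

```python
def natural_actor_critic_step(w, v, s, a, r, s_next, a_next,
                              actions, phi, alpha, beta, gamma):
    """One TD(0) natural actor-critic iteration (sketch): same critic
    update as before, but the actor moves straight along v."""
    g = grad_log_softmax(w, s, a, actions, phi)
    g_next = grad_log_softmax(w, s_next, a_next, actions, phi)
    delta = r + gamma * (v @ g_next) - (v @ g)    # TD(0) error
    v = v + beta * delta * g                      # critic update
    w = w + alpha * v                             # step 8: natural grad = v
    return w, v
```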
Tetris
As an example of policy gradients in action we consider the
Tetris domain.
Total of 21 features.
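A common 21-feature set for Tetris is the Bertsekas-Ioffe one: 10 column heights, 9 adjacent height differences, the maximum height, and the number of holes. Assuming that set is the one meant here, a sketch:

```python
import numpy as np

def tetris_features(board):
    """21 Bertsekas-Ioffe style features for a 10-column Tetris board
    (board[r][c] == 1 if the cell is filled, row 0 at the top).
    Assumes this is the feature set meant on the slide."""
    board = np.asarray(board)
    n_rows, n_cols = board.shape
    heights = np.zeros(n_cols, dtype=int)
    holes = 0
    for c in range(n_cols):
        filled = np.flatnonzero(board[:, c])
        if filled.size:
            heights[c] = n_rows - filled[0]
            holes += int(heights[c] - board[:, c].sum())  # gaps under the top
    diffs = np.abs(np.diff(heights))                      # 9 differences
    return np.concatenate([heights, diffs, [heights.max(), holes]])
```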