
An Introduction to Policy Search Methods

Thomas Furmston

January 23, 2017


Markov Decision Processes

Markov decision processes (MDPs) are the standard model for optimal control in a fully observable environment.

Successful applications include:

robotics,
board games, such as chess, backgammon & Go,
computer games, such as Tetris & Atari 2600 video games,
traffic management, elevator scheduling & helicopter flight control.
Markov Decision Processes - Notation

A Markov decision process is described by the tuple $(S, A, D, P, R)$, in which:

S - state space (finite set),
A - action space (finite set),
D - initial state distribution,
P - transition dynamics, a set of conditional distributions over the state space, $\{P(\cdot \mid s, a)\}_{(s,a) \in S \times A}$,
R - reward function, $R(\cdot, \cdot) : S \times A \to \mathbb{R}$.
Markov Decision Processes - Notation

Given an MDP we then have a policy, $\pi$.

This is a set of conditional distributions over the action space, $\{\pi(\cdot \mid s)\}_{s \in S}$, which is used to determine which action to take given the current state of the environment.

The policy can be optimised in order to maximise an objective.


Markov Decision Processes - Sampling

1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1)$
3: for $t = 1$ to $H$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1})$
7: end for
Algorithm 1: pseudocode for sampling a trajectory from a Markov decision process.
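
As a concrete companion to Algorithm 1, here is a minimal Python sketch of the same sampling loop for a finite MDP, assuming the model is given as NumPy arrays (the argument names D, P, R and policy are illustrative, not from the slides):

```python
import numpy as np

def sample_trajectory(D, P, R, policy, horizon, rng=None):
    """Sample one trajectory of length `horizon` from a finite MDP.

    D      : (|S|,) initial state distribution
    P      : (|S|, |A|, |S|) transition probabilities P[s, a, s']
    R      : (|S|, |A|) reward function
    policy : (|S|, |A|) conditional distributions pi(a | s)
    """
    rng = rng or np.random.default_rng()
    states, actions, rewards = [], [], []
    s = rng.choice(len(D), p=D)                       # s_1 ~ D(.)
    a = rng.choice(policy.shape[1], p=policy[s])      # a_1 ~ pi(. | s_1)
    for _ in range(horizon):
        states.append(s)
        actions.append(a)
        rewards.append(R[s, a])                       # r_t = R(s_t, a_t)
        s = rng.choice(P.shape[2], p=P[s, a])         # s_{t+1} ~ P(. | s_t, a_t)
        a = rng.choice(policy.shape[1], p=policy[s])  # a_{t+1} ~ pi(. | s_{t+1})
    return states, actions, rewards
```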
Tetris - An Example
Total Expected Reward

We shall consider the total expected reward with an infinite horizon and a discounted reward.

Discount factor: $\gamma \in [0, 1)$.

The objective function takes the form

$$U(\pi) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t^{\pi}}\big[\gamma^{t-1} R(s_t, a_t); \pi\big], \qquad (1)$$

in which $p_t^{\pi}$ is the occupancy marginal at time $t$ given $\pi$.
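
As a small illustration, averaging the discounted return of trajectories sampled with Algorithm 1 gives a Monte-Carlo estimate of (1); a minimal sketch, assuming a list of rewards as produced by the sampling routine above:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted return of one trajectory: sum_t gamma^(t-1) * r_t,
    truncated at the trajectory horizon."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))   # gamma^0, gamma^1, ...
    return float(discounts @ rewards)

# Averaging discounted_return over many sampled trajectories approximates U(pi).
```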


Value Functions & Dynamic Programming

Value functions are a core concept in Markov decision processes.

The state value function is given by

$$V^{\pi}(s) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t^{\pi}}\big[\gamma^{t-1} R(s_t, a_t) \mid s_1 = s; \pi\big],$$

which satisfies the fixed point equation, known as the Bellman equation,

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^{\pi}(s')\big]\Big].$$
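
For a finite MDP the Bellman equation is linear in $V^{\pi}$, so policy evaluation reduces to a linear solve. A minimal sketch, assuming the same tabular arrays as in the sampling sketch above:

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve the Bellman equation V = r_pi + gamma * P_pi @ V exactly.

    P      : (|S|, |A|, |S|) transition probabilities
    R      : (|S|, |A|) rewards
    policy : (|S|, |A|) action probabilities pi(a | s)
    """
    r_pi = np.einsum('sa,sa->s', policy, R)     # expected reward in each state
    P_pi = np.einsum('sa,sap->sp', policy, P)   # state-to-state transitions under pi
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # V^pi
```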
Value Functions & Dynamic Programming

The state-action value function is given by

$$Q^{\pi}(s, a) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t^{\pi}}\big[\gamma^{t-1} R(s_t, a_t) \mid s_1 = s, a_1 = a; \pi\big],$$

which can also be written in terms of the state value function,

$$Q^{\pi}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^{\pi}(s')\big].$$

The global optimum of (1) can be found through dynamic programming.
Dynamic Programming

Dynamic programming is infeasible for many real-world problems of interest.

As a result, most research has focused on obtaining approximate or locally optimal solutions, including:

approximate dynamic programming methods,
tree search methods,
local trajectory-optimization techniques, e.g., differential dynamic programming,
policy search methods.
Policy Search Methods

Policy search methods are typically specialized applications of techniques from numerical optimization.

As such, the policy is given some differentiable parametric form, denoted $\pi(a \mid s; w)$, or $\pi_w$, with policy parameters $w \in \mathcal{W} \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$.

For example,

$$\pi(a \mid s; w) = \frac{e^{w^{\top} \phi(a, s)}}{\sum_{a' \in A} e^{w^{\top} \phi(a', s)}}, \qquad (2)$$

where $\phi : A \times S \to \mathbb{R}^n$ is a feature mapping.
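
A minimal sketch of this softmax parametrisation and the log-policy gradient it induces, assuming a hypothetical feature function phi(a, s) that returns an n-dimensional NumPy vector:

```python
import numpy as np

def softmax_policy(w, phi, s, actions):
    """pi(a | s; w) proportional to exp(w^T phi(a, s)) over a finite action set."""
    logits = np.array([w @ phi(a, s) for a in actions])
    logits -= logits.max()                # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_policy(w, phi, s, a, actions):
    """grad_w log pi(a | s; w) = phi(a, s) - sum_a' pi(a' | s; w) phi(a', s)."""
    probs = softmax_policy(w, phi, s, actions)
    mean_feat = sum(p * phi(b, s) for p, b in zip(probs, actions))
    return phi(a, s) - mean_feat
```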


Policy Search Methods

We overload notation and write the objective function directly in terms of the parameter vector, i.e.,

$$U(w) := U(\pi_w), \qquad \forall w \in \mathcal{W}. \qquad (3)$$

Similarly, $\forall w \in \mathcal{W}$, we have

$$V(s; w) := V^{\pi_w}(s), \quad \forall s \in S,$$
$$Q(s, a; w) := Q^{\pi_w}(s, a), \quad \forall (s, a) \in S \times A,$$
$$p_t(s, a; w) := p_t^{\pi_w}(s, a), \quad \forall (s, a) \in S \times A.$$
Policy Search Methods

Local information, such as the gradient of the objective function, is used to update the policy in an incremental manner until convergence to a local optimum.

Benefits include:

General convergence guarantees.
Good anytime performance.
Only necessary to approximate a low-dimensional projection of the value function.
Easily extendible to models for partially observable environments, such as finite state controllers.
Policy Gradient Theorem

Theorem (Policy Gradient Theorem [1])

Given a Markov decision process with objective (1), then for any $w \in \mathcal{W}$ the gradient of (3) takes the form

$$\nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w)\, Q(s, a; w)\, \nabla_w \log \pi(a \mid s; w),$$

in which

$$p_{\gamma}(s, a; w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(s, a; w).$$
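
The theorem suggests a simple Monte-Carlo estimator: sample trajectories, use the return-to-go as an unbiased sample of $Q(s_t, a_t; w)$, and weight step $t$ by $\gamma^{t-1}$. A hedged sketch, assuming each trajectory is a list of (state, action, reward) tuples and a grad_log_pi(s, a) function returning the score vector for the current parameters (e.g. a closure around the softmax helper above):

```python
import numpy as np

def policy_gradient_estimate(trajectories, grad_log_pi, gamma):
    """Likelihood-ratio (REINFORCE-style) estimate of grad_w U(w)."""
    grads = []
    for traj in trajectories:
        rewards = np.array([r for (_, _, r) in traj])
        discounts = gamma ** np.arange(len(traj))
        # return-to-go from step t: sum_k gamma^k * r_{t+k}
        returns = np.array([discounts[: len(traj) - t] @ rewards[t:]
                            for t in range(len(traj))])
        g = sum(discounts[t] * returns[t] * grad_log_pi(s, a)
                for t, (s, a, _) in enumerate(traj))
        grads.append(g)
    return np.mean(grads, axis=0)
```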
An On-line Policy Search Method - Version 1

1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Calculate state-action value: $Q(s_t, a_t; w_t)$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, Q(s_t, a_t; w_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}$$
9: end for
Algorithm 2: pseudocode for an on-line policy search method.
Compatible Function Approximation

Definition (Compatible Function Approximation [1])

Let $f_w : S \times A \to \mathbb{R}$ be a function approximator to $Q^{\pi_w}$, which is parametrised by $v \in \mathbb{R}^n$.

$f_w$ is said to be compatible with respect to a policy parametrisation if:

$f_w$ is linear in $v$, i.e. $f_w(s, a; v) = v^{\top} \psi(s, a; w)$,
$\nabla_v f_w(s, a; v) = \nabla_w \log \pi(a \mid s; w)$.
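
In code, the compatible critic is simply a linear function of the score vector. A minimal sketch, assuming a caller-supplied grad_log_pi(s, a) helper such as the softmax one sketched earlier:

```python
import numpy as np

def compatible_critic(v, grad_log_pi, s, a):
    """f_w(s, a; v) = v^T psi(s, a; w) with psi(s, a; w) = grad_w log pi(a | s; w).
    Linearity in v and grad_v f = grad_w log pi hold by construction.
    Returns the critic value and its gradient with respect to v."""
    psi = np.asarray(grad_log_pi(s, a))
    return float(v @ psi), psi
```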
Compatible Function Approximation
Theorem (Policy Gradient Theorem with Compatible Function Approximation [1])

If $f_w$ is a function approximator that is compatible w.r.t. the given policy parametrisation, and

$$v^{*} = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w), \qquad (4)$$

with

$$T(v; w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \big( Q(s, a; w) - f_w(s, a; v) \big)^2, \qquad (5)$$

then

$$\nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w)\, f_w(s, a; v^{*})\, \nabla_w \log \pi(a \mid s; w).$$
An On-line Policy Search Method - Version 2
1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Optimise function approximation:
     $$v_t^{*} = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w_t)$$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t^{*})\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}$$
9: end for
Algorithm 3: pseudocode for an on-line policy search method.
Actor-Critic Methods

However, performing the optimisation

$$v^{*} = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w)$$

at every time-step will generally be prohibitively expensive.

Also, $w_{t+1} \approx w_t$ implies that $v_{t+1}^{*} \approx v_t^{*}$, which suggests that we update the function approximation parameters in an incremental manner.

These observations give rise to actor-critic methods.


Actor-Critic Methods

In these methods we iteratively optimise the policy parameters and the function approximation parameters at the same time.

For example, at each iteration we could have

$$w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t},$$
$$v_{t+1} = v_t + \beta_t\, g(w_t),$$

in which:

$g(w_t)$ is a step direction in the function approximation parameter space (algorithm dependent),
$\{\beta_t\}_{t=1}^{\infty}$ is a step-size sequence for the function approximation parameters.
Actor-Critic Methods

Different types of critic can be considered, for example a batch-based solution of the least-squares problem (5).

A popular approach in the literature is to use temporal difference learning [2]. We follow the approach of [2] and consider a linear compatible critic learnt through TD(0).

In this case the critic update at the $t$-th iteration takes the form

$$\delta_t = R(s_t, a_t) + \gamma\, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t),$$
$$v_{t+1} = v_t + \beta_t\, \delta_t\, \psi(s_t, a_t; w_t),$$

where $\psi(s, a; w) = \nabla_w \log \pi(a \mid s; w)$ denotes the compatible critic features.
An On-line Actor-Critic Algorithm
1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic:
     $$\delta_t = R(s_t, a_t) + \gamma\, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$$
     $$v_{t+1} = v_t + \beta_t\, \delta_t\, \psi(s_t, a_t; w_t)$$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}$$
9: end for
Algorithm 4: pseudocode for the TD(0) actor-critic algorithm.
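
A minimal sketch of Algorithm 4, assuming a hypothetical environment object with reset() and step(a) methods, a policy(s, w) function returning action probabilities, a grad_log_pi(s, a, w) helper, and step-size schedules alpha(t), beta(t); all of these interfaces are illustrative, not from the slides:

```python
import numpy as np

def td0_actor_critic(env, policy, grad_log_pi, n_param, gamma,
                     alpha, beta, num_steps, rng=None):
    """On-line TD(0) actor-critic with a linear compatible critic."""
    rng = rng or np.random.default_rng()
    w = np.zeros(n_param)   # actor (policy) parameters
    v = np.zeros(n_param)   # critic parameters (same dimension, by compatibility)

    s = env.reset()
    a = rng.choice(len(policy(s, w)), p=policy(s, w))
    for t in range(1, num_steps + 1):
        s_next, r = env.step(a)                    # r_t and s_{t+1} ~ P(. | s_t, a_t)
        a_next = rng.choice(len(policy(s_next, w)), p=policy(s_next, w))

        psi = grad_log_pi(s, a, w)                 # compatible features psi(s_t, a_t; w_t)
        psi_next = grad_log_pi(s_next, a_next, w)
        f, f_next = v @ psi, v @ psi_next          # linear compatible critic values

        delta = r + gamma * f_next - f             # TD(0) error
        v = v + beta(t) * delta * psi              # critic update
        w = w + alpha(t) * f * psi                 # actor update

        s, a = s_next, a_next
    return w, v
```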
Actor-Critic Methods

To prove convergence we need the two step-size sequences to satisfy the following criteria.

Robbins-Monro conditions:

$$\alpha_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty,$$

$$\beta_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t^2 < \infty.$$

Policy parameters are updated at a slower rate than the function approximation parameters:

$$\lim_{t \to \infty} \frac{\alpha_t}{\beta_t} = 0.$$
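
As an illustration, one pair of schedules that satisfies all three conditions (the exponents are illustrative choices, not from the slides):

```python
def alpha(t):                   # actor (policy) step size
    return 1.0 / t              # sum 1/t diverges, sum 1/t^2 converges

def beta(t):                    # critic step size
    return 1.0 / t ** (2 / 3)   # sum t^(-2/3) diverges, sum t^(-4/3) converges

# alpha(t) / beta(t) = t^(-1/3) -> 0, so the policy moves on the slower timescale.
print([alpha(t) / beta(t) for t in (10, 1000, 100000)])
```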
Natural Gradient Ascent

Steepest gradient ascent often gives poor results in practice, e.g., due to poor scaling of the objective function.

As a result, alternative optimisation techniques are often considered.

A popular alternative is natural gradient ascent, which was introduced into the policy search literature in the work of [2].
Natural Policy Gradients

In natural gradient ascent the parameter update takes the form

$$w_{\text{new}} = w + G^{-1}(w)\, \nabla_w U(w),$$

in which $G(w)$ is the Fisher information matrix of the policy distribution, averaged over the state distribution, i.e.,

$$G(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w)\, \nabla_w \log \pi(a \mid s; w)\, \nabla_w \log \pi(a \mid s; w)^{\top}.$$
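
A sketch of estimating the natural gradient from samples by forming the Fisher matrix and solving a linear system; the small damping term is an implementation convenience, not part of the slides:

```python
import numpy as np

def natural_gradient(grad_logs, q_values, weights, damping=1e-3):
    """grad_logs : grad_w log pi(a | s; w) vectors for sampled (s, a) pairs
    q_values  : corresponding Q(s, a; w) estimates
    weights   : sample weights approximating p_gamma(s, a; w)"""
    n = len(grad_logs[0])
    G = np.zeros((n, n))
    g = np.zeros(n)
    for psi, q, p in zip(grad_logs, q_values, weights):
        psi = np.asarray(psi)
        G += p * np.outer(psi, psi)   # Fisher information matrix G(w)
        g += p * q * psi              # vanilla policy gradient
    return np.linalg.solve(G + damping * np.eye(n), g)   # G(w)^{-1} grad_w U(w)
```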
Natural Actor-Critic

Theorem (Natural Policy Gradients with Compatible Function Approximation [2])

Suppose that $f$ is a linear function approximator that is compatible w.r.t. the given policy parametrisation.

If $v^{*} \in \mathbb{R}^n$ are the optimal critic parameters, i.e., $v^{*}$ minimises (5), then

$$v^{*} = G^{-1}(w)\, \nabla_w U(w).$$

In other words, the natural gradient is given by the optimal critic parameters.
An On-line Natural Actor-Critic Algorithm
1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic:
     $$\delta_t = R(s_t, a_t) + \gamma\, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$$
     $$v_{t+1} = v_t + \beta_t\, \delta_t\, \psi(s_t, a_t; w_t)$$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, v_t$$
9: end for
Algorithm 5: pseudocode for the TD(0) natural actor-critic algorithm.
Tetris
As an example of policy gradients in action we consider the Tetris domain.

We consider the parametrisation in (2).

For a given state-action pair we consider the following features, each evaluated on the board that results from taking the given action in the given state:

number of holes in the board,
column heights in the board,
differences in column heights,
maximum column height.

Total of 21 features (a feature-extraction sketch follows below).
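
A minimal sketch of such a feature vector for a standard 10-column board; the exact feature definitions here are an assumption about the board summary, kept only to show how the count of 21 arises:

```python
import numpy as np

def tetris_features(heights, holes):
    """heights : length-10 array of column heights of the resulting board
    holes   : number of covered empty cells in the resulting board
    Returns 10 heights + 9 adjacent height differences + max height + holes = 21."""
    heights = np.asarray(heights, dtype=float)
    diffs = np.abs(np.diff(heights))           # 9 differences between adjacent columns
    return np.concatenate((heights,            # 10 column heights
                           diffs,              # 9 height differences
                           [heights.max()],    # maximum column height
                           [float(holes)]))    # number of holes

phi = tetris_features([3, 4, 4, 2, 5, 6, 3, 2, 1, 0], holes=2)
assert phi.shape == (21,)
```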
Tetris
Bibliography - Policy Search Methods I

R. Sutton, D. McAllester, S. Singh, and Y. Mansour.


Policy gradient methods for reinforcement learning with
function approximation.
NIPS, 13, 2000.
S. Kakade.
A natural policy gradient.
NIPS, 14, 2002.
T. Furmston, G. Lever, and D. Barber.
Approximate Newton methods for policy search in Markov
decision processes.
Journal of Machine Learning Research, 17:1-51, 2016.
Bibliography - Actor-Critic Methods I

V. Konda and J. Tsitsiklis.


Actor-critic algorithms.
NIPS, 11:1008-1014, 1999.
S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee.
Natural actor-critic algorithms.
Automatica, 45:2471-2482, 2009.
Bibliography - Policy Search Methods & Neural
Networks I

N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and


Y. Tassa.
Learning continuous control policies by stochastic value
gradients.
NIPS, 27:2926-2934, 2015.
T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra.
Continuous control with deep reinforcement learning.
ICLR, 4, 2016.
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and
P. Moritz.
Trust region policy optimization.
ICML, 32:1889-1897, 2015.
Bibliography - Misc I

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and


M. Riedmiller.
Deterministic policy gradient algorithms.
ICML, 31:387-395, 2014.
R. Sutton.
Learning to predict by the method of temporal differences.
Machine Learning, 3:9-44, 1988.
