
An Introduction to Policy Search Methods

Thomas Furmston

January 23, 2017


Markov Decision Processes

Markov decision processes (MDPs) are the standard model for optimal control in a fully observable environment.

Successful applications include:

robotics,
board games, such as chess, backgammon & Go,
computer games, such as Tetris & Atari 2600 video games,
traffic management, elevator scheduling & helicopter flight control.
Markov Decision Processes - Notation

A Markov decision process is described by the tuple $(S, A, D, P, R)$, in which:

S - state space (finite set),
A - action space (finite set),
D - initial state distribution,
P - transition dynamics, a set of conditional distributions over the state space, $\{P(\cdot \mid s, a)\}_{(s,a) \in S \times A}$,
R - reward function, $R(\cdot, \cdot) : S \times A \to \mathbb{R}$.
Markov Decision Processes - Notation

Given an MDP we then have a policy, $\pi$.

This is a set of conditional distributions over the action space, $\{\pi(\cdot \mid s)\}_{s \in S}$, which is used to determine which action to take given the current state of the environment.

The policy can be optimised in order to maximise an objective.


Markov Decision Processes - Sampling

1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1)$
3: for $t = 1$ to $H$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1})$
7: end for
Algorithm 1: pseudocode for sampling a trajectory from a Markov decision process.
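
As a concrete companion to Algorithm 1, here is a minimal Python sketch of the same sampling loop for a finite MDP, assuming the model is given as NumPy arrays (the argument names D, P, R and policy are illustrative, not from the slides):

```python
import numpy as np

def sample_trajectory(D, P, R, policy, horizon, rng=None):
    """Sample one trajectory of length `horizon` from a finite MDP.

    D      : (|S|,) initial state distribution
    P      : (|S|, |A|, |S|) transition probabilities P[s, a, s']
    R      : (|S|, |A|) reward function
    policy : (|S|, |A|) conditional distributions pi(a | s)
    """
    rng = rng or np.random.default_rng()
    states, actions, rewards = [], [], []
    s = rng.choice(len(D), p=D)                       # s_1 ~ D(.)
    a = rng.choice(policy.shape[1], p=policy[s])      # a_1 ~ pi(. | s_1)
    for _ in range(horizon):
        states.append(s)
        actions.append(a)
        rewards.append(R[s, a])                       # r_t = R(s_t, a_t)
        s = rng.choice(P.shape[2], p=P[s, a])         # s_{t+1} ~ P(. | s_t, a_t)
        a = rng.choice(policy.shape[1], p=policy[s])  # a_{t+1} ~ pi(. | s_{t+1})
    return states, actions, rewards
```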
Tetris - An Example
Total Expected Reward

We shall consider the total expected reward with an infinite horizon and a discounted reward.

Discount factor: $\gamma \in [0, 1)$.

The objective function takes the form

$$U(\pi) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t^{\pi}}\big[\gamma^{t-1} R(s_t, a_t); \pi\big], \qquad (1)$$

in which $p_t^{\pi}$ is the occupancy marginal at time $t$ given $\pi$.
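
As a small illustration, averaging the discounted return of trajectories sampled with Algorithm 1 gives a Monte-Carlo estimate of (1); a minimal sketch, assuming a list of rewards as produced by the sampling routine above:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted return of one trajectory: sum_t gamma^(t-1) * r_t,
    truncated at the trajectory horizon."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))   # gamma^0, gamma^1, ...
    return float(discounts @ rewards)

# Averaging discounted_return over many sampled trajectories approximates U(pi).
```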


Value Functions & Dynamic Programming

Value functions are a core concept in Markov decision processes.

The state value function is given by

$$V^{\pi}(s) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t^{\pi}}\big[\gamma^{t-1} R(s_t, a_t) \mid s_1 = s; \pi\big],$$

which satisfies the fixed point equation, known as the Bellman equation,

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^{\pi}(s')\big]\Big].$$
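
For a finite MDP the Bellman equation is linear in $V^{\pi}$, so policy evaluation reduces to a linear solve. A minimal sketch, assuming the same tabular arrays as in the sampling sketch above:

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve the Bellman equation V = r_pi + gamma * P_pi @ V exactly.

    P      : (|S|, |A|, |S|) transition probabilities
    R      : (|S|, |A|) rewards
    policy : (|S|, |A|) action probabilities pi(a | s)
    """
    r_pi = np.einsum('sa,sa->s', policy, R)     # expected reward in each state
    P_pi = np.einsum('sa,sap->sp', policy, P)   # state-to-state transitions under pi
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # V^pi
```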
Value Functions & Dynamic Programming

The state-action value function is given by

$$Q^{\pi}(s, a) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t^{\pi}}\big[\gamma^{t-1} R(s_t, a_t) \mid s_1 = s, a_1 = a; \pi\big],$$

which can also be written in terms of the state value function,

$$Q^{\pi}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^{\pi}(s')\big].$$

The global optimum of (1) can be found through dynamic programming.
Dynamic Programming

Dynamic programming is infeasible for many real-world problems of interest.

As a result, most research has focused on obtaining approximate or locally optimal solutions, including:

approximate dynamic programming methods,
tree search methods,
local trajectory-optimization techniques, e.g., differential dynamic programming,
policy search methods.
Policy Search Methods

Policy search methods are typically specialized applications of techniques from numerical optimization.

As such, the policy is given some differentiable parametric form, denoted $\pi(a \mid s; w)$, or $\pi_w$, with policy parameters $w \in \mathcal{W} \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$.

For example,

$$\pi(a \mid s; w) = \frac{e^{w^{\top} \phi(a, s)}}{\sum_{a' \in A} e^{w^{\top} \phi(a', s)}}, \qquad (2)$$

where $\phi : A \times S \to \mathbb{R}^n$ is a feature mapping.
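
A minimal sketch of this softmax parametrisation and the log-policy gradient it induces, assuming a hypothetical feature function phi(a, s) that returns an n-dimensional NumPy vector:

```python
import numpy as np

def softmax_policy(w, phi, s, actions):
    """pi(a | s; w) proportional to exp(w^T phi(a, s)) over a finite action set."""
    logits = np.array([w @ phi(a, s) for a in actions])
    logits -= logits.max()                # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_policy(w, phi, s, a, actions):
    """grad_w log pi(a | s; w) = phi(a, s) - sum_a' pi(a' | s; w) phi(a', s)."""
    probs = softmax_policy(w, phi, s, actions)
    mean_feat = sum(p * phi(b, s) for p, b in zip(probs, actions))
    return phi(a, s) - mean_feat
```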


Policy Search Methods

We overload notation and write the objective function directly in terms of the parameter vector, i.e.,

$$U(w) := U(\pi_w), \qquad \forall w \in \mathcal{W}. \qquad (3)$$

Similarly, $\forall w \in \mathcal{W}$, we have

$$V(s; w) := V^{\pi_w}(s), \quad \forall s \in S,$$
$$Q(s, a; w) := Q^{\pi_w}(s, a), \quad \forall (s, a) \in S \times A,$$
$$p_t(s, a; w) := p_t^{\pi_w}(s, a), \quad \forall (s, a) \in S \times A.$$
Policy Search Methods

Local information, such as the gradient of the objective function, is used to update the policy in an incremental manner until convergence to a local optimum.

Benefits include:

General convergence guarantees.
Good anytime performance.
Only necessary to approximate a low-dimensional projection of the value function.
Easily extendible to models for partially observable environments, such as finite state controllers.
Policy Gradient Theorem

Theorem (Policy Gradient Theorem [1])

Given a Markov decision process with objective (1), then for any $w \in \mathcal{W}$ the gradient of (3) takes the form

$$\nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w)\, Q(s, a; w)\, \nabla_w \log \pi(a \mid s; w),$$

in which

$$p_{\gamma}(s, a; w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(s, a; w).$$
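
The theorem suggests a simple Monte-Carlo estimator: sample trajectories, use the return-to-go as an unbiased sample of $Q(s_t, a_t; w)$, and weight step $t$ by $\gamma^{t-1}$. A hedged sketch, assuming each trajectory is a list of (state, action, reward) tuples and a grad_log_pi(s, a) function returning the score vector for the current parameters (e.g. a closure around the softmax helper above):

```python
import numpy as np

def policy_gradient_estimate(trajectories, grad_log_pi, gamma):
    """Likelihood-ratio (REINFORCE-style) estimate of grad_w U(w)."""
    grads = []
    for traj in trajectories:
        rewards = np.array([r for (_, _, r) in traj])
        discounts = gamma ** np.arange(len(traj))
        # return-to-go from step t: sum_k gamma^k * r_{t+k}
        returns = np.array([discounts[: len(traj) - t] @ rewards[t:]
                            for t in range(len(traj))])
        g = sum(discounts[t] * returns[t] * grad_log_pi(s, a)
                for t, (s, a, _) in enumerate(traj))
        grads.append(g)
    return np.mean(grads, axis=0)
```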
An On-line Policy Search Method - Version 1

1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Calculate state-action value: $Q(s_t, a_t; w_t)$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, Q(s_t, a_t; w_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}$$
9: end for
Algorithm 2: pseudocode for an on-line policy search method.
Compatible Function Approximation

Definition (Compatible Function Approximation [1])

Let $f_w : S \times A \to \mathbb{R}$ be a function approximator to $Q^{\pi_w}$, which is parametrised by $v \in \mathbb{R}^n$.

$f_w$ is said to be compatible with respect to a policy parametrisation if:

$f_w$ is linear in $v$, i.e. $f_w(s, a; v) = v^{\top} \psi(s, a; w)$,
$\nabla_v f_w(s, a; v) = \nabla_w \log \pi(a \mid s; w)$.
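
In code, the compatible critic is simply a linear function of the score vector. A minimal sketch, assuming a caller-supplied grad_log_pi(s, a) helper such as the softmax one sketched earlier:

```python
import numpy as np

def compatible_critic(v, grad_log_pi, s, a):
    """f_w(s, a; v) = v^T psi(s, a; w) with psi(s, a; w) = grad_w log pi(a | s; w).
    Linearity in v and grad_v f = grad_w log pi hold by construction.
    Returns the critic value and its gradient with respect to v."""
    psi = np.asarray(grad_log_pi(s, a))
    return float(v @ psi), psi
```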
Compatible Function Approximation
Theorem (Policy Gradient Theorem with Compatible Function Approximation [1])

If $f_w$ is a function approximator that is compatible w.r.t. the given policy parametrisation, and

$$v^{*} = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w), \qquad (4)$$

with

$$T(v; w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \big( Q(s, a; w) - f_w(s, a; v) \big)^2, \qquad (5)$$

then

$$\nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w)\, f_w(s, a; v^{*})\, \nabla_w \log \pi(a \mid s; w).$$
An On-line Policy Search Method - Version 2
1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Optimise function approximation:
     $$v_t^{*} = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w_t)$$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t^{*})\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}$$
9: end for
Algorithm 3: pseudocode for an on-line policy search method.
Actor-Critic Methods

However, performing the optimisation

$$v^{*} = \operatorname*{argmin}_{v \in \mathbb{R}^n} T(v; w)$$

at every time-step will generally be prohibitively expensive.

Also, $w_{t+1} \approx w_t$ implies that $v_{t+1}^{*} \approx v_t^{*}$, which suggests that we update the function approximation parameters in an incremental manner.

These observations give rise to actor-critic methods.


Actor-Critic Methods

In these methods we iteratively optimise the policy parameters and the function approximation parameters at the same time.

For example, at each iteration we could have

$$w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t},$$
$$v_{t+1} = v_t + \beta_t\, g(w_t),$$

in which:

$g(w_t)$ is a step direction in the function approximation parameter space (algorithm dependent),
$\{\beta_t\}_{t=1}^{\infty}$ is a step-size sequence for the function approximation parameters.
Actor-Critic Methods

Different types of critic can be considered, for example a batch-based solution of the least-squares problem (5).

A popular approach in the literature is to use temporal difference learning [2]. We follow the approach of [2] and consider a linear compatible critic learnt through TD(0).

In this case the critic update at the $t$-th iteration takes the form

$$\delta_t = R(s_t, a_t) + \gamma\, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t),$$
$$v_{t+1} = v_t + \beta_t\, \delta_t\, \psi(s_t, a_t; w_t),$$

where $\psi(s, a; w) = \nabla_w \log \pi(a \mid s; w)$ denotes the compatible critic features.
An On-line Actor-Critic Algorithm
1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic:
     $$\delta_t = R(s_t, a_t) + \gamma\, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$$
     $$v_{t+1} = v_t + \beta_t\, \delta_t\, \psi(s_t, a_t; w_t)$$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, f_{w_t}(s_t, a_t; v_t)\, \nabla_w \log \pi(a_t \mid s_t; w)\big|_{w = w_t}$$
9: end for
Algorithm 4: pseudocode for the TD(0) actor-critic algorithm.
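
A minimal sketch of Algorithm 4, assuming a hypothetical environment object with reset() and step(a) methods, a policy(s, w) function returning action probabilities, a grad_log_pi(s, a, w) helper, and step-size schedules alpha(t), beta(t); all of these interfaces are illustrative, not from the slides:

```python
import numpy as np

def td0_actor_critic(env, policy, grad_log_pi, n_param, gamma,
                     alpha, beta, num_steps, rng=None):
    """On-line TD(0) actor-critic with a linear compatible critic."""
    rng = rng or np.random.default_rng()
    w = np.zeros(n_param)   # actor (policy) parameters
    v = np.zeros(n_param)   # critic parameters (same dimension, by compatibility)

    s = env.reset()
    a = rng.choice(len(policy(s, w)), p=policy(s, w))
    for t in range(1, num_steps + 1):
        s_next, r = env.step(a)                    # r_t and s_{t+1} ~ P(. | s_t, a_t)
        a_next = rng.choice(len(policy(s_next, w)), p=policy(s_next, w))

        psi = grad_log_pi(s, a, w)                 # compatible features psi(s_t, a_t; w_t)
        psi_next = grad_log_pi(s_next, a_next, w)
        f, f_next = v @ psi, v @ psi_next          # linear compatible critic values

        delta = r + gamma * f_next - f             # TD(0) error
        v = v + beta(t) * delta * psi              # critic update
        w = w + alpha(t) * f * psi                 # actor update

        s, a = s_next, a_next
    return w, v
```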
Actor-Critic Methods

To prove convergence we need the two step-size sequences to satisfy the following criteria.

Robbins-Monro conditions:

$$\alpha_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty,$$

$$\beta_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t^2 < \infty.$$

Policy parameters are updated at a slower rate than the function approximation parameters:

$$\lim_{t \to \infty} \frac{\alpha_t}{\beta_t} = 0.$$
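
As an illustration, one pair of schedules that satisfies all three conditions (the exponents are illustrative choices, not from the slides):

```python
def alpha(t):                   # actor (policy) step size
    return 1.0 / t              # sum 1/t diverges, sum 1/t^2 converges

def beta(t):                    # critic step size
    return 1.0 / t ** (2 / 3)   # sum t^(-2/3) diverges, sum t^(-4/3) converges

# alpha(t) / beta(t) = t^(-1/3) -> 0, so the policy moves on the slower timescale.
print([alpha(t) / beta(t) for t in (10, 1000, 100000)])
```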
Natural Gradient Ascent

Steepest gradient ascent often gives poor results in practice, e.g., due to poor scaling of the objective function.

As a result, alternative optimisation techniques are often considered.

A popular alternative is natural gradient ascent, which was introduced into the policy search literature in the work of [2].
Natural Policy Gradients

In natural gradient ascent the parameter update takes the form

$$w_{\text{new}} = w + G^{-1}(w)\, \nabla_w U(w),$$

in which $G(w)$ is the Fisher information matrix of the policy distribution, averaged over the state distribution, i.e.,

$$G(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w)\, \nabla_w \log \pi(a \mid s; w)\, \nabla_w \log \pi(a \mid s; w)^{\top}.$$
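
A sketch of estimating the natural gradient from samples by forming the Fisher matrix and solving a linear system; the small damping term is an implementation convenience, not part of the slides:

```python
import numpy as np

def natural_gradient(grad_logs, q_values, weights, damping=1e-3):
    """grad_logs : grad_w log pi(a | s; w) vectors for sampled (s, a) pairs
    q_values  : corresponding Q(s, a; w) estimates
    weights   : sample weights approximating p_gamma(s, a; w)"""
    n = len(grad_logs[0])
    G = np.zeros((n, n))
    g = np.zeros(n)
    for psi, q, p in zip(grad_logs, q_values, weights):
        psi = np.asarray(psi)
        G += p * np.outer(psi, psi)   # Fisher information matrix G(w)
        g += p * q * psi              # vanilla policy gradient
    return np.linalg.solve(G + damping * np.eye(n), g)   # G(w)^{-1} grad_w U(w)
```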
Natural Actor-Critic

Theorem (Natural Policy Gradients with Compatible Function Approximation [2])

Suppose that $f$ is a linear function approximator that is compatible w.r.t. the given policy parametrisation.

If $v^{*} \in \mathbb{R}^n$ are the optimal critic parameters, i.e., $v^{*}$ minimises (5), then

$$v^{*} = G^{-1}(w)\, \nabla_w U(w).$$

In other words, the natural gradient is given by the optimal critic parameters.
An On-line Natural Actor-Critic Algorithm
1: Sample initial state: $s_1 \sim D(\cdot)$
2: Sample initial action: $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward: $r_t = R(s_t, a_t)$
5:   Sample next state: $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action: $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic:
     $$\delta_t = R(s_t, a_t) + \gamma\, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$$
     $$v_{t+1} = v_t + \beta_t\, \delta_t\, \psi(s_t, a_t; w_t)$$
8:   Update policy:
     $$w_{t+1} = w_t + \alpha_t\, v_t$$
9: end for
Algorithm 5: pseudocode for the TD(0) natural actor-critic algorithm.
Tetris
As an example of policy gradients in action we consider the Tetris domain.

We consider the parametrisation in (2).

For a given state-action pair we consider the following features, each evaluated on the board that results from taking the given action in the given state:

number of holes in the board,
column heights in the board,
differences in column heights,
maximum column height.

Total of 21 features (a feature-extraction sketch follows below).
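
A minimal sketch of such a feature vector for a standard 10-column board; the exact feature definitions here are an assumption about the board summary, kept only to show how the count of 21 arises:

```python
import numpy as np

def tetris_features(heights, holes):
    """heights : length-10 array of column heights of the resulting board
    holes   : number of covered empty cells in the resulting board
    Returns 10 heights + 9 adjacent height differences + max height + holes = 21."""
    heights = np.asarray(heights, dtype=float)
    diffs = np.abs(np.diff(heights))           # 9 differences between adjacent columns
    return np.concatenate((heights,            # 10 column heights
                           diffs,              # 9 height differences
                           [heights.max()],    # maximum column height
                           [float(holes)]))    # number of holes

phi = tetris_features([3, 4, 4, 2, 5, 6, 3, 2, 1, 0], holes=2)
assert phi.shape == (21,)
```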
Tetris
Bibliography - Policy Search Methods I

R. Sutton, D. McAllester, S. Singh, and Y. Mansour.


Policy gradient methods for reinforcement learning with
function approximation.
NIPS, 13, 2000.
S. Kakade.
A natural policy gradient.
NIPS, 14, 2002.
T. Furmston, G. Lever, and D. Barber.
Approximate Newton methods for policy search in Markov
decision processes.
Journal of Machine Learning Research, 17:1-51, 2016.
Bibliography - Actor-Critic Methods I

V. Konda and J. Tsitsiklis.


Actor-critic algorithms.
NIPS, 11:1008-1014, 1999.
S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee.
Natural actor-critic algorithms.
Automatica, 45:2471-2482, 2009.
Bibliography - Policy Search Methods & Neural
Networks I

N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and


Y. Tassa.
Learning continuous control policies by stochastic value
gradients.
NIPS, 27:2926-2934, 2015.
T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra.
Continuous control with deep reinforcement learning.
ICLR, 4, 2016.
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and
P. Moritz.
Trust region policy optimization.
ICML, 32:1889-1897, 2015.
Bibliography - Misc I

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and


M. Riedmiller.
Deterministic policy gradient algorithms.
ICML, 31:387-395, 2014.
R. Sutton.
Learning to predict by the method of temporal differences.
Machine Learning, 3:9-44, 1988.
