
Reinforcement Learning

Fei-Fei Li, Justin Johnson & Serena Yeung. CS231n Lecture 14, May 23, 2017

Administrative

Grades:

- Midterm grades released last night; see Piazza for more information and statistics
- A2 and milestone grades scheduled for later this week

Projects:

- All teams must register their project; see Piazza for the registration form
- Tiny ImageNet evaluation server is online

Survey:

- Please fill out the course survey!
- Link on Piazza or https://goo.gl/forms/eQpVW7IPjqapsDkB2

So far: Supervised Learning

Data: (x, y), where x is data and y is a label

Goal: learn a function mapping x -> y

Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.

So far: Unsupervised Learning

Data: x. Just data, no labels!

Goal: learn some underlying hidden structure of the data

Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.

(Figures: 1-d and 2-d density estimation examples; the 2-d density images are CC0 public domain.)

Today: Reinforcement Learning

Problems involving an agent interacting with an environment, which provides numeric reward signals.

Goal: learn how to take actions in order to maximize reward.

Overview

- Markov Decision Processes
- Q-Learning
- Policy Gradients

Reinforcement Learning: the agent-environment loop

At each time step the environment presents the agent with a state st. The agent chooses an action at, and the environment responds with a reward rt and the next state st+1. The loop then repeats.

Cart-Pole Problem

Objective: balance a pole on top of a movable cart

State: pole angle, angular speed, cart position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

Robot Locomotion

Objective: make the robot move forward

State: angle and position of the joints
Action: torques applied on the joints
Reward: 1 at each time step the robot is upright + forward movement

Atari Games

Objective: complete the game with the highest score

State: raw pixel inputs of the game state
Action: game controls, e.g. Left, Right, Up, Down
Reward: score increase/decrease at each time step

Go

Objective: win the game!

State: position of all pieces
Action: where to put the next piece down
Reward: 1 if we win at the end of the game, 0 otherwise

How can we mathematically formalize the RL problem?

Markov Decision Process

- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$:

- $\mathcal{S}$: set of possible states
- $\mathcal{A}$: set of possible actions
- $\mathcal{R}$: distribution of reward given a (state, action) pair
- $\mathbb{P}$: transition probability, i.e. distribution over the next state given a (state, action) pair
- $\gamma$: discount factor

Markov Decision Process

- At time step t=0, the environment samples an initial state $s_0 \sim p(s_0)$
- Then, for t=0 until done:
  - Agent selects action $a_t$
  - Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
  - Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
  - Agent receives reward $r_t$ and next state $s_{t+1}$
- A policy $\pi$ is a function from states to actions that specifies what action to take in each state
- Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_{t \geq 0} \gamma^t r_t$ (a rollout sketch follows below)
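To make this loop concrete, here is a minimal rollout sketch under a fixed random policy. The two-state toy MDP, its reward values, and its transition rule are made up purely for illustration; they are not from the lecture.

```python
import random

# Hypothetical toy MDP: 2 states, 2 actions, made-up dynamics for illustration only.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9

def sample_transition(s, a):
    """Environment samples reward r_t and next state s_{t+1} given (s_t, a_t)."""
    r = 1.0 if (s == 1 and a == 0) else -0.1   # made-up reward distribution
    s_next = random.choice(STATES)             # made-up transition distribution
    return r, s_next

def random_policy(s):
    """A uniformly random policy: maps a state to an action."""
    return random.choice(ACTIONS)

def rollout(policy, horizon=10):
    """Run one episode and return the cumulative discounted reward sum_t gamma^t * r_t."""
    s = random.choice(STATES)                  # s0 ~ p(s0)
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                          # agent selects a_t
        r, s = sample_transition(s, a)         # environment returns r_t and s_{t+1}
        total += discount * r
        discount *= GAMMA
    return total

print(rollout(random_policy))
```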

A simple MDP: Grid World

- States: the cells of the grid
- Actions: {1. right, 2. left, 3. up, 4. down}
- Reward: a negative reward for each transition (e.g. r = -1)
- Objective: reach one of the terminal states in the least number of actions

The optimal policy π*

We want to find the optimal policy π* that maximizes the sum of rewards.

How do we handle the randomness (initial state, transition probabilities, ...)? Maximize the expected sum of rewards!

Formally:
$$\pi^* = \arg\max_{\pi} \, \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid \pi\right], \quad \text{with } s_0 \sim p(s_0),\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)$$

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, ...

The value function at state s is the expected cumulative reward from following the policy from state s:
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, \pi\right]$$

The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$$

Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
$$Q^*(s, a) = \max_{\pi} \, \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$$

Q* satisfies the Bellman equation:
$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$

Intuition: if the optimal state-action values for the next time step, Q*(s', a'), are known, then the optimal strategy is to take the action that maximizes the expected value of $r + \gamma \max_{a'} Q^*(s', a')$.

The optimal policy π* corresponds to taking the best action in any state as specified by Q*.

Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update,
$$Q_{i+1}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$$
and $Q_i$ converges to $Q^*$ as $i \to \infty$.

What is the problem with this? It is not scalable: we must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game-state pixels, it is computationally infeasible to compute this for the entire state space! (A tiny tabular sketch of the update itself follows below.)

Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
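As a concrete tabular illustration of the Bellman backup above, here is a minimal Q-value-iteration sketch on a made-up deterministic 4-state corridor; the environment, its rewards, and the helper names are hypothetical, chosen only to keep the example tiny.

```python
import numpy as np

# Hypothetical 1x4 corridor: states 0..3, state 3 is terminal; actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9
TERMINAL = 3

def step(s, a):
    """Deterministic toy dynamics: r = -1 per move; nothing happens in the terminal state."""
    if s == TERMINAL:
        return 0.0, s
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return -1.0, s_next

Q = np.zeros((N_STATES, N_ACTIONS))
for i in range(100):
    Q_new = np.zeros_like(Q)
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            r, s_next = step(s, a)
            # Bellman backup: Q_{i+1}(s, a) = E[r + gamma * max_a' Q_i(s', a')]
            Q_new[s, a] = r + (0.0 if s == TERMINAL else GAMMA * Q[s_next].max())
    Q = Q_new

print(np.round(Q, 2))    # converged Q-values
print(Q.argmax(axis=1))  # greedy action per state: move right toward the terminal state
```

The update sweeps every (state, action) pair on every iteration, which is exactly why this approach does not scale to pixel-valued states.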

Solving for the optimal policy: Q-learning

Q-learning: use a function approximator to estimate the action-value function,
$$Q(s, a; \theta) \approx Q^*(s, a)$$
where θ are the function parameters (weights).

If the function approximator is a deep neural network => deep Q-learning!

Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation:
$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$

Forward pass. Loss function:
$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\big(y_i - Q(s, a; \theta_i)\big)^2\right], \quad \text{where } y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right]$$

Iteratively try to make the Q-value close to the target value yi it should have if the Q-function corresponds to the optimal Q* (and the optimal policy π*).

Backward pass. Gradient update (with respect to the Q-function parameters θ):
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot),\, s' \sim \mathcal{E}}\left[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$

A minimal implementation sketch of this forward/backward pass follows below.
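Here is a minimal sketch of the forward/backward pass in PyTorch, assuming a generic q_net module and a batch of transitions. The tensor shapes, the tiny network in the usage example, and the use of a detached target are my own illustrative choices, not code from the lecture.

```python
import torch
import torch.nn as nn

def q_learning_step(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on the squared Bellman error for a batch of transitions."""
    s, a, r, s_next, done = batch   # states, actions, rewards, next states, done flags

    # Forward pass: Q(s, a; theta) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Target y = r + gamma * max_a' Q(s', a'); no gradient flows through the target.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, y)

    # Backward pass: gradient update with respect to the Q-function parameters theta.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny usage example with made-up dimensions (4-dim states, 2 actions, batch of 8).
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
batch = (torch.randn(8, 4), torch.randint(0, 2, (8,)), torch.randn(8),
         torch.randn(8, 4), torch.zeros(8))
print(q_learning_step(q_net, opt, batch))
```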

Case study: playing Atari games [Mnih et al. NIPS Workshop 2013; Nature 2015]

Objective: complete the game with the highest score

State: raw pixel inputs of the game state
Action: game controls, e.g. Left, Right, Up, Down
Reward: score increase/decrease at each time step

Q-network architecture [Mnih et al. NIPS Workshop 2013; Nature 2015]

Q(s, a; θ): a neural network with weights θ.

Input: state st, a stack of the last 4 game frames (after RGB -> grayscale conversion, downsampling, and cropping).

Architecture, from input to output:

- 16 8x8 conv filters, stride 4
- 32 4x4 conv filters, stride 2
- FC-256
- FC-4 (Q-values)

Familiar conv and FC layers. The last FC layer has a 4-dimensional output (if there are 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4). The number of actions is between 4 and 18, depending on the Atari game.

A single feedforward pass computes the Q-values for all actions from the current state => efficient! (A sketch of the architecture follows below.)
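A sketch of this architecture as a PyTorch module, assuming the 84x84x4 preprocessed input used by Mnih et al.; the class name and the flattened size are mine (the 32 * 9 * 9 follows from the conv arithmetic, not from the slide).

```python
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    """Q(s, a; theta): maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 conv, stride 4 -> 16 x 20 x 20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 conv, stride 2 -> 32 x 9 x 9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # FC-n_actions (Q-values)
        )

    def forward(self, x):
        # One forward pass yields the Q-values for all actions at once.
        return self.head(self.features(x))

# Tiny check with a random batch of 2 "states" (each a 4 x 84 x 84 frame stack).
q = AtariQNet(n_actions=4)
print(q(torch.randn(2, 4, 84, 84)).shape)  # torch.Size([2, 4])
```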

Training the Q-network: loss function (from before) [Mnih et al. NIPS Workshop 2013; Nature 2015]

The forward pass, loss, and backward pass are exactly the ones above: iteratively make Q(s, a; θ) close to the target value yi it should have if the Q-function corresponds to the optimal Q* (and the optimal policy π*), and update θ with the gradient of the squared error.

Training the Q-network: experience replay [Mnih et al. NIPS Workshop 2013; Nature 2015]

Learning from batches of consecutive samples is problematic:

- Samples are correlated => inefficient learning
- The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:

- Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

Each transition can also contribute to multiple weight updates => greater data efficiency. (A minimal replay-memory sketch follows below.)
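A minimal replay-memory sketch in Python. The fixed capacity and uniform sampling are the standard choices; the class name and the sizes in the usage lines are illustrative, not from the lecture.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity table of transitions (s_t, a_t, r_t, s_{t+1}, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniformly sample a random minibatch, breaking the correlation of consecutive samples.
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # one tuple per field: states, actions, rewards, next states, dones

    def __len__(self):
        return len(self.buffer)

# Usage: store transitions as episodes are played, then train on random minibatches.
memory = ReplayMemory(capacity=1000)
memory.push(s=0, a=1, r=-0.1, s_next=1, done=False)
memory.push(s=1, a=0, r=1.0, s_next=0, done=True)
print(memory.sample(2))
```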

Putting it together: deep Q-learning with experience replay [Mnih et al. NIPS Workshop 2013; Nature 2015]

The full algorithm proceeds as follows:

- Initialize the replay memory and the Q-network weights
- Play full game episodes; initialize the state (starting game screen pixels) at the beginning of each episode
- For each time step t of the game:
  - With small probability select a random action (explore), otherwise select the greedy action from the current policy
  - Take the action and observe the reward rt and next state st+1
  - Store the transition (st, at, rt, st+1) in the replay memory
  - Experience replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step

A runnable end-to-end sketch of this control flow follows below.
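To show this loop end to end without the Atari machinery, here is a simplified tabular analogue: a Q-table instead of a deep network, epsilon-greedy exploration, and a small replay list, on a made-up 5-state chain environment. Everything about the environment and the hyperparameters is hypothetical, chosen only to keep the example tiny.

```python
import random
import numpy as np

# Made-up chain environment: states 0..4; reaching state 4 gives +1 and ends the episode.
N_S, N_A, GAMMA, EPS, LR = 5, 2, 0.9, 0.1, 0.5

def env_step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_S - 1, s + 1)
    done = (s_next == N_S - 1)
    return (1.0 if done else -0.01), s_next, done

Q = np.zeros((N_S, N_A))
replay = []                                         # replay memory of (s, a, r, s', done)

for episode in range(200):                          # play full episodes
    s = 0                                           # initialize state at the start of each episode
    for t in range(100):                            # for each time step of the episode
        # Epsilon-greedy: explore with small probability, otherwise act greedily.
        a = random.randrange(N_A) if random.random() < EPS else int(Q[s].argmax())
        r, s_next, done = env_step(s, a)            # take the action, observe r_t and s_{t+1}
        replay.append((s, a, r, s_next, done))      # store the transition in replay memory
        s = s_next

        # Experience replay: update on a random minibatch of stored transitions.
        for bs, ba, br, bs_next, bdone in random.sample(replay, min(4, len(replay))):
            target = br + (0.0 if bdone else GAMMA * Q[bs_next].max())
            Q[bs, ba] += LR * (target - Q[bs, ba])

        if done:
            break

print(np.round(Q, 2))
print(Q.argmax(axis=1))   # greedy action per state: move right along the chain
```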

Video: https://www.youtube.com/watch?v=V1eYniJ0Rnk

Video by Károly Zsolnai-Fehér. Reproduced with permission.

Policy Gradients

What is a problem with Q-learning? The Q-function can be very complicated!

Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair.

But the policy can be much simpler: just close your hand.

Can we learn a policy directly, e.g. finding the best policy from a collection of policies?

Policy Gradients

Formally, let's define a class of parametrized policies:
$$\Pi = \{\pi_\theta, \; \theta \in \mathbb{R}^m\}$$

For each policy, define its value:
$$J(\theta) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid \pi_\theta\right]$$

We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$. How can we do this? Gradient ascent on the policy parameters!

REINFORCE algorithm

Mathematically, we can write the expected reward as
$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[r(\tau)\big] = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau,$$
where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots)$.

Now let's differentiate this:
$$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau;\theta)\, d\tau$$
Intractable! The gradient of an expectation is problematic when p depends on θ.

However, we can use a nice trick:
$$\nabla_\theta p(\tau;\theta) = p(\tau;\theta)\, \frac{\nabla_\theta p(\tau;\theta)}{p(\tau;\theta)} = p(\tau;\theta)\, \nabla_\theta \log p(\tau;\theta)$$

If we inject this back:
$$\nabla_\theta J(\theta) = \int_\tau \big(r(\tau)\, \nabla_\theta \log p(\tau;\theta)\big)\, p(\tau;\theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\big],$$
which we can estimate with Monte Carlo sampling.

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities?

We have:
$$p(\tau;\theta) = \prod_{t \geq 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

Thus:
$$\log p(\tau;\theta) = \sum_{t \geq 0} \log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)$$

And when differentiating:
$$\nabla_\theta \log p(\tau;\theta) = \sum_{t \geq 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
This doesn't depend on the transition probabilities!

Therefore, when sampling a trajectory τ, we can estimate $\nabla_\theta J(\theta)$ with
$$\nabla_\theta J(\theta) \approx \sum_{t \geq 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Intuition

Gradient estimator:
$$\nabla_\theta J(\theta) \approx \sum_{t \geq 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Interpretation:

- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen

It might seem simplistic to say that if a trajectory is good then all of its actions were good. But in expectation, it averages out!

However, this also suffers from high variance, because credit assignment is really hard. Can we help the estimator? (A minimal implementation sketch follows below.)
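A minimal REINFORCE sketch in PyTorch for a softmax policy over two discrete actions. The toy task, the network size, and the practice of minimizing the negative surrogate objective (so that autograd produces exactly the estimator above) are illustrative choices, not code from the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up episodic task: 4-dim random states; reward 1 when the action matches the sign of state[0].
def run_episode(policy, horizon=10):
    log_probs, rewards = [], []
    for _ in range(horizon):
        s = torch.randn(4)
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))                   # log pi_theta(a_t | s_t)
        rewards.append(1.0 if a.item() == int(s[0] > 0) else 0.0)
    return torch.stack(log_probs), rewards

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for it in range(200):
    log_probs, rewards = run_episode(policy)
    r_tau = sum(rewards)                                      # total trajectory reward r(tau)
    # REINFORCE estimator: grad J ~ sum_t r(tau) * grad log pi(a_t | s_t).
    loss = -(r_tau * log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("reward of final episode:", r_tau, "/ 10")
```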

Variance reduction

Gradient estimator:
$$\nabla_\theta J(\theta) \approx \sum_{t \geq 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

First idea: push up the probability of an action only by the cumulative future reward from that state:
$$\nabla_\theta J(\theta) \approx \sum_{t \geq 0} \Big(\sum_{t' \geq t} r_{t'}\Big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Second idea: use a discount factor γ to ignore delayed effects:
$$\nabla_\theta J(\theta) \approx \sum_{t \geq 0} \Big(\sum_{t' \geq t} \gamma^{t'-t}\, r_{t'}\Big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
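A short helper for the discounted cumulative future reward ("reward-to-go") used in the second estimator above; the function name is my own.

```python
def discounted_rewards_to_go(rewards, gamma=0.99):
    """Return, for each t, sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed right to left."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_rewards_to_go([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```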

Variance reduction: Baseline

Problem: the raw value of a trajectory isn't necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of actions.

What is important then? Whether a reward is better or worse than what you expect to get.

Idea: introduce a baseline function dependent on the state. Concretely, the estimator is now:
$$\nabla_\theta J(\theta) \approx \sum_{t \geq 0} \Big(\sum_{t' \geq t} \gamma^{t'-t}\, r_{t'} - b(s_t)\Big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

How to choose the baseline?

A simple baseline: a constant moving average of rewards experienced so far, from all trajectories. This variance-reduced estimator is what is usually called vanilla REINFORCE.

How to choose the baseline?

A better baseline: we want to push up the probability of an action from a state if this action was better than the expected value of what we should get from that state.

What does this remind you of? Q-functions and value functions!

Intuitively, we are happy with an action at in a state st if $Q^\pi(s_t, a_t) - V^\pi(s_t)$ is large. On the contrary, we are unhappy with an action if it's small.

Actor-Critic Algorithm

Problem: we don't know Q and V. Can we learn them?

Yes, using Q-learning! We can combine policy gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).

- The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust
- This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy
- Can also incorporate Q-learning tricks, e.g. experience replay
- Remark: we can define the advantage function, $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, which measures how much an action was better than expected

Actor-Critic Algorithm

Initialize the policy parameters θ and the critic parameters φ.
For iteration = 1, 2, ... do:
    Sample m trajectories under the current policy.
    For i = 1, ..., m do:
        For t = 1, ..., T do:
            accumulate the policy gradient, weighting each log-probability term by the critic's advantage estimate, and accumulate the critic's gradient.
        End for
    End for
    Update θ by gradient ascent and φ by gradient descent.
End for

(A minimal single-step sketch follows below.)
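A minimal single-transition actor-critic update in PyTorch, using a learned state-value critic and the TD error as the advantage estimate. The architecture, the made-up transition, and the TD-error-as-advantage simplification are my own illustrative choices, not the exact update from the lecture.

```python
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # policy pi_theta(a | s)
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))   # value V_phi(s)
opt_actor  = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

# One made-up transition (s, a, r, s_next), just to exercise the update.
s, s_next = torch.randn(4), torch.randn(4)
r = torch.tensor(1.0)

dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()

# Critic: regress V(s) toward the one-step target r + gamma * V(s').
v_s, v_next = critic(s), critic(s_next).detach()
td_target = r + gamma * v_next
advantage = (td_target - v_s).detach()            # TD error, used here as the advantage estimate
critic_loss = (td_target - v_s).pow(2).mean()
opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

# Actor: push log pi(a | s) up or down in proportion to the advantage.
actor_loss = -(advantage * dist.log_prob(a)).mean()
opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

print(float(critic_loss), float(actor_loss))
```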

REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014]

Objective: image classification, by taking a sequence of "glimpses" that selectively focus on regions of the image, and using them to predict the class.

- Inspired by human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image

State: the glimpses seen so far
Action: (x, y) coordinates (center of the glimpse) of where to look next in the image
Reward: 1 at the final time step if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.

REINFORCE in action: Recurrent Attention Model (RAM)

(Figure: the model unrolled over five glimpses of the input image. At each step the recurrent network takes the current glimpse and outputs the next glimpse location, (x1, y1) through (x5, y5); after the final glimpse, a softmax over class labels predicts the class, e.g. y = 2.)

REINFORCE in action: Recurrent Attention Model (RAM)

RAM has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question answering!

More policy gradients: AlphaGo [Silver et al., Nature 2016]

Overview:

- Mix of supervised learning and reinforcement learning
- Mix of older methods (Monte Carlo Tree Search) and recent ones (deep RL)
- Featurize the board (stone color, move legality, bias, ...)
- Initialize the policy network with supervised training from professional Go games, then continue training using policy gradients (play against itself from random previous iterations; +1 / -1 reward for winning / losing)
- Also learn a value network (critic)
- Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search

Summary

- Policy gradients: very general, but they suffer from high variance, so they require a lot of samples. Challenge: sample efficiency
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration

Guarantees:

- Policy gradients: converge to a local optimum of J(θ), which is often good enough!
- Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator

Next Time

Guest Lecture: Song Han

- Energy-efficient deep learning
- Deep learning hardware
- Model compression
- Embedded systems
- And more...
