
CS6700 : Reinforcement Learning

Written Assignment #1
Intro to RL and Bandits
Deadline: 16 Feb 2018, 11:55 pm
• This is an individual assignment. Collaboration and discussion are strictly prohibited.
• Be precise with your explanations. Unnecessary verbosity will be penalized.
• Check the Moodle discussion forums regularly for updates regarding the assignment.
• Please start early.
• Turn in only the answers on Turnitin.

1. The results shown in Figure 2.3 (of the course textbook uploaded on Moodle) should be
quite reliable because they are averages over 2000 individual, randomly chosen 10-armed
bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve
for the optimistic method? In other words, what might make this method perform
particularly better or worse, on average, on particular early steps?
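For reference when thinking about this question, here is a minimal sketch of the optimistic-initial-value method on a 10-armed testbed. The setup (purely greedy selection, optimistic initial estimates, constant step size) follows the textbook's description of the method, but the specific parameter values and function names are illustrative assumptions, not the exact experiment behind the figure.

    import numpy as np

    def optimistic_greedy_run(steps=1000, k=10, q_init=5.0, alpha=0.1, seed=0):
        # One run of greedy action selection with optimistic initial estimates.
        rng = np.random.default_rng(seed)
        q_true = rng.normal(0.0, 1.0, k)     # true action values, fixed for this run
        Q = np.full(k, q_init)               # optimistic initial estimates
        best = int(np.argmax(q_true))
        picked_best = np.zeros(steps, dtype=bool)
        for t in range(steps):
            a = int(np.argmax(Q))            # purely greedy choice
            r = rng.normal(q_true[a], 1.0)   # noisy reward
            Q[a] += alpha * (r - Q[a])       # constant-step-size update
            picked_best[t] = (a == best)
        return picked_best

    # Averaging many independent runs produces a curve of the kind the question refers to.
    curve = np.mean([optimistic_greedy_run(seed=s) for s in range(200)], axis=0)

Watching which arms the greedy choice cycles through in the first few steps of a single run is a useful way to see where the early spikes come from.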

2. Many tic-tac-toe positions appear different but are really the same because of symmetries.
How might we amend the learning process described in the tic-tac-toe example of the
course text to take advantage of this? In what ways would this change improve the
learning process?
Now think again. Suppose the opponent did not take advantage of symmetries. In
that case, should we? Is it true, then, that symmetrically equivalent positions should
necessarily have the same value?
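As a concrete illustration of the first part, one way to exploit the symmetries is to key the value table by a canonical representative of each position, so that all eight symmetric variants of a board share a single entry. The sketch below is only one possible implementation; the board encoding (a 9-tuple in row-major order) and the function names are assumptions made for illustration.

    def rot90(b):
        # Rotate a 3x3 board (9-tuple, row-major) 90 degrees clockwise.
        return tuple(b[3 * (2 - c) + r] for r in range(3) for c in range(3))

    def reflect(b):
        # Mirror the board left-to-right.
        return tuple(b[3 * r + (2 - c)] for r in range(3) for c in range(3))

    def canonical(b):
        # Lexicographically smallest of the 8 symmetric variants; any fixed rule works.
        variants = []
        cur = b
        for _ in range(4):
            variants.extend([cur, reflect(cur)])
            cur = rot90(cur)
        return min(variants)

    # Value estimates keyed by canonical form: symmetric positions share one estimate.
    values = {}
    board = ('X', ' ', ' ', ' ', 'O', ' ', ' ', ' ', ' ')
    values.setdefault(canonical(board), 0.5)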

3. Suppose, instead of playing against a random opponent, the reinforcement learning
algorithm described in the tic-tac-toe example of the course text played against itself,
with both sides learning. What do you think would happen in this case? Would it learn
a different policy for selecting moves?

4. RL systems do not have to be 'taught' by knowledgeable 'teachers'; they learn from
their own experiences. But teachers of various types can still be helpful. Describe two
different ways in which a teacher might facilitate RL. For each, explain how it can make
learning more efficient.

5. Consider a multi-armed bandit setup where the horizon is T = 10000 time steps and the
number of arms is K = 100. After every 1000 time steps the reward distribution changes.
For example, for t = 1 to 1000, arm a_10 is the optimal arm, while from t = 1001 to 2000,
arm a_80 becomes the optimal arm. The distributions of the other arms also change. A
point at which the distribution changes is termed a breakpoint, so in this environment
there are 9 breakpoints.

• Between UCB1 and the softmax algorithm, which one would you choose in this setting?
Justify your answer.
• Devise a modified UCB1 algorithm that will work in this setting and justify your
choice. Will it work better than softmax? Why? (Hint: use your ideas from MDPs.)
An illustrative sliding-window sketch is given below for reference.
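Purely as a reference point, and not necessarily the intended answer, the sketch below modifies UCB1 by computing its statistics only over a sliding window of recent plays. The window length, exploration constant, and the assumed pull(a) reward interface are illustrative assumptions.

    import math
    from collections import deque

    def sliding_window_ucb1(pull, K=100, T=10000, window=500, c=2.0):
        # UCB1 whose counts and means use only the last `window` plays.
        # `pull(a)` is assumed to return a reward in [0, 1] for arm a.
        history = deque()        # (arm, reward) pairs currently inside the window
        counts = [0] * K         # plays of each arm inside the window
        sums = [0.0] * K         # reward totals of each arm inside the window

        def index(a):
            if counts[a] == 0:   # unseen (or forgotten) arms get top priority
                return float('inf')
            mean = sums[a] / counts[a]
            return mean + math.sqrt(c * math.log(len(history)) / counts[a])

        for t in range(T):
            a = max(range(K), key=index)
            r = pull(a)
            history.append((a, r))
            counts[a] += 1
            sums[a] += r
            if len(history) > window:        # forget the oldest play
                old_a, old_r = history.popleft()
                counts[old_a] -= 1
                sums[old_a] -= old_r

Forgetting old rewards lets an arm's confidence bound widen again after a breakpoint, so the algorithm can rediscover the new optimal arm instead of staying anchored to stale estimates.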

6. Define a bandit setup as follows. At each time instant, a reward is sampled for each
arm of the bandit from some unknown distribution. The agent then picks an arm, and
the environment reveals the rewards of all the arms. Regret is now defined as the
difference between the reward of the best arm at that instant and that of the chosen arm,
summed over all time steps. Would the existing algorithms for bandit problems work
well in this setting? Can we do better by taking advantage of the fact that all rewards
are revealed? For example, exploration is no longer an issue, since the rewards of all
arms are revealed at every time step.
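For concreteness, with r_t(a) denoting the reward revealed for arm a at time step t and a_t the arm the agent picked (notation introduced here only for illustration), the regret described above is

    R_T = \sum_{t=1}^{T} ( \max_{a \in \{1, \dots, K\}} r_t(a) - r_t(a_t) ).

Note that the maximum is taken separately at every time step, not over a single fixed arm.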

7. Suppose you face a 2-armed bandit task whose true action values change randomly from
time step to time step. Specifically, suppose that, for any time step, the true values of
actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and
0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any
step, what is the best expectation of success you can achieve and how should you behave
to achieve it? Now suppose that on each step you are told whether you are facing case
A or case B (although you still don't know the true action values). This is an associative
search task. What is the best expectation of success you can achieve in this task, and
how should you behave to achieve it?
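For concreteness (the notation here is introduced only for illustration), write q_A(i) and q_B(i) for the true values of action i in cases A and B as given above. Without case information, a policy that picks action i with probability p_i earns a per-step expected reward of

    \sum_i p_i ( 0.5 q_A(i) + 0.5 q_B(i) ),

whereas with case information a policy \pi maps the revealed case to an action and earns 0.5 q_A(\pi(A)) + 0.5 q_B(\pi(B)). The question asks, in each setting, for the maximum of this quantity and the behaviour that attains it.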

