
LEARNING IN GAMES

A STARTING QUESTION

There exist various equilibrium concepts


Iterated elimination of dominated strategies
Nash equilibrium and its refinements

Sub-game perfect equilibrium

A starting question for our investigation is

How do players reach the equilibrium that theory predicts? How do they figure out what to do?

AN IMPORTANT DISTINCTION

Games can be one-shot

Played only once

Or they can be repeated


Played many times
Different possible rules (e.g. rematching among players or not)

Learning models apply to repeated games

Chance for learning

GOAL OF THIS LECTURE

Provide an overview of the most important learning models


Underlying assumptions
How does learning take place
Empirical validity

THE VERY BEGINNING

Evolutionary approaches
Assumption: each player is born with a strategy and uses it
No proper learning: strategies that are more successful increase players' fitness and enable the player to survive longer or reproduce more
Empirical validity: unable to explain the rapid learning observed in the lab; better suited to studying animal behavior or cultural transmission


First attempt to look at the process behind strategy selection

LEARNING FROM OTHERS

Belief learning models


A way of learning one's way to the Nash equilibrium
Players update beliefs about what others will do based on history, and use those beliefs to decide which strategies are best
Assumption: opponents are playing stationary (possibly mixed) strategies

Fictitious play (1951)


Keep track of the relative frequency of play of each of the other player's strategies
Begin with an initial or prior sample
Use play frequencies as beliefs to compute the expected value of alternative strategies
Play a best response to historical frequencies
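As an illustration, the four steps above can be sketched in a few lines of Python (the matching-pennies-style payoffs and prior counts below are made up for the example):

```python
import numpy as np

def fictitious_play_step(payoffs, counts):
    """One step of classic fictitious play for the row player.

    payoffs: payoffs[i, j] = row payoff when row plays i and column plays j
    counts:  historical counts of each column strategy (including any
             initial or prior sample)
    Returns the row strategy that best responds to historical frequencies.
    """
    beliefs = counts / counts.sum()           # play frequencies used as beliefs
    expected = payoffs @ beliefs              # expected value of each row strategy
    return int(np.argmax(expected))           # best response to those beliefs

# Hypothetical matching-pennies payoffs for the row player:
payoffs = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])
counts = np.array([3.0, 1.0])                 # opponent played strategy 0 three times
print(fictitious_play_step(payoffs, counts))  # prints 0: best response to beliefs (0.75, 0.25)
```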

LEARNING FROM OTHERS

Brown (1951) first introduced fictitious play (FP)


The name is due to the fact that Brown wanted to use it as an explanation for Nash equilibrium: he imagined that a player would "simulate" play of the game in his mind and update his future play based on this simulation
Not proper learning, but it brought attention to the process of reaching an equilibrium

Problem with classic fictitious play

It ignores when the other player made his choices in the past

Weighted fictitious play

Introduces a weight on past observations that decreases the further back in time they were collected

WEIGHTED FICTITIOUS PLAY

Fictitious play beliefs (reconstructed in the standard weighted form):

Bij(t+1) = [ I(aj, t) + Σ(τ=1..t−1) γ^τ · I(aj, t−τ) ] / [ 1 + Σ(τ=1..t−1) γ^τ ]

where
Bij(t+1) is the belief player i has that the opponent will play strategy aj at time t+1
γ^τ is the weight given to an observation of strategy aj collected τ periods ago
I(aj, t) is an indicator function, = 1 if strategy aj was played at time t
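A minimal sketch of this belief computation, assuming geometric discounting of older observations (the history and the value of γ below are hypothetical; γ = 1 recovers classic fictitious play):

```python
def weighted_beliefs(history, n_strategies, gamma):
    """Weighted fictitious play beliefs about the opponent.

    history:      list of the opponent's past strategy indices, oldest first
    n_strategies: number of strategies the opponent has
    gamma:        discount on older observations (gamma = 1: classic FP)
    Returns the belief that each strategy will be played next period.
    """
    weights = [0.0] * n_strategies
    for lag, a in enumerate(reversed(history)):  # lag 0 = most recent play
        weights[a] += gamma ** lag               # older plays weigh less
    total = sum(weights)
    return [w / total for w in weights]

# Opponent played 0, 0, 1 (most recent last); discount gamma = 0.5
print(weighted_beliefs([0, 0, 1], 2, 0.5))
```

With γ = 0.5 the single most recent observation (strategy 1) outweighs the two older plays of strategy 0, so the belief on strategy 1 exceeds one half.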

WEIGHTED FICTITIOUS PLAY IN ACTION

In the given game

ISSUES WITH FICTITIOUS PLAY

Empirical evidence does not support use of FP

Nyarko and Schotter (2002) look at (stated) beliefs and actual play

Beliefs are elicited in an incentive-compatible way
Behavior is not the one predicted by fictitious play
Given stated beliefs, FP makes predictions that are not empirically observed

It seems people are not very good at predicting others' behavior
People don't learn from others' choices!

THE FOCUS SHIFTS ON OWN EXPERIENCE

Some basic rules


Law of effect (Thorndike 1898)

Choices that have led to good outcomes are more likely to be repeated in the future

Power law of practice (Blackburn 1936)

Learning curves tend to be steep initially, then flatter

THE FOCUS SHIFTS ON OWN EXPERIENCE

Reinforcement models
Assumption: strategies are reinforced by their previous payoffs (and somewhat by neighbouring ones)
Learning: update the attraction you feel for a particular strategy by cumulating past payoffs (including spillovers)
Empirical validity: best for describing players with imperfect reasoning ability (animals)

Does not in any way account for others' choices or for foregone payoffs in the learning process

AN EXAMPLE OF REINFORCEMENT LEARNING

Define qnk(t) as the propensity of player n to play pure strategy k
Assume that initially players have the same propensity to play all of their pure strategies
The choice rule is probabilistic:

pnk(t) = qnk(t) / Σj qnj(t)

Reinforcement function: R(x) = x − xmin

x is the payoff of the chosen move; xmin is the smallest available payoff

Negative reinforcement is thus avoided

AN EXAMPLE OF REINFORCEMENT LEARNING

Cumulative update of propensities (n is the player, k is the strategy just played):

qnj(t+1) = qnj(t) + R(x)   if j = k (the chosen strategy is reinforced)
qnj(t+1) = qnj(t)          otherwise (the propensity towards that strategy is unchanged)

Corresponding update of probabilities:

pnj(t) = qnj(t) / Σ(k∈S) qnk(t)
AN EXAMPLE OF REINFORCEMENT LEARNING

Consider the following game structure

       h        l
H     9, 7    11, 2
L     5, 3     4, 6

Let's assume that

qrh = qrl = 10 and qch = qcl = 10
Players play each strategy with the following probability:

prh = prl = 10/(10+10) = 0.5
pch = pcl = 10/(10+10) = 0.5

At t = 1 the profile (H, l) is realized

AN EXAMPLE OF REINFORCEMENT LEARNING

Recall
qrh = qrl = 10 and qch = qcl = 10, and all p = 0.5
At t = 1 the profile (H, l) is realized

       h        l
H     9, 7    11, 2
L     5, 3     4, 6

Reinforcement learning implies

qrh(t+1) = 10 + (11 − 4) = 17, qrl = 10
qcl(t+1) = 10 + (2 − 2) = 10, qch = 10

The column player gets the minimum he could have gotten: there is no negative reinforcement, so at most there is NO reinforcement

Probabilities are modified: the law of effect!

prh(t+1) = 17/(10+17) = 0.63
prl(t+1) = 10/(10+17) = 0.37
pch(t+1) = 10/20 = 0.5
pcl(t+1) = 10/20 = 0.5

No reinforcement implies no modification in probabilities
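The worked example can be checked with a short simulation of the cumulative Roth-Erev update described above (the propensities and payoffs are those from the slides):

```python
def roth_erev_update(q, payoff, min_payoff, chosen):
    """Cumulative Roth-Erev update: only the chosen strategy is reinforced,
    by R(x) = x - x_min, so reinforcement is never negative."""
    q = dict(q)
    q[chosen] += payoff - min_payoff
    return q

# Initial propensities: both players start at 10 for each strategy
q_row = {'H': 10.0, 'L': 10.0}
q_col = {'h': 10.0, 'l': 10.0}

# At t = 1 the realized profile is (H, l): row earns 11, column earns 2
q_row = roth_erev_update(q_row, payoff=11, min_payoff=4, chosen='H')  # row's smallest payoff is 4
q_col = roth_erev_update(q_col, payoff=2, min_payoff=2, chosen='l')   # column's smallest payoff is 2

# Law of effect: probabilities shift toward the reinforced strategy
p_row = {k: v / sum(q_row.values()) for k, v in q_row.items()}
print(q_row)                 # {'H': 17.0, 'L': 10.0}
print(round(p_row['H'], 2))  # 0.63, as on the slide
```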

EMPIRICAL EVIDENCE ON REINFORCEMENT

There are more complicated versions of the RL model, which also include an experimentation and a forgetting parameter
Experimental studies (Erev and Roth 1995, 1998) show that
RL from own payoffs is able to approximate the direction of learning, but in some games only
Learning is slow

Players do not learn from counterfactual reasoning

Options that are never explored never teach anything!

COMBINING EXPERIENCE AND BELIEFS

Belief learning assumes players ignore information about their own past choices

They only look at what others did!

Reinforcement learning assumes players ignore information about foregone payoffs

never learn anything from choices not taken

New wave of models combining both types of information

EXPERIENCE-WEIGHTED ATTRACTION MODEL

Main features

The attraction of a strategy depends not only on its experienced payoff (when actually played), but also on its counterfactual payoff (the payoff it would have earned when not played)

Two variables updated after every period


attractions Aij(t)
experience weight N(t)

EWA IN ACTION

The attraction of a given strategy is determined by a weight on past attractions plus a weighted payoff term:

Aij(t) = { φ·N(t−1)·Aij(t−1) + [δ + (1−δ)·I(j, si(t))]·πi(j, s−i(t)) } / N(t)

where I(j, si(t)) = 1 if player i actually chose strategy j at time t

EWA: WEIGHT OF PAST ATTRACTIONS

The first term, φ·N(t−1)·Aij(t−1) / N(t), carries past attractions forward, where

φ is a forgetting parameter (for past attractions)
N(t) is the experience weight, updated by N(t) = ρ·N(t−1) + 1, so that it is weakly increasing
κ (with ρ = φ·(1−κ)) controls how attractions grow and indicates whether experience is cumulated rather than averaged: κ = 1 cumulated, κ = 0 averaged

EWA: WEIGHTED PAYOFF TERM

The second term is [δ + (1−δ)·I(j, si(t))]·πi(j, s−i(t)), where πi(j, s−i(t)) is the payoff of playing a given strategy j against the opponents' realized play

This term represents reinforcement and implies that

all strategies are updated by δ times the payoff they would have yielded, even when not chosen
the chosen strategy is further updated by (1−δ) times its payoff
when δ = 0, only experienced payoffs matter

δ represents the weight placed on foregone payoffs
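Putting the two terms together, one EWA attraction update can be sketched as follows (the parameter values for φ, δ, ρ and the initial attractions below are illustrative, not estimates):

```python
def ewa_update(A, N, payoffs, chosen, phi, delta, rho):
    """One EWA attraction update (Camerer & Ho, 1999).

    A:       current attractions, one per own strategy
    N:       current experience weight
    payoffs: payoffs[j] = payoff strategy j would have earned against the
             opponents' realized play this period (foregone payoffs included)
    chosen:  index of the strategy actually played
    phi:     forgetting parameter on past attractions
    delta:   weight on foregone payoffs
    rho:     depreciation of the experience weight
    """
    N_new = rho * N + 1.0                                  # N(t) = rho*N(t-1) + 1
    A_new = []
    for j, pi in enumerate(payoffs):
        w = delta + (1.0 - delta) * (1.0 if j == chosen else 0.0)
        A_new.append((phi * N * A[j] + w * pi) / N_new)    # weighted payoff term added
    return A_new, N_new

# Row player from the slides' game: against column's play of l,
# H would have earned 11 and L would have earned 4 (H was chosen).
A, N = ewa_update([0.0, 0.0], 1.0, payoffs=[11.0, 4.0], chosen=0,
                  phi=0.9, delta=0.5, rho=0.9)
print(A, N)
```

Note that the unchosen strategy L still gains attraction (δ times its foregone payoff), which is exactly what pure reinforcement learning would miss.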

FROM ATTRACTIONS TO PROBABILITIES

Attractions are translated into probabilities (of playing the different strategies) by the following logit response rule:

pij(t+1) = exp(λ·Aij(t)) / Σk exp(λ·Aik(t))

This portrays noisy players who are not always able to pick the best option
λ indicates the amount of noise in behavior
For low levels of λ there are many discrimination mistakes: strategies are played almost at random
For high levels of λ there is better discrimination between attractions
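A minimal sketch of the logit response rule, showing how λ governs noise (the attraction values are made up):

```python
import math

def logit_probs(attractions, lam):
    """Logit response rule: p_j proportional to exp(lam * A_j).
    Low lam: near-random play; high lam: sharp discrimination."""
    exps = [math.exp(lam * a) for a in attractions]
    total = sum(exps)
    return [e / total for e in exps]

A = [2.0, 1.0]
print(logit_probs(A, 0.1))  # close to uniform: noisy play
print(logit_probs(A, 5.0))  # almost all weight on the higher attraction
```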

EWA PREDICTIONS AND OTHER MODELS

EWA is able to reproduce results of other models


δ: weight on foregone payoffs
φ: forgetting parameter for past attractions
κ: indicates if experience is cumulated rather than averaged

Special cases: δ = 0 yields cumulative reinforcement learning; δ = 1 and κ = 0 yields weighted fictitious play

THE EWA CUBE AND OTHER MODELS

EWA ACCURACY IN PREDICTION

The EWA model is able to correctly predict choices across many games, especially when we look at out-of-sample behavior

We use half of the observations to tune the model and use the tuned model to predict the rest of the data

Example: Continental divide game


Decide where to locate your firm on a scale from 1 to 14 (1 is closer to Hollywood, 14 is closer to Silicon Valley), assuming you make a product that applies to both sectors
Players' payoffs depend on their own choice and the median choice

CONTINENTAL DIVIDE GAME

Two Pareto-ranked Nash equilibria
Convergence to one or the other depends on the starting point

EWA VS OTHER MODELS

All models capture bifurcating behavior but


Problems with reinforcement learning: No braking/acceleration (Too slow!)

Problems with belief learning: No asymmetric convergence (No effects of payoff differences in equilibria)

EWA model is the best at describing behavior in the continental divide game

A NEW VERSION OF EWA

Technically, EWA has four parameters

δ, φ, κ, λ
But with four parameters you can fit pretty much anything!

Criticism: maybe EWA overfits the data

Self-tuning EWA: a new version of EWA with fewer parameters (κ = 0, with φ and δ endogenously determined, so that only λ is left to estimate)

Performance of the self-tuning EWA is pretty much the same as the standard EWA

READINGS
Elective reading
Colin Camerer and Teck-Hua Ho, "Experience-Weighted Attraction Learning in Normal Form Games," Econometrica, Vol. 67, No. 4 (Jul. 1999), pp. 827-874, http://www.jstor.org/stable/2999459

Mandatory reading
Selected pages from Camerer, Behavioral Game Theory, ch. 6 (pp. 199-204, 209-221, 257)
Extract from Camerer, Chong, Ho (handed out in class)