Figure 1: Canonical matrix game social dilemmas. Left: Outcome variables R, P, S, and T are mapped to cells of the game matrix. Right: The three canonical matrix game social dilemmas. By convention, a cell of X, Y represents a utility of X to the row player and Y to the column player. In Chicken, agents may defect out of greed. In Stag Hunt, agents may defect out of fear of a non-cooperative partner. In Prisoner’s Dilemma, agents are motivated to defect out of both greed and fear simultaneously.
V_i^{\vec{\pi}}(s_0) = \mathbb{E}_{\vec{a}_t \sim \vec{\pi}(O(s_t)),\, s_{t+1} \sim \mathcal{T}(s_t, \vec{a}_t)} \left[ \sum_{t=0}^{\infty} \gamma^t \, r_i(s_t, \vec{a}_t) \right]. \tag{5}
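The expectation in Eq. (5) can be approximated by averaging discounted returns over sampled episodes. A minimal Monte Carlo sketch; the function names and the episode format (a list of per-step rewards for agent i) are our own illustration, not from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of one agent's per-step rewards along a sampled
    trajectory: the quantity inside the expectation in Eq. (5)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def monte_carlo_value(sample_episode, gamma=0.99, n_episodes=1000):
    """Estimate V_i(s0) by averaging discounted returns over episodes
    sampled from the joint policy and the transition function T."""
    returns = [discounted_return(sample_episode(), gamma)
               for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```

With a deterministic episode generator such as `lambda: [1.0, 1.0]` and gamma = 0.5, every rollout has return 1.0 + 0.5 = 1.5, so the estimate reduces to that value.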
should they be expected to capture the full range of human individual differences in laboratory social dilemmas. Nevertheless, it is interesting to consider just how far one can go down this road of modeling Social Psychology hypotheses using such simple learning agents.⁶ Recall also that DQN is in the class of reinforcement learning algorithms that is generally considered to be the leading candidate theory of animal habit-learning [50, 42]. Thus, the interpretation of our model is that it only addresses whatever part of cooperative behavior arises “by habit” as opposed to conscious deliberation.

Experimental manipulations of DQN parameters yield consistent and interpretable effects on emergent social behavior. Each plot in Fig. 7 shows the relevant social behavior metric, conflict for Gathering and lone-wolf behavior for Wolfpack, as a function of an environment parameter: Napple, Ntagged (Gathering) and rteam/rlone (Wolfpack). The figure shows that in both games, agents with greater discount parameter (less time discounting) more readily defect than agents that discount the future more steeply. For Gathering this likely occurs because the defection policy of tagging the other player to temporarily remove them from the game only provides a delayed reward in the form of the increased opportunity to collect apples without interference. However, when abundance is very high, even the agents with higher discount factors do not learn to defect. In such paradisiacal settings, the apples respawn so quickly that an individual agent cannot collect them quickly enough. As a consequence, there is no motivation to defect regardless of the temporal discount rate.

Manipulating the size of the stored-and-constantly-refreshed batch of experience used to train each DQN agent has the opposite effect on the emergence of defection. Larger batch size translates into more experience with the other agent’s policy. For Gathering, this means that avoiding being tagged becomes easier. Evasive action benefits more from extra experience than the ability to target the other agent. For Wolfpack, larger batch size allows greater opportunity to learn to coordinate to jointly catch the prey.

Possibly the most interesting effect on behavior comes from the number of hidden units in the neural network behind the agents, which may be interpreted as their cognitive capacity. Curves for tendency to defect are shown in the right column of Fig. 7, comparing two different network sizes. For Gathering, an increase in network size leads to an increase in the agent’s tendency to defect, whereas for Wolfpack the opposite is true: greater network size leads to less defection.

This can be explained as follows. In Gathering, defection behavior is more complex and requires a larger network size to learn than cooperative behavior. This is the case because defection requires the difficult task of targeting the opposing agent with the beam, whereas peacefully collecting apples is almost independent of the opposing agent’s behavior. In Wolfpack, cooperation behavior is more complex and requires a larger network size because the agents need to coordinate their hunting behaviors to collect the team reward, whereas the lone-wolf behavior does not require coordination with the other agent and hence requires less network capacity.

Note that the qualitative difference in effects for network size supports our argument that the richer framework of SSDs is needed to capture important aspects of real social dilemmas. This rather striking difference between Gathering and Wolfpack is invisible to purely matrix-game-based MGSD-modeling. It only emerges when the different complexities of cooperative or defecting behaviors, and hence the difficulty of the corresponding learning problems, are modeled in a sequential setup such as an SSD.

⁶ The contrasting approach that seeks to build more structure into the reinforcement learning agents to enable more interpretable experimental manipulations is also interesting and complementary, e.g., [24].

6. DISCUSSION

In the Wolfpack game, learning a defecting lone-wolf policy is easier than learning a cooperative pack-hunting policy. This is because the former does not require actions to be conditioned on the presence of a partner within the capture radius. In the Gathering game the situation is reversed. Cooperative policies are easier to learn since they need only be concerned with apples and may not depend
Figure 6: Summary of matrix games discovered within Gathering (Left) and Wolfpack (Right) through extracting empirical payoff matrices. The games are classified by social dilemma type, indicated by color and quadrant. With the x-axis representing fear = P − S and the y-axis representing greed = T − R, the lower right quadrant contains Stag Hunt type games (green), the top left quadrant Chicken type games (blue), and the top right quadrant Prisoner’s Dilemma type games (red). Non-SSD type games, which either violate social dilemma condition (1) or exhibit neither fear nor greed, are shown as well.
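The quadrant assignment described in the Fig. 6 caption can be written down directly from the fear and greed coordinates. A minimal sketch; the function name and the `eps` tolerance are our own, not from the paper:

```python
def classify_matrix_game(R, P, S, T, eps=1e-9):
    """Classify a 2x2 game by the fear/greed coordinates used in Fig. 6:
    fear = P - S (x-axis), greed = T - R (y-axis)."""
    fear, greed = P - S, T - R
    if fear > eps and greed > eps:
        return "Prisoner's Dilemma"   # both greed and fear motivate defection
    if greed > eps:
        return "Chicken"              # greed only (top-left quadrant)
    if fear > eps:
        return "Stag Hunt"            # fear only (lower-right quadrant)
    return "not a fear/greed dilemma"
```

For example, the canonical Prisoner’s Dilemma ordering T > R > P > S (say T=4, R=3, P=2, S=1) gives fear = 1 and greed = 1 and lands in the top-right quadrant.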
on the rival player’s actions. However, optimally efficient cooperative policies may still require such coordination to prevent situations where both players simultaneously move on the same apple. Cooperation and defection demand differing levels of coordination for the two games. Wolfpack’s cooperative policy requires greater coordination than its defecting policy. Gathering’s defection policy requires greater coordination (to successfully aim at the rival player).

Both the Gathering and Wolfpack games contain embedded MGSDs with prisoner’s dilemma-type payoffs. The MGSD model thus regards them as structurally identical. Yet, viewed as SSDs, they make rather different predictions. This suggests a new dimension on which to investigate classic questions concerning the evolution of cooperation. For any to-be-modeled phenomenon, the question now arises: which SSD is a better description of the game being played? If Gathering is a better model, then we would expect cooperation to be the easier-to-learn “default” policy, probably requiring less coordination. For situations where Wolfpack is the better model, defection is the easier-to-learn “default” behavior and cooperation is the harder-to-learn policy requiring greater coordination. These modeling choices are somewhat orthogonal to the issue of assigning values to the various possible outcomes (the only degree of freedom in MGSD-modeling), yet they make a large difference to the results.

SSD models address similar research questions as MGSD models, e.g. the evolution of cooperation. However, SSD models are more realistic since they capture the sequential structure of real-world social dilemmas. Of course, in modeling, greater verisimilitude is not automatically virtuous. When choosing between two models of a given phenomenon, Occam’s razor demands we prefer the simpler one. If SSDs were just more realistic models that led to the same conclusions as MGSDs then they would not be especially useful. This, however, is not the case. We argue the implication of the results presented here is that standard evolutionary and learning-based approaches to modeling the trial-and-error process through which societies converge on equilibria of social dilemmas are unable to address the following important learning-related phenomena.

1. Learning which strategic decision to make, abstractly, whether to cooperate or defect, often occurs simultaneously with learning how to efficiently implement said decision.

2. It may be difficult to learn how to implement an effective cooperation policy with a partner bent on defection, or vice versa.

3. Implementing effective cooperation or defection may involve solving coordination subproblems, but there is no guarantee this would occur, or that cooperation and defection would rely on coordination to the same extent. In some strategic situations, cooperation may require coordination, e.g., standing aside to allow a partner’s passage through a narrow corridor, while in others defection may require coordination, e.g., blocking a rival from passing.

4. Some strategic situations may allow for multiple different implementations of cooperation, and each may require coordination to a greater or lesser extent. The same goes for multiple implementations of defection.
Figure 7: Factors influencing the emergence of defecting policies. Top row: Gathering. Shown are plots of average beam-use rate (aggressiveness) as a function of Napple (scarcity). Bottom row: Wolfpack. Shown are plots of (two minus) average-wolves-per-capture (lone-wolf capture rate) as a function of rteam (group benefit). For both Gathering and Wolfpack we vary the following factors: temporal discount (left), batch size (centre), and network size (right). Note that while the effects of discount factor and batch size on the tendency to defect point in the same direction for Gathering and Wolfpack, network size has the opposite effect (see text for discussion).
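The two behavioral metrics in the caption can be computed from episode logs. A minimal sketch assuming a simple hypothetical log format (a list of the agent’s actions per episode for Gathering; for Wolfpack, a list recording how many wolves were within the capture radius at each capture):

```python
def aggressiveness(actions, beam_action="beam"):
    """Gathering metric: average beam-use rate, i.e. the fraction of
    time steps on which the agent fired the tagging beam."""
    if not actions:
        return 0.0
    return sum(a == beam_action for a in actions) / len(actions)

def lone_wolf_rate(wolves_per_capture):
    """Wolfpack metric: two minus the average number of wolves per
    capture, so 1.0 means always hunting alone and 0.0 always paired."""
    avg = sum(wolves_per_capture) / len(wolves_per_capture)
    return 2.0 - avg
```

For example, a log in which both wolves were present at every capture gives a lone-wolf rate of 0.0, matching the caption’s (two minus) average-wolves-per-capture definition.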