
ARTICLES

Nicotinic receptors in the ventral tegmental area promote uncertainty-seeking
Jérémie Naudé1-3, Stefania Tolu1-3, Malou Dongelmans1-3, Nicolas Torquet1-3, Sébastien Valverde1-3, Guillaume Rodriguez1-3, Stéphanie Pons4, Uwe Maskos4, Alexandre Mourot1-3, Fabio Marti1-3 & Philippe Faure1-3

Cholinergic neurotransmission affects decision-making, notably through the modulation of perceptual processing in the cortex. In addition, acetylcholine acts on value-based decisions through as yet unknown mechanisms. We found that nicotinic acetylcholine receptors (nAChRs) expressed in the ventral tegmental area (VTA) are involved in the translation of expected uncertainty into motivational value. We developed a multi-armed bandit task for mice with three locations, each associated with a different reward probability. We found that mice lacking the nAChR β2 subunit showed less uncertainty-seeking than their wild-type counterparts. Using model-based analysis, we found that reward uncertainty motivated wild-type mice, but not mice lacking the nAChR β2 subunit. Selective re-expression of the β2 subunit in the VTA was sufficient to restore spontaneous bursting activity in dopamine neurons and uncertainty-seeking. Our results reveal an unanticipated role for subcortical nAChRs in motivation induced by expected uncertainty and provide a parsimonious account for a wealth of behaviors related to nAChRs in the VTA expressing the β2 subunit.

© 2016 Nature America, Inc. All rights reserved.

Acetylcholine (ACh) has a well-studied role in arousal, learning and attention1,2 and modulates perceptual decision-making, notably through its influence over prefrontal cortices3. Decisions are not only driven by sensory information, but also by the animal's expectation of the values associated with alternative choices4,5. ACh also affects cost-benefit decision-making6,7, albeit through unknown mechanisms. Notably, effects on value-based decisions induced by pharmacological manipulations of ACh or dopamine (DA) often mirror each other5. Systemic pharmacological manipulation of either DA or ACh receptors affects the choices between alternatives associated with different delays, costs or risk5-7. Disentangling the respective implications of ACh and DA in decision-making is of utmost interest, as psychological diseases such as tobacco addiction or schizophrenia involve alterations of both decision-making and ACh-DA interactions2,8.

By opposition to ACh, DA exerts a well-defined role in motivation and reinforcement9. DA neurons encode reward prediction errors as bursts of action potentials9,10. These bursts may be used as a teaching signal to learn the value of actions11 or as an incentive signal biasing the ongoing decisions12. The bursting activity of DA neurons from the VTA is influenced by ACh, notably through nicotinic acetylcholine receptors containing the β2 subunit (β2*-nAChRs)2,13-15. Thus, the similarity between the effects of DA and ACh on decision-making may arise from a nicotinic regulation of the VTA. We hypothesized that endogenous ACh, released from mesopontine nuclei to the VTA2,5,15, may be involved in value-based decisions.

In the context of decision-making, the concept of exploration is opposed to that of exploitation with regard to a known reward source16,17. Exploration occurs when an animal actively gathers information about alternative choices with the aim of reducing the uncertainty level on the consequences of possible actions18-21. This typically happens in a learning setting when the statistics of an outcome given a specific action, or its uncertainty, are in the process of being estimated. Once the consequences of possible actions have been estimated, the animal can use this knowledge of the environment to exploit reward sources efficiently. However, when the outcome of an action is probabilistic, uncertainty remains as to what will be the outcome of an action every time it is performed. This known variability of the outcome of an action, as in a repeated lottery, is referred to as expected uncertainty or reward risk22,23.

The motivation to perform an action can be modulated by expected uncertainty and lead to uncertainty-seeking or risk-taking. In animals, it is challenging to distinguish between a motivation to explore or exploit a probabilistic reward source, as it cannot be easily inferred whether animals might still try to reduce expected uncertainty by exploring19,24,25 or whether they are attracted by this known-unknown and thus exploit. Nevertheless, the influence of expected uncertainty on motivational value is experimentally tractable. It has been proposed that expected uncertainty may be signaled by ACh22 in the context of perceptual decision-making, but this theory has never been connected to an involvement of DA in value-based decision-making. Moreover, the neural basis underlying the motivation given to choices associated with expected uncertainty is not known. We computationally characterized the influence of VTA β2*-nAChRs on seeking probabilistic rewards in a multi-armed bandit task for mice and found that these receptors are involved in translating expected uncertainty into motivational value.

1Sorbonne Universités, UPMC University Paris 06, Institut de Biologie Paris Seine, UM 119, Paris, France. 2CNRS, UMR 8246, Neuroscience Paris Seine, Paris, France. 3INSERM, U1130, Neuroscience Paris Seine, Paris, France. 4Institut Pasteur, CNRS UMR 3571, Unité NISC, Paris, France. Correspondence should be addressed to P.F. (phfaure@gmail.com).

Received 9 October 2015; accepted 9 December 2015; published online 18 January 2016; doi:10.1038/nn.4223

nature NEUROSCIENCE VOLUME 19 | NUMBER 3 | MARCH 2016 471



Figure 1 Decisions under uncertainty in a mouse bandit task using intracranial self-stimulations. (a) Illustration of the spatial multi-armed bandit task design. Three explicit square locations were placed in the open field (0.8-m diameter), forming an equilateral triangle (50-cm side). Mice received an intracranial self-stimulation each time they were detected in the area of one of the rewarding locations. Animals, which could not receive two consecutive stimulations at the same location, alternated between rewarding locations. (b) Trajectories of one mouse (5 min) before (left) and after (middle) learning in the CS and US (right). (c) Time to goal (average duration from the last location to the goal) in the US as a function of the reward probability of the goal. Inset, times to goal were identical for the three locations in the CS (F(2,18) = 0.53, P = 0.59, one-way ANOVA). Insert, individual curves. N = 19 mice. (d) Traveled distance between two consecutive locations. In the US, WT mice traveled more distance when going toward less probable ICSS reward. Light gray, individual curves. (e) Instantaneous speed: in the US, the maximal speed of WT mice did not depend on the expected probability of the reward, contrary to what was observed in the DS with increasing intensity. Data are presented as mean ± s.e.m. Time 0 corresponds to the last time of ICSS delivery or omission. (f) Proportion of choices of the three rewarding locations as a function of reward probability in the US. Light gray, individual curves. Inset, proportion of choices were identical for the three locations in the CS (F(2,18) = 0.16, P = 0.86, one-way ANOVA). Error bars represent mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001. n.s., not significant at P > 0.05.
RESULTS

Mice-adapted multi-armed bandit task based on ICSS
In uncertain environments, living beings have to decide when to exploit known resources and when to explore alternatives. This exploitation-exploration dilemma is often studied in the multi-armed bandit task16,18, in which humans choose between different slot machines to discover the richest option. To assess the implication of nAChRs in decision-making under uncertainty, we designed a spatial version of the bandit task adapted to mice. Studies of animal choices often rely on food restriction, even though the satiation level is known to affect decisions under uncertainty26. To circumvent this issue, we trained mice to perform a sequence of choices in an open-field in which three locations were explicitly associated with intra-cranial self-stimulation (ICSS) rewards27,28 (Fig. 1a and Online Methods). Mice could not receive two consecutive ICSS at the same location. Consequently, they alternated between rewarding locations by performing a sequence of choices. Mice mostly went directly to the next rewarding location, but sometimes wandered around in the open field before reaching the goal (Fig. 1b). At each location, mice had to choose which next rewarding location to go to (amongst the two alternatives) and how directly they should get there.

We compared the behavior of wild-type (WT) mice under two settings of ICSS delivery: a certain setting (CS) in which all locations were associated with a given ICSS, and an uncertain setting (US), in which each location was associated with a different probability of ICSS delivery (Fig. 1a). Although trajectories in the CS were stereotyped, reward uncertainty induced a markedly different behavioral pattern in the US (Fig. 1b). The time to goal was identical for the three locations in the CS (Fig. 1c), but was greater for locations associated with lower reward probabilities in the US (F(2,18) = 6.8, P = 0.002, one-way ANOVA; Fig. 1c). More precisely, the reward probability of the goal affected the traveled distance (F(2,18) = 7.3, P = 0.002; Fig. 1d), but not the traveling speed (F(2,18) = 0.48, P = 0.62; Fig. 1e) or the dwell times (Supplementary Fig. 1a). This contrasts with the effect of reward intensity, which affected the speed profiles between two rewarding locations (Supplementary Fig. 1b). Thus, in this setup, reward intensity affected the invigoration of goal-directed movements, whereas reward probability affected the extent of locomotion, reflecting the tendency to explore the open field between two visits.

In addition, mice distributed their choices of ICSS according to the reward probability associated to each location. As expected, in the CS, mice treated each rewarding location the same way (Fig. 1f). In the US, however, mice visited the locations associated with higher ICSS probability more often (F(2,18) = 113, P < 0.001, one-way ANOVA; Fig. 1f). Because mice could not receive two consecutive ICSSs, the repartition on the rewarding locations (Fig. 1f) arose from a sequence of binary choices in three gambles (G1, G2, G3) between two respective payoffs (here, G1 = {100 versus 50%}, G2 = {50 versus 25%}, G3 = {100 versus 25%}; Fig. 2a,b). For each gamble, mice chose the optimal location (associated with the highest probability of reward; Fig. 2b) more than 50% of the time, but less than 100% of the time. When they had to choose between a certain (100%) and an uncertain (50%) ICSS, mice displayed a low preference (56%) for the optimal location, suggesting a positive inclination toward reward uncertainty (Fig. 2a)29,30.

A positive motivational value to expected uncertainty
In standard rodent decision tasks in which there is only a single choice, the relative influence of expected value and uncertainty on choices is difficult to dissect, as both parameters vary with reward probability. For binary outcomes (the choice is rewarded or not), the expected mean reward corresponds to the reward probability p,


Figure 2 Model-based analysis of decisions shows motivation for expected uncertainty. (a) Illustration of the modeling of the task. Top, transition model of animal choices, P(At|Ct-1) = f(pA,pB). Each rewarding location is modeled as a state, labeled {A,B,C}. The probability of transition from one state to another depends on the reward probabilities of the two available options. Middle, expected reward and uncertainty as a function of reward probabilities (curves). In the three gambles, the differences in expected values (0.5 in G1, 0.25 in G2, 0.75 in G3) and expected uncertainties (0.25 in G1, 0.0625 in G2, 0.1875 in G3) are distinct. Bottom, model of locomotion, TTG(A) = f(Rt-1,pA). The time to goal depends on both reward history (whether the mouse received a reward in the previous location or not) and reward expectation at the goal. (b) Left, proportions of exploitative choices (choice of the most valuable alternative, that is, with the highest probability of reward in a given gamble) of the mice, for the three gambles. Dots, individual data points. Right, predicted transition of the three decision models (lines; epsilon-greedy, softmax and uncertainty) corresponding to the experimental data (dots, same value as in the left panel). Error bars represent mean ± s.e.m. (c) Left, time to goal (experimental data and model fit, mean ± s.e.m.) as a function of reward probability of the goal and reward history. Data merged from experiments with different sets of reward probability. Right, regression coefficients from the best-fitting model of locomotion, corresponding to a constant (T0) and the dependencies on reward history (TR), expected reward (TE) and expected uncertainty (T).
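
The locomotion model summarized in panel c can be illustrated with a minimal sketch, assuming an ordinary least-squares regression of time to goal on a constant, the previous-trial reward, the reward probability p of the goal and its variance p(1 − p). The function names, the trial format and the coefficient values in `TRUE_COEF` are hypothetical and serve only to generate noise-free synthetic trials; they are not the authors' code or fitted values.

```python
def features(prev_reward, p_goal):
    """Regressors for one trial: constant, reward history,
    expected reward (p) and expected uncertainty (p * (1 - p))."""
    return [1.0, float(prev_reward), p_goal, p_goal * (1.0 - p_goal)]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (a is square, given as a list of rows)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_time_to_goal(trials):
    """Least-squares fit via the normal equations.
    trials: list of (prev_reward, p_goal, time_to_goal) tuples."""
    rows = [features(r, p) for r, p, _ in trials]
    y = [t for _, _, t in trials]
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical coefficients (constant, reward history, expected reward,
# expected uncertainty), used only to build synthetic, noise-free trials.
TRUE_COEF = [4.0, 0.5, -1.5, -1.5]

trials = []
for prev in (0, 1):
    for p in (0.25, 0.5, 0.75, 1.0):
        ttg = sum(c * f for c, f in zip(TRUE_COEF, features(prev, p)))
        trials.append((prev, p, ttg))

estimated = fit_time_to_goal(trials)
```

Because the constant, p and p(1 − p) are linearly independent once at least three distinct reward probabilities are sampled, the expected-reward and expected-uncertainty terms can be estimated separately, which is the point of using several sets of reward probabilities.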

whereas expected uncertainty is related to reward variance, p(1 − p) (Fig. 2a). Expected uncertainty is zero for predictable outcomes (100% or 0% probability) and maximal at 50% probability (the most unpredictable outcome). In our setup, the difference in expected uncertainty and value between the outcomes was distinct for each of the three gambles (Fig. 2a), which provides enough constraints to differentiate between the influence of two co-varying parameters (reward mean and variance). We compared computational models of decision-making16,31 (Online Methods), each representing a different influence of expected reward and uncertainty on choices, to assess which model best explained the experimental data (Supplementary Fig. 2). In the epsilon-greedy model, animals always choose the best option, minus a fixed probability. In this model, the choices for the optimal reward are identical whatever the gamble is (Fig. 2b), which did not correspond to the experimental data. In the softmax model, choices depend on the difference between the expected rewards of the two alternatives. The softmax model formalizes that the larger the difference in rewards is, the higher the probability to select the best option will be. This model predicts that the proportions of optimal choices would be sorted in the following order {G2 < G1 < G3}, differing from what was found experimentally {G1 < G2 < G3}. Finally, in the uncertainty model, decision is biased toward actions with the most uncertain consequences by assigning a bonus value32 to their expected uncertainties19,21,24,29. This last model accurately reproduced the pattern of mice preferences (Fig. 2b) and best accounted for our experimental data (Supplementary Fig. 2), as shown by model comparison (likelihood penalized for the number of parameters, Bayesian Information Criterion (BIC); Online Methods). Furthermore, the two parameters of the uncertainty bonus model disentangle two determinants of decision-making: the inverse temperature parameter depicts the randomness in choices, whereas the uncertainty-seeking parameter represents the value given to expected uncertainty. The positive uncertainty bonus (1.01 ± 0.24, mean ± s.e.m.) explains the great attractiveness of the 50% choice in G1 by a powerful motivation induced by its expected uncertainty. We assessed the robustness of the data and of the model by fitting four sets of probabilities, with multiple different differences of expected reward and uncertainties (Supplementary Fig. 3), and compared alternative models (matching law33 and uncertainty-normalized temperature34; Supplementary Fig. 2). Overall, we found that expected uncertainty positively biased the choices in WT mice.

As stated above, two types of decisions are nested in the task: the sequence of choices (which goal?) and the locomotion (how to reach the goal?). To investigate the influence of uncertainty on the latter, we performed multiple linear regressions of time to goal. Comparison of linear models (BIC; Online Methods and Supplementary Fig. 2) revealed that the time to goal depended on the reward probability of the goal, but not on the alternative (the location not chosen in the gamble). These observations suggest a dual-stage process in which animals first choose which location to go to and then how to reach it. Furthermore, the dependence on reward history (TR = 0.49 ± 0.21, mean ± s.e.m.) suggests that when mice had just gotten rewarded, they traveled further in the open field (Fig. 2c). We also found that the time to goal was decreased by the expected reward (TE = 1.63 ± 0.16; Fig. 2c) and by the expected uncertainty (T = 1.56 ± 0.33). This suggests that expected uncertainty increased motivation to go straight toward the rewarding goal. Thus, model-based analyses suggest that, in the two decision problems (which location and how to get there), mice assign a positive motivational value (through the uncertainty-seeking parameter and the expected-uncertainty regression coefficient) to the expected uncertainty of the goal.


Figure 3 β2*-nAChRs in the VTA affect choices and locomotion. (a) Behavioral trajectories after learning in the US for β2KO (red) and β2VEC (blue) mice. (b) Proportion of choices of the three rewarding locations plotted as a function of reward probability in the US for the WT (black), β2KO (red, n = 11) and β2VEC (blue, n = 12) mice. Insets, individual curves for the β2KO (top, red) and β2VEC (bottom, blue) mice. (c) Time to goal (in seconds) as a function of reward probability of the goal for the WT (black), β2KO (red) and β2VEC (blue) mice. Insets, individual curves for the β2KO (top, red) and β2VEC (bottom, blue) mice. (d) Examples of in vivo juxtacellular recordings of the firing pattern of DA neurons from anesthetized WT (black), β2KO (red) and β2VEC (blue) mice. (e) Cumulative distribution of percent of spikes in a burst (%SWB). Insets, mean frequency (left) and %SWB (right) of VTA DA neurons from the three genotypes (obtained from 22 WT mice, 13 β2KO mice and 13 β2VEC mice). (f) Typical electrophysiological recording illustrating the effect of intravenous injection of nicotine on the firing pattern of DA neurons in β2KO (red) and β2VEC (blue) mice. Dots, individual data points. (g) Relative variation in firing frequency (left) and absolute variation in %SWB of DA neurons from the three genotypes (obtained from 14 WT mice, 13 β2KO mice and 13 β2VEC mice) in response to nicotine. S., saline; N., nicotine. Error bars represent mean ± s.e.m. (h) Coronal sections of the VTA showing the site of lentivirus injection revealed that β2-eGFP colocalized with TH, a dopaminergic marker. Transduction of β2-eGFP virus was efficient in both dopaminergic and non-dopaminergic cells. Dots, individual data points. *P < 0.05, **P < 0.01, ***P < 0.001. n.s., not significant at P > 0.05.

VTA β2*-nAChRs are involved in motivation by uncertainty
In the ICSS bandit task, WT mice displayed a robust preference for uncertain outcomes. Thus, mice estimate expected uncertainty to direct their decisions and locomotion29,30. The suggested role of ACh in signaling expected uncertainty22 prompted us to investigate whether nAChRs are involved in uncertainty-driven motivation. We used mice in which the β2 subunit, the most abundant nicotinic subunit in the brain1,2, was deleted (β2KO mice), in our ICSS bandit task. In the CS, β2KO mice (β2KO and β2GFP (β2KO mice injected with a lentivirus expressing just eGFP); Online Methods) learned the task and responded to different ICSS current intensities similarly to WT mice (Supplementary Fig. 4), confirming the modest implication of nAChRs in decision-making with certain rewards35,36. In contrast, in the US (Fig. 3a), β2KO mice systematically chose the location associated with the highest uncertainty level (that is, 50% probability) to a lower extent than WT mice (T(28) = 5.4, P < 0.001, unpaired t test; Fig. 3b). Furthermore, the relationship between time to goal and reward probability of the goal (F(2,10) = 0.33, P = 0.72, one-way ANOVA; Fig. 3c and Supplementary Fig. 4) was abolished in β2KO animals. These results suggest a role for β2*-nAChRs in decision-making under uncertainty.

We next tested whether β2*-nAChRs could affect motivation by expected uncertainty by acting on VTA DA neurons, which are important for value-based decision-making9,12. Extracellular in vivo single-unit recordings in anesthetized animals (Fig. 3d) confirmed that, when compared with those of WT mice, DA neurons from β2KO mice displayed a decreased firing frequency (2.1 versus 3.2 Hz, T(74) = 2.4, P < 0.001, Welch t test), lacked bursting activity (U = 1,637, P = 0.002, Mann-Whitney test; Fig. 3e) and did not respond to a systemic injection of nicotine (104.6 ± 1.34% from baseline frequency, V = 103, P = 0.95, Wilcoxon test; Fig. 3f,g and Supplementary Fig. 5e,f)13,14. If β2*-nAChRs underlie uncertainty-driven motivation in the VTA, then restoring expression of these receptors in the VTA of β2KO mice should restore both the sensitivity to expected uncertainty and DA activity. We achieved selective re-expression of the β2 subunit in the VTA of β2KO mice (β2VEC) using a lentiviral vector13 strategy (Online Methods). Coronal sections revealed that viral re-expression was restricted to the VTA (Fig. 3h and Supplementary Fig. 5a-d). DA cells from β2VEC mice displayed a spontaneous firing frequency (T(156) = 1.6, P = 0.1, unpaired t test) and bursting activity (U = 3288, P = 0.9, Mann-Whitney test) similar to those observed in WT mice, and responded to nicotine (120.2 ± 4.78% from baseline frequency,

Table 1 Behavioral measures and model parameters in the uncertain setting

Measure                                        WT (mean ± s.e.m.)  β2KO (mean ± s.e.m.)  β2VEC (mean ± s.e.m.)  ANOVA                        β2VEC versus β2KO         β2VEC versus WT
Repartition at P = 1/2 (Fig. 3b)               35.6 ± 0.5%         31.3 ± 0.5%           34.8 ± 0.7%            F(2,39) = 13.45, P < 0.001   T(21) = 3.86, P < 0.001   T(29) = 0.96, P = 0.35
Gamble 1 (Fig. 4a)                             55.6 ± 2.2%         69.1 ± 3.5%           54.1 ± 3.7%            F(2,39) = 6.95, P = 0.002    T(21) = 3.04, P = 0.006   T(29) = 0.30, P = 0.77
Uncertainty-seeking parameter (Fig. 4b)        1.01 ± 0.24         0.38 ± 0.47           1.21 ± 0.23            F(2,39) = 6.89, P = 0.003    T(21) = 3.1, P = 0.005    T(29) = 0.6, P = 0.56
Inverse temperature parameter (Fig. 4b)        1.57 ± 0.16         1.14 ± 0.28           1.21 ± 0.16            F(2,39) = 1.6, P = 0.22
Reward history coefficient (Fig. 4e)           0.49 ± 0.12         0.52 ± 0.11           0.26 ± 0.16            F(2,39) = 1.6, P = 0.22
Reward expectation coefficient (Fig. 4e)       1.63 ± 0.16         0.02 ± 0.17           1.21 ± 0.23            F(2,39) = 18.4, P < 0.001    T(21) = 4.0, P < 0.001    T(29) = 1.55, P = 0.13
Uncertainty expectation coefficient (Fig. 4e)  1.56 ± 0.33         0.58 ± 0.37           0.88 ± 0.28            F(2,39) = 9.8, P = 0.001     T(21) = 3.2, P = 0.005    T(29) = 1.42, P = 0.17


Figure 4 Model-based analysis reveals a role for VTA β2-nAChR in uncertainty-driven motivation. (a) Transition (proportions of exploitative choices) in the three gambles (G1, 100% versus 50%; G2, 50% versus 25%; G3, 100% versus 25%), for the WT (black), β2KO (red) and β2VEC (blue) mice. Dots, individual data points. (b) Value of the parameters (uncertainty-seeking and exploration parameters) derived from the model-based analysis (uncertainty model) of the transition functions for the WT (black), β2KO (red) and β2VEC (blue) mice. The color code indicates the predicted transition in gamble 1 (100 versus 50% reward probability) as a function of the parameters of the model. (c,d) Time to goal as a function of reward probability of the goal and reward history for β2KO (c) and β2VEC (d) mice. Experimental data (black dots with error bars) and model fit (stripes) are displayed as mean ± s.e.m. Data are merged from experiments with four sets of reward probabilities. (e) Regression coefficients from the best-fitting model of locomotion, corresponding to a constant (T0) and the dependencies on reward history (TR), expected reward (TE) and expected uncertainty (T), for the WT (black), β2KO (red) and β2VEC (blue) mice. Data are presented as mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001. n.s., not significant at P > 0.05.

V = 960, P < 0.001, Wilcoxon test), suggesting that, as previously established13,14, physiological functions were also restored. Notably, β2VEC mice differed from β2KO animals, but not from WT mice (Table 1), when analyzing the uncertainty-related choices (Fig. 3b) and the times to goal (Fig. 3c and Supplementary Fig. 4), indicating a restoration of the WT phenotype following re-expression of β2 in the VTA.

Figure 5 β2*-nAChRs affect decision-making under uncertainty in a dynamical foraging task. (a) Top, illustration of the task design. During each session, animals receive stimulations in two (of three) potential locations, with the two rewarding locations (indicated by an R in the colored circle) changing between sessions. Bottom, behavioral trajectories in the three 2-min sessions for the WT (black) and β2KO (red) mice. (b,c) Repartition (in %) on the three locations (color-coded as in a) for the three sessions. Calculation is divided by half-session durations for the WT (b, n = 13) and β2KO (c, n = 11) mice. (d) Proportion of rewarded choices averaged on three sessions for WT (black) and β2KO (red) mice. Dots, individual data points. (e,f) Model fits of the experimental data shown in b and c. (g) Uncertainty-seeking parameter (that is, value given to uncertainty) of the models for the WT (black) and β2KO (red). Dots, individual data points. Data are presented as mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001.

We next used the model-based analysis to characterize the role of VTA β2*-nAChRs in decision-making. Transition functions of β2KO and WT mice differed in particular in G1 (100 versus 50%, T(28) = 3.54, P = 0.001, unpaired t test; Fig. 4a), suggesting an alteration of decisions under uncertainty. Indeed, the behavior of β2KO mice was best explained (Supplementary Fig. 6) either by the softmax model or the uncertainty model in which the sensitivity to uncertainty was null on average (T(11) = 0.8, P = 0.44, t test; Fig. 4b). Both models point toward the same interpretation: β2*-nAChRs are necessary for translating uncertainty signals into motivational value. Accordingly, uncertainty-seeking was significantly different in β2KO and WT mice (T(29) = 2.9, P = 0.007, unpaired t test). Notably, the model-based analysis supports the conclusion that β2*-nAChRs selectively re-expressed in the VTA restored the positive value of expected uncertainty (Table 1 and Supplementary Fig. 7). Moreover, the analysis of the trajectories in-between goals indicates that neither expected reward nor expected uncertainty of the next goal influenced the time to goal in β2KO mice, whereas both parameters affected time to goal in β2VEC mice (Table 1 and Fig. 4c-e). Mice from each genotype all traveled more distance when the previous trial was rewarded, compared to when it was not (F(2,39) = 0.02, P = 0.98). Together with the transition model, where the temperature parameter was not significantly different between genotypes (F(2,39) = 1.6, P = 0.2,


Figure 6 New interpretation of behaviors related to VTA nAChRs using the uncertainty model.
(a) Spatial learning task40. In a cross maze, the north arm contained a reachable food reward
and the south arm contained an unreachable food. The initial position of the animal was
variable (east or west). (b) Discretized representation of the task. S1–S4 are the four
possible states in the task; R = {0,1} indicates whether the food reward was attained or not.
Arrows represent the possible transitions between the states. Data adapted with permission
from ref. 40. (c) Simulation (stripes) of the time to reach the food (data: lines with error
bars) along the learning sessions for the WT (black) and β2KO (red) mice. Parameters were
= 0.54, = 7.75, = 1.51 for WT mice, = 0.59, = 7.93, = 0.04 for β2KO mice. (d) Effect of the
uncertainty-seeking parameter in the simulation of the time to reach the food. (e) Passive
avoidance task41 consisting in a single training trial in which the mouse was delivered a
foot shock upon entrance in the dark chamber. Data adapted with permission from ref. 41.
(f) Simulation (stripes) of retention latencies (data: lines with error bars) in response to
various intensities of foot shock. Parameters were = 1.81, = 0.53, = 1.62 for WT, = 1.62,
= 0.15, = 1.65 for β2KO. Error bars represent mean ± s.e.m.

[Figure 6 panels: (a) cross maze with reachable food (N.), unreachable food (S.) and start
positions (W., E.); (b) state diagram with initial states S1 and S2, uncertain state S3 and
final (absorbing) state S4, R = 1; (c, d) time to reward (s) versus session number; (e)
training and test in the dark and light compartments, with return latency (s); (f) retention
latency (s) versus footshock intensity (mA). Data and model are shown for WT and β2KO.]

one-way ANOVA; Fig. 4b), the time-to-goal model strongly suggests that β2*-nAChRs do not
affect the global motivation to explore, but rather specifically affect expected uncertainty
on choices (which goal) and locomotion (how to reach it).

β2*-nAChRs and uncertainty-seeking in a dynamic environment
Having characterized the role of β2*-nAChRs in motivation by expected uncertainty at steady
state, we next asked whether our results could be extended to unstable environments. We
analyzed the behavior of WT and β2KO mice during the learning sessions of the CS
(Supplementary Fig. 8a,b), when reward probabilities were not known yet, and modeled it with
reinforcement-learning (RL) models16,29,31,37,38 (Online Methods and Supplementary
Fig. 8c–f). In the standard RL model, animals learn the expected value of the three rewarding
locations using reward prediction errors (the difference between actual reward and predicted
value)31. In the model, animals use these values to select the next action using a softmax
decision rule. We extended the standard RL model to uncertainty learning. Animals can use
reward prediction errors to estimate reward uncertainty21,23,24,39: the larger the errors
(positive or negative) are, the more uncertain the outcomes will be. This uncertainty RL
model best explained the behavior of WT mice (Supplementary Fig. 8c,e). By contrast, the
behavior of β2KO mice was best accounted for by a standard RL model, that is, without
uncertainty bonus (Supplementary Fig. 8d,f).

To further test the importance of β2*-nAChRs for translating expected uncertainty into
motivational value, we compared the behavior of WT and β2KO mice in a dynamic setting (DS) in
which the locations delivering the ICSS reward changed over time (Online Methods). In the DS,
mice underwent three consecutive sessions in which only two of the three locations delivered
the ICSS. Overall, WT and β2KO mice adapted their strategies to these changes in reward
contingencies (Fig. 5a). Starting from a random strategy, both WT and β2KO mice learned the
position of the two rewarding locations in the first session (Fig. 5b,c). However, β2KO mice
persevered in their earlier choices throughout the changes in outcomes (Fig. 5b,c), resulting
in slightly fewer rewarded choices than for WT mice (T(22) = 2.7, P = 0.01, unpaired t test;
Fig. 5d). Model comparison (RL models; Supplementary Fig. 9 and Online Methods) suggested
that an uncertainty bonus model best explained the behavior of WT mice (Supplementary
Fig. 9). This uncertainty model reproduced the choices of WT animals during the changes in
rewarding outcomes (Fig. 5e), with a positive bonus given to uncertainty ( = 2.18 ± 0.77).
This is consistent with the results in the CS task (Fig. 4 and Supplementary Fig. 8).
Moreover, model comparison (Supplementary Fig. 9) suggested that experimental data were not
better explained by indirect effects arising from learning, that is, asymmetric (different
for reward and omission38) or adaptive (uncertainty-dependent) learning rates37. In summary,
our results and models support the idea that, in WT mice, expected uncertainty exerts a
direct motivational effect. By contrast, the behavior of β2KO mice could be explained by
either the standard RL model or the expected uncertainty model (Fig. 5f and Supplementary
Fig. 8b,d,f). In this latter model, the uncertainty-seeking parameter in the β2KO mice was
significantly lower than that of WT mice (T(22) = 2.4, P = 0.027, unpaired t test; Fig. 5g)
and not significantly different from zero (T(22) = 0.6, P = 0.54). These results provide
further evidence that β2*-nAChRs are involved in uncertainty-seeking.

Uncertainty-seeking in other nAChRs-related behaviors
Finally, using computational approaches, we assessed whether the role of VTA β2*-nAChRs in
uncertainty-seeking might pervade other decisions about natural rewards, punishments and
salient aspects of the environment. Paradoxically, it has been found that mice lacking the β2
subunit perform seemingly better than WT mice, displaying improved spatial learning40 and
passive avoidance41. The spatial learning test consists of a maze with a reachable food
reward at one of the four arms and an unreachable food at the opposite arm (Fig. 6a). We
simulated this task with an RL model embedding uncertainty-seeking (Fig. 6b and Online
Methods). The model fitted the behavior of both strains, as the time to reach the food was
greater for WT (with an uncertainty bonus) than for β2KO mice (without bonus) in the early
trials40 (Fig. 6c). This slowly decreasing time to reward progressively emerges in RL models
embedding uncertainty-seeking (Fig. 6d), but cannot be easily explained in terms of
differences in initial value (novelty-seeking31,32), learning rates or a combination of both
(Supplementary Fig. 10). Hence, interest for the unreachable reward may arise in WT mice from
uncertainty, integrated at the level of the VTA. We also assessed whether the same explanation




holds in the case of punishment. We simulated the passive avoidance task, where animals were
in a box divided in two compartments (light and dark). β2KO mice avoided the dark
compartment, which was associated with a foot shock, for a longer time than WT mice41
(Fig. 6e). Uncertainty-seeking can also explain this difference, as the foot shock induces a
single negative prediction error, which results in uncertainty (Fig. 6f). Expected
uncertainty may in that case motivate WT mice, but not β2KO mice, to explore the dark part of
the box in spite of potential negative consequences. Finally, these models can be extended to
neutral, but potentially uncertain, outcomes. The deficits of β2KO mice in locomotion in an
open-field without rewards13,42 can be understood as a lack of uncertainty-seeking
(Supplementary Fig. 11a–e). Exploration in the open-field is composed of action patterns
related to information-seeking (scanning, rearing and sniffing42,43). The apparent lack of
object recognition observed in β2KO mice40 can also be interpreted as a lack of curiosity for
the objects, that is, an absence of uncertainty-seeking (Supplementary Fig. 11f,g), rather
than a memory deficit. The uncertainty-seeking model not only generalizes our results to
positive, aversive and neutral natural outcomes, but also provides a parsimonious
interpretation for a wealth of behaviors associated with β2*-nAChRs13,40–42.

DISCUSSION
Our findings reveal a role for VTA β2*-nAChRs in translating expected uncertainty into
motivational value and suggest that this receptor is involved in exploratory decisions. Three
broad theoretical types of exploration have been proposed. At one extreme, exploration is
seen as randomness or noise in the choices (as in the softmax or epsilon greedy models),
which is problematic, as rodents, similar to primates, display curiosity and refined forms of
exploration29,42,43. At the other extreme, a dichotomy has been postulated between
subcortical systems (such as the VTA) and frontal cortices. Frontal cortices would mediate
flexible exploration by overriding16,17 the influence of exploitive value, underpinned by DA
neurons9,11. Our results lie in between these two extremes and are consistent with
theoretical work on optimal exploration18,20,21 and intrinsic motivation19,24. In this view,
exploration and exploitation are entangled: uncertainty is given a value that can be compared
to and added to the value of primary rewards18,20,21,32. Our findings further suggest that
motivation driven by expected uncertainty may be sufficient to explain exploration in
unstable environments. This contrasts with neuroeconomics studies, where expected uncertainty
is defined as risk and corresponds to the exploitation of the irreducible variability of the
outcomes, whereas exploration is driven only by reducible uncertainty23,44. However,
assigning a given choice to exploration or exploitation is tricky in non-human animals. It
relies on phenomenological models of behavior in the absence of direct reports of decision
strategies. Thus, our data clearly show that VTA β2*-nAChRs affect motivation from expected
uncertainty in both stable and unstable environments, but whether this corresponds to
motivation to explore, exploit or both remains unclear. Nevertheless, our results are
consistent with a causal role for the VTA in decisions under uncertainty via a common
currency (a motivational metric) that integrates at least the values of both expected reward
and expected uncertainty20,21,25,32.

DA neurons not only encode reward prediction errors9,10, but also surprise45, risk25 (that
is, expected uncertainty) and resolution of uncertainty46, which are all linked to
information. DA neuron bursting related to reward is thought to constitute a teaching signal
for actions11 or an incentive signal12 biasing the ongoing behavior. DA activity not related
to expected rewards per se could also act as an intrinsic reinforcement signal24 (or an
intrinsic incentive) for which gathering information would be self-satisfactory, helping the
animal to better predict its environment19,45. ACh is closely related to information
processing1,2. We found that the cholinergic control of DA could underpin the motivational
properties of information. This finding could explain the observed similarities when ACh or
DA are pharmacologically manipulated during value-based decisions5. Several mechanisms may
underlie functional ACh-DA interactions in the brain. Mesopontine ACh might directly signal
expected uncertainty (2 in our model), as proposed for forebrain ACh22. Alternatively, our
data suggest a contribution of β2*-nAChRs to the spontaneous excitability of DA neurons, with
anesthetized β2KO animals lacking bursting of DA neurons14. In this case, cholinergic
signaling onto the VTA via β2*-nAChRs could serve as a permissive gate15, rendering DA
neurons more responsive (that is, affecting ) to uncertainty signals generated elsewhere in
the mesocorticolimbic loop23. A strong prediction of these interpretations would be that ACh
is implicated in the encoding of expected uncertainty by DA neurons25. Nevertheless, we
cannot totally exclude, with our lentiviral strategy, downstream adaptations in β2KO mice or
an effect at the level of axon terminals, where β2*-nAChRs also influence the transfer
function between DA firing and release in the striatum47.

Nicotine hijacks endogenous cholinergic signaling by exerting its reinforcing effects through
β2*-nAChRs in the VTA13. But nicotine also affects decisions that are not related to nicotine
intake itself. Smokers are actually known to display alterations of the
exploration-exploitation tradeoff48 and of risk-sensitivity49. Notably, tobacco addiction and
pathological gambling, which can be seen as excessive uncertainty-seeking25,30, display a
high comorbidity8. Thus, we suggest that β2*-nAChRs in the VTA, in addition to mediating
reinforcement to nicotine, constitute a key neural component in the alterations of
decision-making under uncertainty observed in nicotine addicts48,49.

Methods
Methods and any associated references are available in the online version of the paper.

Note: Any Supplementary Information and Source Data files are available in the online version
of the paper.

Acknowledgments
We thank E. Guigon for discussions, C. Prvost-Soli for technical support, and J.-P. Changeux,
E. Ey, G. Dugu and A. Boo for comments on the manuscript. This work was supported by the
Centre National de la Recherche Scientifique CNRS UMR 8246, the University Pierre et Marie
Curie (Programme Emergence 2012 for J.N. and P.F.), the Agence Nationale pour la Recherche
(ANR Programme Blanc 2012 for P.F., ANR JCJC to A.M.), the Neuropole de Recherche Francilien
(NeRF) of Ile de France, the Foundation for Medical Research (FRM, Equipe FRM DEQ2013326488
to P.F.), the Bettencourt Schueller Foundation (Coup d'Elan 2012 to P.F.), the Ecole des
Neurosciences de Paris (ENP) to P.F., the Fondation pour la Recherche sur le Cerveau (FRC et
les rotariens de France, espoir en tête 2012) to P.F. and the Brain & Behavior Research
Foundation for a NARSAD Young Investigator Grant to A.M. The laboratories of P.F. and U.M.
are part of the École des Neurosciences de Paris Ile-de-France RTRA network. P.F. and U.M.
are members of the Laboratory of Excellence, LabEx Bio-Psy, and P.F. is a member of the DHU
Pepsy.

AUTHOR CONTRIBUTIONS
J.N. and P.F. designed the study. S.T. and J.N. performed the virus injections. M.D., N.T.,
G.R. and J.N. performed the behavioral experiments. S.V. and F.M. performed the
electrophysiological recordings. S.P. and U.M. provided the genetic tools. J.N., F.M. and
P.F. analyzed the data. J.N., A.M. and P.F. wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.




Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Everitt, B.J. & Robbins, T.W. Central cholinergic systems and cognition. Annu. Rev. Psychol. 48, 649–684 (1997).
2. Dani, J.A. & Bertrand, D. Nicotinic acetylcholine receptors and nicotinic cholinergic mechanisms of the central nervous system. Annu. Rev. Pharmacol. Toxicol. 47, 699–729 (2007).
3. Guillem, K. et al. Nicotinic acetylcholine receptor β2 subunits in the medial prefrontal cortex control attention. Science 333, 888–891 (2011).
4. Rangel, A., Camerer, C. & Montague, P.R. A framework for studying the neurobiology of value-based decision making. Nat. Rev. Neurosci. 9, 545–556 (2008).
5. Fobbs, W.C. & Mizumori, S.J. Cost-benefit decision circuitry: proposed modulatory role for acetylcholine. Prog. Mol. Biol. Transl. Sci. 122, 233–261 (2014).
6. Kolokotroni, K.Z., Rodgers, R.J. & Harrison, A.A. Acute nicotine increases both impulsive choice and behavioral disinhibition in rats. Psychopharmacology (Berl.) 217, 455–473 (2011).
7. Mendez, I.A., Gilbert, R.J., Bizon, J.L. & Setlow, B. Effects of acute administration of nicotinic and muscarinic cholinergic agonists and antagonists on performance in different cost-benefit decision making tasks in rats. Psychopharmacology (Berl.) 224, 489–499 (2012).
8. McGrath, D.S. & Barrett, S.P. The comorbidity of tobacco smoking and gambling: a review of the literature. Drug Alcohol Rev. 28, 676–681 (2009).
9. Schultz, W. Multiple dopamine functions at different time courses. Annu. Rev. Neurosci. 30, 259–288 (2007).
10. Waelti, P., Dickinson, A. & Schultz, W. Dopamine responses comply with basic assumptions of formal learning theory. Nature 412, 43–48 (2001).
11. Montague, P.R., Dayan, P. & Sejnowski, T.J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
12. Berridge, K.C. From prediction error to incentive salience: mesolimbic computation of reward motivation. Eur. J. Neurosci. 35, 1124–1143 (2012).
13. Maskos, U. et al. Nicotine reinforcement and cognition restored by targeted expression of nicotinic receptors. Nature 436, 103–107 (2005).
14. Mameli-Engvall, M. et al. Hierarchical control of dopamine neuron-firing patterns by nicotinic receptors. Neuron 50, 911–921 (2006).
15. Grace, A.A., Floresco, S.B., Goto, Y. & Lodge, D.J. Regulation of firing of dopaminergic neurons and control of goal-directed behaviors. Trends Neurosci. 30, 220–227 (2007).
16. Daw, N.D., O'Doherty, J.P., Dayan, P., Seymour, B. & Dolan, R.J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).
17. Frank, M.J., Doll, B.B., Oas-Terpstra, J. & Moreno, F. Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nat. Neurosci. 12, 1062–1068 (2009).
18. Gittins, J.C. & Jones, D.M. A dynamic allocation index for the discounted multiarmed bandit problem. Biometrika 66, 561–565 (1979).
19. Scott, P.D. & Markovitch, S. Learning novel domains through curiosity and conjecture. IJCAI (US) 1, 669–674 (1989).
20. Kaelbling, L.P. Learning in Embedded Systems (MIT Press, 1993).
21. Meuleau, N. & Bourgine, P. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Mach. Learn. 35, 117–154 (1999).
22. Yu, A.J. & Dayan, P. Uncertainty, neuromodulation, and attention. Neuron 46, 681–692 (2005).
23. Bach, D.R. & Dolan, R.J. Knowing how much you don't know: a neural organization of uncertainty estimates. Nat. Rev. Neurosci. 13, 572–586 (2012).
24. Oudeyer, P.-Y. & Kaplan, F. What is intrinsic motivation? A typology of computational approaches. Front. Neurorobot. 1, 6 (2007).
25. Fiorillo, C.D., Tobler, P.N. & Schultz, W. Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
26. Schuck-Paim, C., Pompilio, L. & Kacelnik, A. State-dependent decisions cause apparent violations of rationality in animal choice. PLoS Biol. 2, e402 (2004).
27. Carlezon, W.A. Jr. & Chartoff, E.H. Intracranial self-stimulation (ICSS) in rodents to study the neurobiology of motivation. Nat. Protoc. 2, 2987–2995 (2007).
28. Kobayashi, T., Nishijo, H., Fukuda, M., Bureš, J. & Ono, T. Task-dependent representations in rat hippocampal place neurons. J. Neurophysiol. 78, 597–613 (1997).
29. Funamizu, A., Ito, M., Doya, K., Kanzaki, R. & Takahashi, H. Uncertainty in action-value estimation affects both action choice and learning rate of the choice behaviors of rats. Eur. J. Neurosci. 35, 1180–1189 (2012).
30. Anselme, P., Robinson, M.J.F. & Berridge, K.C. Reward uncertainty enhances incentive salience attribution as sign-tracking. Behav. Brain Res. 238, 53–61 (2013).
31. Sutton, R.S. & Barto, A.G. Reinforcement Learning: an introduction (MIT Press, 1998).
32. Kakade, S. & Dayan, P. Dopamine: generalization and bonuses. Neural Netw. 15, 549–559 (2002).
33. Herrnstein, R.J. Relative and absolute strength of response as a function of frequency of reinforcement. J. Exp. Anal. Behav. 4, 267–272 (1961).
34. Ishii, S., Yoshida, W. & Yoshimoto, J. Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Netw. 15, 665–687 (2002).
35. Yeomans, J. & Baptista, M. Both nicotinic and muscarinic receptors in ventral tegmental area contribute to brain-stimulation reward. Pharmacol. Biochem. Behav. 57, 915–921 (1997).
36. Serreau, P., Chabout, J., Suarez, S.V., Naudé, J. & Granon, S. Beta2-containing neuronal nicotinic receptors as major actors in the flexible choice between conflicting motivations. Behav. Brain Res. 225, 151–159 (2011).
37. Krugel, L.K., Biele, G., Mohr, P.N., Li, S.-C. & Heekeren, H.R. Genetic variation in dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt decisions. Proc. Natl. Acad. Sci. USA 106, 17951–17956 (2009).
38. Niv, Y., Edlund, J.A., Dayan, P. & O'Doherty, J.P. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. J. Neurosci. 32, 551–562 (2012).
39. Balasubramani, P.P., Chakravarthy, V.S., Ravindran, B. & Moustafa, A.A. An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning. Front. Comput. Neurosci. 8, 47 (2014).
40. Granon, S., Faure, P. & Changeux, J.-P. Executive and social behaviors under nicotinic receptor regulation. Proc. Natl. Acad. Sci. USA 100, 9596–9601 (2003).
41. Picciotto, M.R. et al. Abnormal avoidance learning in mice lacking functional high-affinity nicotine receptor in the brain. Nature 374, 65–67 (1995).
42. Maubourguet, N., Lesne, A., Changeux, J.-P., Maskos, U. & Faure, P. Behavioral sequence analysis reveals a novel role for β2* nicotinic receptors in exploration. PLoS Comput. Biol. 4, e1000229 (2008).
43. Gordon, G., Fonio, E. & Ahissar, E. Emergent exploration via novelty management. J. Neurosci. 34, 12646–12661 (2014).
44. Payzan-LeNestour, E. & Bossaerts, P. Risk, unexpected uncertainty and estimation uncertainty: Bayesian learning in unstable settings. PLoS Comput. Biol. 7, e1001048 (2011).
45. Redgrave, P. & Gurney, K. The short-latency dopamine signal: a role in discovering novel actions? Nat. Rev. Neurosci. 7, 967–975 (2006).
46. Bromberg-Martin, E.S. & Hikosaka, O. Midbrain dopamine neurons signal preference for advance information about upcoming rewards. Neuron 63, 119–126 (2009).
47. Rice, M.E. & Cragg, S.J. Nicotine amplifies reward-related dopamine signals in striatum. Nat. Neurosci. 7, 583–584 (2004).
48. Addicott, M.A., Pearson, J.M., Wilson, J., Platt, M.L. & McClernon, F.J. Smoking and the bandit: a preliminary study of smoker and nonsmoker differences in exploratory behavior measured with a multiarmed bandit task. Exp. Clin. Psychopharmacol. 21, 66–73 (2013).
49. Galván, A. et al. Greater risk sensitivity of dorsolateral prefrontal cortex in young smokers than in nonsmokers. Psychopharmacology (Berl.) 229, 345–355 (2013).



ONLINE METHODS
Animals. 40 male C57BL/6J (WT) mice and 47 male knockout SOPF HO ACNB2 (β2KO) mice obtained
from Charles River Laboratories France were used. β2KO mice were generated as described
previously41. WT and β2KO mice are not littermates and this could be a potential caveat of
the study. However, mutant mice were generated almost 20 years ago, the line has been
backcrossed more than 20 generations with the wild-type C57BL6/J line, and the β2KO line was
confirmed to be at more than 99.99% C57BL/6J. Mice arrived at the animal facility at 8 weeks
of age, and were housed individually for at least 2 weeks before the electrode implantation.
Behavioral tasks started 1 week after implantation to ensure full recovery. Intracranial
self-stimulation (ICSS) does not require food deprivation; as a consequence all mice had ad
libitum access to food and water except during behavioral sessions. The temperature
(20–22 °C) and humidity were automatically controlled and a 12/12-h light-dark cycle (lights
on at 8:30 a.m.) was maintained in the animal facility. All experiments were performed during
the light cycle, between 09:00 a.m. and 5:00 p.m. Experiments were conducted at Université
Pierre et Marie Curie. All procedures were performed in accordance with the recommendations
for animal experiments issued by the European Commission directives 219/1990 and 220/1990,
and approved by Université Pierre et Marie Curie.

Stereotaxic injection of lentivirus. The lentiviral expression vectors β2 subunit IRES-eGFP
cDNAs and the eGFP cDNA (control) are under the control of the ubiquitous mouse
phosphoglycerate kinase (PGK) promoter. Further details can be found in ref. 13. β2KO mice
aged 8 weeks were anesthetized using isoflurane. The mouse was introduced into a stereotaxic
frame adapted for mice. Lentivirus (2 µl at 75 ng of p24 protein per µl) was injected
bilaterally at: anteroposterior = 3.4 mm, mediolateral = 0.5 mm from bregma and
dorsoventral = 4.4 mm from the surface for VTA injection. Mice were implanted with electrodes
4–5 weeks after viral injection. At the end of the behavioral experiments, lentiviral
re-expression in the VTA was verified using fluorescence immunohistochemistry. As a control
for β2VEC mice, another group of β2KO mice was injected with lentivirus expressing eGFP only.
We did not observe any difference between β2KO (without lentiviral injections, n = 6) and
β2-eGFP mice (n = 6) in either choices (P = 0.76, unpaired t test) or time-to-goal (P = 0.34,
unpaired t test). We thus pooled the data from both groups to serve as control for β2VEC data.

In vivo electrophysiological recordings. Extracellular recording electrodes were constructed
from borosilicate glass tubing (1.5 mm O.D. / 1.17 mm I.D.) using a vertical electrode puller
(Narishige). The tip was broken and electrodes were filled with a 0.5% sodium acetate
solution (wt/vol) and 1.5% neurobiotin (wt/vol), yielding impedances of 6–9 MΩ.
Animals were anesthetized with chloral hydrate (400 mg per kg of body weight,
intraperitoneal, supplemented as required to maintain optimal anesthesia throughout the
experiment), and placed in a stereotaxic apparatus (Kopf Instruments). The left saphenous
vein was catheterized for intravenous administration of nicotine and the right saphenous vein
was catheterized for intravenous administration of saline solution (NaCl 0.9%, wt/vol). The
electrophysiological activity was sampled in the central region of the VTA (coordinates:
3.1–3.5 mm posterior to Bregma, 0.3–0.6 mm lateral to midline and 4–4.7 mm below the brain
surface)50. Spontaneously active DAergic neurons were identified on the basis of previously
established electrophysiological criteria: (1) a typical triphasic action potential with a
marked negative deflection; (2) a characteristic long duration (>2.0 ms); (3) an action
potential width from start to negative trough >1.1 ms; (4) a slow firing rate (between 1 and
10 Hz) with an irregular single spiking pattern and occasional short, slow bursting
activity51. At least 5 min of spontaneous baseline electrophysiological activity was recorded
before intravenous injection of nicotine (30 µg per kg). At the end of the recording period,
the neurons were stimulated by application of positive current steps to electroporate
neurobiotin into the neurons to allow identification of DA neurons.

Analysis of electrophysiological data. DA cell firing was analyzed with respect to the
average firing rate and the percentage of spikes within bursts (%SWB, number of spikes within
bursts, divided by total number of spikes). Bursts were identified as discrete events
consisting of a sequence of spikes such that their onset was defined by two consecutive
spikes within an interval <80 ms and they terminated with an interval >160 ms (ref. 51).
Firing rate and %SWB were evaluated on successive windows of 60 s, with a 45-s overlapping
period14. For each cell, firing frequency was rescaled as a percentage of its baseline value
averaged during the 2 min before nicotine injection. The effect of nicotine was assessed as a
comparison between the maximum of variation of firing rate and %SWB observed during the first
3 min after saline and nicotine injection. The results are presented as mean ± s.e.m. of the
difference of maximal variation before and after nicotine.

Fluorescence immunohistochemistry. Following the death of all the lentivirus-injected mice
(GFP and VEC animals), brains were rapidly removed and fixed in 4% paraformaldehyde.
Following a period of at least 3 d of fixation at 4 °C, serial 60-µm sections were cut from
the midbrain with a vibratome. Immunohistochemistry was performed as follows: free-floating
VTA brain sections were incubated 1 h at 4 °C in a blocking solution of phosphate-buffered
saline (PBS) containing 3% Bovine Serum Albumin (BSA, Sigma; A4503) (vol/vol) and 0.2% Triton
X-100 (vol/vol) and then incubated overnight at 4 °C with a mouse anti-tyrosine hydroxylase
antibody (TH, Sigma, T1299) at 1:200 dilution and a rabbit anti-GFP antibody (Molecular
Probes, A-6455) at 1:5,000 dilution in PBS containing 1.5% BSA and 0.2% Triton X-100. The
following day, sections were rinsed with PBS and then incubated 3 h at 22–25 °C with
Cy3-conjugated anti-mouse and Cy2-conjugated anti-rabbit secondary antibodies (Jackson
ImmunoResearch, 715-165-150 and 711-225-152) at 1:200 dilution in a solution of 1.5% BSA in
PBS. After three rinses in PBS, slices were wet-mounted using Prolong Gold Antifade Reagent
(Invitrogen, P36930). Microscopy was carried out with a fluorescent microscope, and images
were captured using a camera and ImageJ imaging software.
In the case of electrophysiological recordings, an immunohistochemical identification of the
recorded neurons was performed as described above, with the addition of 1:200 AMCA-conjugated
Streptavidin (Jackson ImmunoResearch) in the solution. Labeling for both TH and neurobiotin
in the VTA50 confirmed the neurochemical phenotype of the recorded neurons.

Electrode implantation and ICSS training. Mice were introduced into a stereotaxic frame and
implanted unilaterally with bipolar stimulating electrodes for ICSS in the medial forebrain
bundle27,28 (MFB, anteroposterior = 1.4 mm, mediolateral = 1.2 mm, from the bregma, and
dorsoventral = 4.8 mm from the dura). After recovery from surgery (1 week), the efficacy of
electrical stimulation was verified in an open field with an explicit square location
(side = 1 cm) at its center. Each time a mouse was detected in the area (D = 3 cm) of the
location, a 200-ms train of 20 0.5-ms biphasic square waves pulsed at 100 Hz was generated by
a stimulator28. Mice self-stimulating at least 50 times in a 5-min session were kept for the
behavioral sessions (3 mice were excluded at this stage, due to improper electrode
implantation). In the certain setting (see below), ICSS intensity was adjusted so that mice
self-stimulated between 50 and 150 times per session at the end of the training (ninth and
tenth session). Current intensity was subsequently maintained the same throughout the
uncertainty setting.

Behavioral data acquisition. Decision-making and locomotor activity were recorded in a 1-m
diameter circular open-field. Experiments were performed using a video camera, connected to a
video-track system, out of sight of the experimenter. Home-made software (Labview, National
Instruments) tracked the animal, recorded its trajectory (20 frames per s) for 5 min and sent
TTL pulses to the ICSS stimulator when appropriate (see below).

Markovian decision problem by ICSS conditioning. We considered two complementary aspects of
motivation: direction and locomotion of the mice. We thus developed a protocol that
simultaneously records the sequential choices between differently rewarding locations (that
is, associated with intracranial self-stimulation) and the locomotor activity of the mice in
between these locations. After validation of ICSS behavior27, conditioning tasks took place
in the 0.8-m diameter circular open-field. Three explicit square locations were placed in the
open field, forming an equilateral triangle (side = 50 cm). Each time a mouse was detected in
the area of one of the rewarding locations, a stimulation train was delivered. Animals
received stimulations only when they alternated between rewarding locations. In separate
experiments, the intensity or the probability of stimulation delivery differed for the three
rewarding locations. Precise parameters (for example, reward probabilities) were
pseudo-randomly assigned to each

doi:10.1038/nn.4223 nature NEUROSCIENCE


rewarding location for each mouse. For each set of (consecutive) experiments, conditioning
consisted in one daily session of 5 min, during 10 d. Decision-making was analyzed by
expressing data as a series of choices between rewarding locations (labeled A, B, C). We only
considered choices made in an interval of 10 s after visiting the previous rewarding
location. This restriction is based on the observation that choices made after 10 s were
random (that is, uniformly distributed) for every condition, and thus probably reflect a
disengagement from the task. This led to the exclusion of fewer than 3% of the total choices
made by the animals (all groups), which suggests that incorporating these late choices would
not significantly change the results. This game implements a Markovian Decision Process
(MDP31) consisting of three states (A, B, C), corresponding to each rewarding location, and a
transition function, corresponding to the proportions of choices in the three gambles. The
repartition is defined as the proportion of states visited by the animal during a session.
The transition matrix describes the proportion of transitions from one state to another.
Because animals receive stimulations only when they alternate between rewarding locations,
there is no repetition of states in the sequence and the 3 × 3 transition matrix has null
diagonal elements. The training consisted of a block (10 daily sessions of 5 min) in a CS
where all locations were associated with an ICSS delivery. The test phase consisted of a
block (10 daily sessions of 5 min) assessing choice organization under a US, by associating
each location with a different probability of ICSS, a validated protocol for studying risky
choices52. The foraging phase was performed after the uncertain setting, and five
supplementary sessions of deterministic setting. The foraging phase assessed the exploratory
strategy in a dynamic setting (DS), which consisted in three consecutive 5-min sessions. In
each session, two out of three locations delivered the ICSS reward, and the identity of the
two rewarding locations changed every session.

the two remaining locations. Accordingly, we modeled decisions between two alternatives. We
considered five choice rules31: local matching law33 (Herrnstein), softmax, epsilon-greedy,
uncertainty bonus19,21,39 and uncertainty-controlled randomness34.

In the local matching law, the probability to choose an action i (amongst two rewarding
locations) is given by

$$P_i = \frac{V_i}{\sum_j V_j} \qquad (1)$$

where V_i is defined as the value of an option, that is, the expected reward (see below).
The epsilon-greedy choice rule is

$$P_i = \begin{cases} 1 - \varepsilon & \text{if } i = \arg\max_j V_j \\ \varepsilon & \text{otherwise} \end{cases} \qquad (2)$$

where ε is the probability of choosing less valuable options, reflecting undirected
exploration.
The softmax choice rule is

$$P_i = \frac{\exp(\beta V_i)}{\sum_j \exp(\beta V_j)} \qquad (3)$$

where β is an inverse temperature parameter reflecting the sensitivity of choice to the
difference between decision variables.
In standard reinforcement learning31, the value of an option is the expected
Analysis of locomotion. Locomotor activity toward the rewarding locations (average) reward. In the US, where the choices are at steady-state, the expected
was measured in terms of time-to-goal, speed profile, dwell time and traveled reward is taken as the reward probability
distance. Time-to-goal measures the duration between one choice and the next
Vi = Ei (ICSS) = pi (ICSS) (4)
one. The speed profile corresponds to the instantaneous speed as a function of
time (expressing it as a function of the distance between two locations did not give In models embedding an exploration bonus, the value depends on both
any additional information). We averaged the speed profiles on a 10-s interval expected reward and uncertainty16,17,29. Uncertainty may refer to estimation
(the same used for restricting the choices considered in the analysis), which was uncertainty (due to incomplete knowledge or sampling of the outcome), to
zero-padded if the reward location was attained before 10 s. The dwell time is the expected uncertainty (or reward risk), related to the estimated variability
defined as the duration between the moment of the detection in the last rewarding of the outcome, or to unexpected uncertainty, that is, uncertainty greater than
location and the moment when the animals speed is greater than 10 cm s1. The expected22,23,44. The expected uncertainty scheme is similar to the mean-
traveled distance corresponds to the summation of the local distances between variance approach used in neuroeconomic studies53 and it has also been proposed
two points of the mouses trajectory (20 frames per s) between the last and the to drive exploration19,21,24,25,30. In the US, as mice are trained and choice behavior
next choice. A multiple linear regression was performed on the time-to-goal, in is at steady-state, we used this version of the model, where the decision variable
the different sets of probabilities of the US setting. We compared models with is a compound of the true (that is, not estimated by a learning algorithm) mean
npg

increasing number of explanatory variables. As potential explanatory variables, and variance of the payoff
we included reward history (whether the animal just got rewarded or not, as a
binary variable), the expected reward of the goal, the expected uncertainty of the Vi = Ei (ICSS) + js 2i (ICSS) = pi ( ICSS) + j pi (ICSS)(1 pi ( ICSS)) (5)
goal, the expected reward of the alternative (that is, the location not chosen in
the gamble), and the expected uncertainty of the alternative. We compared these This compound value is then nested in the softmax choice rule. Note
linear models based on their summed squared errors, penalized for complexity that expected uncertainty (i2) can also be estimated through learning
(Bayesian information criterion): BIC(TTG) = n ln(SSE ) + k ln(n), where n is (see equation (10)).
n
the number of observations (time-to-goal, n is the same for all regressions), k the Finally, in the uncertainty-based temperature model (or local control of ran-
number of explanatory variables, and SSE the summed squared errors from the domness34), uncertainty associated with all the possible actions at a state controls
multiple linear regressions. Constant terms were omitted from the formula for the randomness of choices (that is, the temperature parameter). In this strategy,
simplicity, as the BICs of the linear regressions were only used for comparisons. the randomness of action selection does not depend on the variability of the
possible outcomes. In the softmax model (equation (3)), in case where different
Computational models of decision-making. In the US, we investigated how well choices may yield comparable outcomes, the decision process is random even
the transition function (that is, choices) from both genotypes can fit to variants with large ; while a large difference in values results in greedy action selection
of decision-making models. At the end of the US, since mice are trained and even for small . To circumvent this issue, it is possible to normalize the tem-
choice behavior is at steady-state, we only modeled decision-making, and used perature parameter i for each state i.
the settings of the task (that is, reward probabilities) as fixed parameters for the
b0
values of the options (see below). In the DS and in the learning phase of the US, we bi = (6)
modeled both learning (see below) and decision-making, and we evaluated how E(V j 2 ) E(V j )2
well the models fits the animals choices, which were not at steady-state. These
2 2
models are thus based on the estimation of the expected payoffs (value) and where 0 is a constant (free) parameter, whereas E(V j ) E(V j ) represents the
uncertainties of the options, rather than on objective parameters of the task. uncertainty (or variability) of the state i (over all the possible actions j) rather
Decision-making models determined the probability Pi of choosing the than reward uncertainty associated with a particular action.
next state i, as a function (the choice rule) of a decision variable. Because mice Reinforcement learning models determined the evolution of the decision vari-
could not return to the same rewarding location, they had to choose between ables, which are in this case estimations of the task parameters. The values of



the rewarding locations were estimated using standard reinforcement models31, which are based on trial-and-error learning. First, the model computes the discrepancy between the predicted value of the chosen location ($V_i$) and the actual reward R at the trial t

$$\delta_{i,t} = R_{i,t} - V_{i,t-1} \tag{7}$$

where $R_{i,t}$ = 1 or 0 depending on whether the animal was rewarded or not. This reward prediction error is then used to adapt the estimation of the value $V_i$ of the chosen location only, that is, the values of the other locations are not changed

$$V_{i,t} = V_{i,t-1} + \alpha\,\delta_{i,t} \tag{8}$$

where $\alpha$ is the learning rate. To test whether nicotinic receptors differentially affected the sensitivity to reward and reward omission, we used an asymmetric version of reinforcement learning38

$$V_{i,t} = \begin{cases} V_{i,t-1} + \alpha^{+}\,\delta_{i,t} & \delta_{i,t} > 0 \\ V_{i,t-1} + \alpha^{-}\,\delta_{i,t} & \delta_{i,t} < 0 \end{cases} \tag{9}$$

where $\alpha^{+}$ and $\alpha^{-}$ are the learning rates for better- or worse-than-expected outcomes.

We also used an extended version of the reinforcement-learning model23,39 to evaluate the expected uncertainty of the rewarding locations. The rationale behind this model is that uncertain and unpredictable outcomes produce large prediction errors (positive and negative), by definition. Hence squared prediction errors (equation (7)) can be used to estimate unpredictability or uncertainty $\sigma_{i,t}^2$

$$\sigma_{i,t}^2 = \sigma_{i,t-1}^2 + \alpha_\varphi\,\xi_{i,t} \tag{10}$$

where $\alpha_\varphi$ is the learning rate for uncertainty, and $\xi_{i,t}$ is the uncertainty (or risk) prediction error of the option i at trial t, that is,

$$\xi_{i,t} = \delta_{i,t}^2 - \sigma_{i,t-1}^2 \tag{11}$$

The uncertainty prediction error corresponds to unexpected uncertainty (uncertainty larger than expected) and we tested whether exploration might be directed by this unexpected form of uncertainty, by assigning a bonus to this error term

$$V_{i,t}^{*} = V_{i,t} + \xi_{i,t} \tag{12}$$

Finally, uncertainty may exert an indirect effect through learning. It has been shown in humans that the learning rate itself can increase with sudden changes in uncertainty37,54. We tested the following adaptive learning rate model37, where the learning rate increases when there is a recent increase m in absolute prediction errors

$$\alpha_t = \begin{cases} \alpha_{t-1} + f(m_t)\,(1 - \alpha_{t-1}) & m_t > 0 \\ \alpha_{t-1} + f(m_t)\,\alpha_{t-1} & m_t < 0 \end{cases} \tag{13}$$

where f(m) is a double sigmoid function, $f(m_t) = \operatorname{sign}(m_t)\,\bigl(1 - e^{-(m_t/\lambda)^2}\bigr)$, and m is the slope of the (recent) smoothed absolute reward prediction errors, $m_t = 2\,\dfrac{\delta_t^{abs} - \delta_{t-1}^{abs}}{\delta_t^{abs} + \delta_{t-1}^{abs}}$. Smoothing of absolute prediction errors is achieved by $\delta_t^{abs} = \delta_{t-1}^{abs}\,(1 - \alpha_1) + |\delta_t|\,\alpha_1$. The free parameter $\lambda$ determines the degree to which uncertainty (absolute prediction errors) affects the learning rate, and the other free parameter, $\alpha_1$, determines the initial learning rate and the speed of $\delta_t^{abs}$ updating.

In the US, at steady-state, we fitted the free parameters of the four decision-making models (none for the matching law, $\varepsilon$ for $\varepsilon$-greedy, $\beta$ for softmax, and $\beta$ and $\varphi$ for the uncertainty model). In the learning phase of the US, we fitted the free parameters of these 4 models: standard RL ($\beta$, $\alpha$), RL with uncertainty learning and expected uncertainty bonus ($\beta$, $\alpha$, $\alpha_\varphi$, $\varphi$), RL with adaptive (uncertainty-dependent) learning rate ($\beta$, $\alpha_1$, $\lambda$), and RL with uncertainty learning and unexpected uncertainty bonus ($\beta$, $\alpha$, $\alpha_\varphi$, $\varphi$). We fixed the initial conditions (V(0) = 1, and $\sigma$(0) = 0), because the mice underwent the certain setting just beforehand.

In the DS, we fitted the free parameters and initial conditions of these 7 models: standard RL ($\beta$, $\alpha$, V(0)), asymmetric learning rates RL ($\beta$, $\alpha^{+}$, $\alpha^{-}$, V(0)), RL with uncertainty bonus ($\beta$, $\alpha$, $\varphi$, V(0), $\sigma$(0)), RL with separate learning for value and uncertainty ($\beta$, $\alpha$, $\alpha_\varphi$, $\varphi$, V(0), $\sigma$(0)), RL with asymmetric learning rates for value and separate uncertainty learning ($\beta$, $\alpha^{+}$, $\alpha^{-}$, $\alpha_\varphi$, $\varphi$, V(0), $\sigma$(0)), RL with uncertainty learning and unexpected uncertainty bonus ($\beta$, $\alpha$, $\alpha_\varphi$, $\varphi$, V(0), $\sigma$(0)), RL with adaptive (uncertainty-dependent) learning rate ($\beta$, $\alpha_1$, $\lambda$, V(0)).

In each case, we searched for the free parameters maximizing the respective likelihood of the observed choices c at all trials t, $\prod_t P_{c,t}$. We performed the fits of all the parameters individually for each animal a, using the population fit (that is, fit of the average probabilities of choices) as initial conditions. We checked that the mean of individual fits stayed close to the population fit, and that the optimum was non-local (by examining the Hessian matrix55). We used the fmincon function in Matlab to perform the fits, with the constraints that learning rates and temperature could not be negative and that learning rates could not exceed 1. To assess goodness-of-fit, we report negative log likelihoods penalized for model complexity (Bayesian information criterion; BIC). Smaller BIC values indicate a better fit. Each of these models has been found to fit experimental data in at least one given experimental condition (for example, behavioral task or species16,17,29,38,39). Here, we aimed at accounting for the difference observed between genotypes, to propose a computational role for the nicotinic modulation of the VTA. Hence, once the best model is determined, possible differences in the free parameters (for example, $\alpha$, $\beta$, $\varphi$) between genotypes or conditions point at the computational role of the β2 subunit-containing nAChRs expressed in the VTA in decision-making processes.

Extension of the uncertainty model to previous experiments on β2KO mice. We also aimed at extending our framework by modeling the results from previous studies focusing on the behavioral differences between WT and β2KO mice with reinforcement learning models embedding an uncertainty-based exploratory bonus (equations (5), (7), (8), (10) and (11)). In these experiments, uncertainty was not explicitly controlled but was nonetheless present, as in most decision tasks. We thus used the main difference found in the model-based analysis of our decision task, that is, a positive value given to uncertainty in WT, but not in β2KO, mice, and explored the values for uncertainty estimation to qualitatively match the data. All experiments were modeled as MDPs with a discretization of the relevant states for the animals.

In the open-field experiment13,42, we used the symbolic decomposition of the behavior proposed in ref. 42, by splitting the locomotion of the mice into active versus inactive states, and their positions into center and periphery states. The active state corresponds to high-speed navigation, while the inactive state corresponds to low-speed exploration, mainly composed of rearing, scanning and sniffing behaviors42,43. This double dichotomy gives rise to four states, which we modeled as an MDP with all transitions possible, except for the stay transitions (that is, of one state on itself) and the transitions between periphery-inactive (PI) and center-inactive (CI) states, which were not found in the data13,42. The duration of one state was 1 s. We modeled the difference between WT and β2KO mice by adding an exploratory bonus to the inactive states in WT mice only, which we deduced from the experimental (average) transition probability and the softmax decision rule with bonus as follows. In the center-active state, the probability of going to the center-inactive state is given by $P(CI \mid CA) = e^{\beta V_{CI}} / (e^{\beta V_{CI}} + e^{\beta V_{PA}})$, where $V_{CI}$ and $V_{PA}$ represent the values associated with the center-inactive and the periphery-active states, so we computed the relations between $V_{PA}$ and $V_{CI}$, and between $V_{PI}$ and $V_{CA}$, and fitted $\beta$, $V_{PI}$ and $V_{CI}$ to reproduce the data.

In the object recognition task40, two objects are placed in an open-field, and the time spent in the objects area is measured as a function of the behavioral sessions. We modeled this task as an MDP using a discretization of space, consisting of 25 states corresponding to the open-field without objects, and two states corresponding to the objects. The duration of one state was 1 s. We used the uncertainty model (no reward being present in the task, we modeled the uncertainties but not the values) and we fitted the values of $\beta$, $\alpha_\varphi$, $\varphi$, and the initial uncertainties of the objects and of the open-field to reproduce the data.

In the spatial maze40, we modeled an idealized version of this conditioning task, consisting of four states, corresponding to the arms of the maze. One of them delivered a reachable food reward (R = 1 if reached), and was absorbing:



the simulation stopped if the agent (the modeled mouse) reached it. The duration of one state (the mean duration of visiting one arm) was 10 s. We used the uncertainty model with a single learning rate ($\beta$, $\alpha$, $\varphi$ and the initial conditions) for simplicity. We simulated the model until the food was reached, and measured the time to reach the food, as done in the experiment.

In the passive avoidance task41, animals are in a box divided in two (light and dark) compartments. The learning phase (which was not modeled) consists of a single foot shock given in the dark compartment, which arguably induces a negative prediction error for this state. We simulated this experiment by considering a sequential evaluation model representing incentive motivation56, in which the agent sequentially evaluates the probability to go to the dark compartment until it decides to accept it. The probability to go to the dark part of the box at any time is given by

$$P(D) = \frac{1}{1 + e^{-\beta (V_D - \theta)}} \tag{14}$$

where $\beta$ is the inverse temperature (sensitivity to value) and $\theta$ a threshold representing the basal locomotor activity of the animal. In this model, the agent evaluates the probability of going to the dark part, based on its single experience of a foot shock, which induced a single, negative, reward prediction error (equation (7)), resulting both in a decrease in value (equation (8)) and an increased uncertainty (equation (10)). The time-step for each evaluation was 1 s. We measured the time before the agents go to the dark part, as done in the experiment41. For each model experiment, standard errors were obtained following a bootstrap procedure, using the sample size of the original data.

Statistical analysis. No statistical methods were used to predetermine sample sizes. Our sample sizes are comparable to many studies using similar techniques and animal models. We used a pseudo-randomization procedure, in the sense that in the behavioral experiments, precise parameters (for example, reward probabilities) were pseudo-randomly assigned to each rewarding location for each mouse. The experiments were blind, in the sense that the experimenters (both in behavioral and electrophysiological experiments) were not aware of which genotype each mouse belonged to.

Behavioral and model data were analyzed and fitted using Matlab (The MathWorks). Electrophysiological data were analyzed using R (The R Project). Code is available on request. Data are plotted as mean ± s.e.m. Total number (n) of observations in each group and statistics used are indicated in figure captions. Classically, comparisons between means were performed using parametric tests (Student's t test for two groups, or ANOVA for comparing more than two groups) when parameters followed a normal distribution (Shapiro test P > 0.05), and non-parametric tests (here, Wilcoxon or Mann-Whitney) when this was not the case. Homogeneity of variances was tested preliminarily and the t tests were Welch-corrected if needed. Multiple comparisons were Bonferroni corrected. All statistical tests were two-sided. P > 0.05 was considered to be not statistically significant.

A Supplementary Methods Checklist is available.

50. Paxinos, G. & Franklin, K.B. The Mouse Brain in Stereotaxic Coordinates (Gulf Professional Publishing, 2004).
51. Grace, A.A. & Bunney, B.S. Intracellular and extracellular electrophysiology of nigral dopaminergic neurons--1. Identification and characterization. Neuroscience 10, 301–315 (1983).
52. Rokosik, S.L. & Napier, T.C. Intracranial self-stimulation as a positive reinforcer to study impulsivity in a probability discounting paradigm. J. Neurosci. Methods 198, 260–269 (2011).
53. D'Acremont, M. & Bossaerts, P. Neurobiological studies of risk assessment: a comparison of expected utility and mean-variance approaches. Cogn. Affect. Behav. Neurosci. 8, 363–374 (2008).
54. Behrens, T.E.J., Woolrich, M.W., Walton, M.E. & Rushworth, M.F.S. Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007).
55. Daw, N.D. Trial-by-trial data analysis using computational models. in Decision Making, Affect, and Learning: Attention and Performance XXIII (eds. Delgado, M.R., Phelps, E.A. & Robbins, T.W.) 3–38 (2011).
56. McClure, S.M., Daw, N.D. & Montague, P.R. A computational substrate for incentive salience. Trends Neurosci. 26, 423–428 (2003).
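As an illustration of the choice rules in equations (1) to (3) and of the compound value in equation (5), the following minimal Python sketch may help. It is not the authors' Matlab code: the function names, the two reward probabilities and the values chosen for ε, β and φ are assumptions made only for this example.

```python
import math


def matching(values):
    """Local matching law, equation (1): P_i = V_i / sum_j V_j."""
    total = sum(values)
    return [v / total for v in values]


def epsilon_greedy(values, epsilon):
    """Equation (2) for two alternatives: 1 - epsilon for the best option, epsilon otherwise."""
    best = max(range(len(values)), key=lambda i: values[i])
    return [1.0 - epsilon if i == best else epsilon for i in range(len(values))]


def softmax(values, beta):
    """Equation (3); the maximum is subtracted only for numerical stability."""
    top = max(values)
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def uncertainty_bonus_values(probs, phi):
    """Equation (5): V_i = p_i + phi * p_i * (1 - p_i), for binary (Bernoulli) rewards."""
    return [p + phi * p * (1.0 - p) for p in probs]


# Hypothetical gamble between a location rewarded half of the time and a certain one.
probs = [0.5, 1.0]
print(matching(probs))
print(epsilon_greedy(probs, epsilon=0.1))
print(softmax(probs, beta=3.0))
# With phi > 0 the uncertain option gains value, so softmax picks it more often.
print(softmax(uncertainty_bonus_values(probs, phi=1.0), beta=3.0))
```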
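The learning rules of equations (7), (8), (10) and (11), combined with a softmax over value plus uncertainty bonus, can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the trial format `(previous, choice, reward)`, the function names and the parameter values are assumptions, the initial conditions follow the V(0) = 1 and zero initial uncertainty stated for the learning phase of the US, and the BIC uses the standard form 2 × (negative log likelihood) + k ln(n). The sketch evaluates the likelihood for given parameters and performs no optimization.

```python
import math


def update(V, S, choice, reward, alpha, alpha_phi):
    """Update the chosen option only; other options are left unchanged."""
    delta = reward - V[choice]        # reward prediction error, eq. (7)
    xi = delta ** 2 - S[choice]       # uncertainty prediction error, eq. (11)
    V[choice] += alpha * delta        # value update, eq. (8)
    S[choice] += alpha_phi * xi       # uncertainty (sigma^2) update, eq. (10)
    return delta, xi


def neg_log_likelihood(params, trials, n_options=3, v0=1.0, s0=0.0):
    """Negative log likelihood of the observed choices under softmax(V + phi * sigma^2)."""
    beta, alpha, alpha_phi, phi = params
    V = [v0] * n_options
    S = [s0] * n_options
    nll = 0.0
    for previous, choice, reward in trials:
        # The animal cannot return to the location it just visited.
        options = [i for i in range(n_options) if i != previous]
        scores = {i: beta * (V[i] + phi * S[i]) for i in options}
        top = max(scores.values())
        log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
        nll -= scores[choice] - log_z
        update(V, S, choice, reward, alpha, alpha_phi)
    return nll


def bic(nll, n_params, n_trials):
    """Standard Bayesian information criterion; smaller is better."""
    return 2.0 * nll + n_params * math.log(n_trials)


# Hypothetical sequence of three trials: (previous location, chosen location, reward).
trials = [(0, 1, 1), (1, 2, 0), (2, 0, 1)]
nll = neg_log_likelihood((3.0, 0.3, 0.3, 0.5), trials)
print(nll, bic(nll, n_params=4, n_trials=len(trials)))
```

In practice the parameters would be found with a bounded optimizer (the text reports Matlab's fmincon), keeping learning rates between 0 and 1 and the inverse temperature non-negative.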

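For the open-field model, the relation between two state values can be read off the two-option softmax by rearranging it into log-odds form. The following worked step is an illustration, not the authors' derivation; it assumes the two-term denominator given in the text for the center-active state and treats $P(CI \mid CA)$ as the measured average transition probability.

```latex
P(CI \mid CA)
  = \frac{e^{\beta V_{CI}}}{e^{\beta V_{CI}} + e^{\beta V_{PA}}}
  = \frac{1}{1 + e^{-\beta\,(V_{CI} - V_{PA})}}
\quad\Longrightarrow\quad
V_{CI} - V_{PA} = \frac{1}{\beta}\,
  \ln\!\frac{P(CI \mid CA)}{1 - P(CI \mid CA)}
```

Only the difference between the two values is constrained by the observed transition probability, and it scales inversely with $\beta$.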
