Cholinergic neurotransmission affects decision-making, notably through the modulation of perceptual processing in the cortex.
In addition, acetylcholine acts on value-based decisions through as yet unknown mechanisms. We found that nicotinic acetylcholine
receptors (nAChRs) expressed in the ventral tegmental area (VTA) are involved in the translation of expected uncertainty into
motivational value. We developed a multi-armed bandit task for mice with three locations, each associated with a different reward
2016 Nature America, Inc. All rights reserved.
probability. We found that mice lacking the nAChR β2 subunit showed less uncertainty-seeking than their wild-type counterparts.
Using model-based analysis, we found that reward uncertainty motivated wild-type mice, but not mice lacking the nAChR β2 subunit.
Selective re-expression of the β2 subunit in the VTA was sufficient to restore spontaneous bursting activity in dopamine neurons and
uncertainty-seeking. Our results reveal an unanticipated role for subcortical nAChRs in motivation induced by expected uncertainty
and provide a parsimonious account for a wealth of behaviors related to nAChRs in the VTA expressing the β2 subunit.
Acetylcholine (ACh) has a well-studied role in arousal, learning and attention1,2 and modulates perceptual decision-making, notably through its influence over prefrontal cortices3. Decisions are not only driven by sensory information, but also by the animal's expectation of the values associated with alternative choices4,5. ACh also affects cost-benefit decision-making6,7, albeit through unknown mechanisms. Notably, effects on value-based decisions induced by pharmacological manipulations of ACh or dopamine (DA) often mirror each other5. Systemic pharmacological manipulation of either DA or ACh receptors affects the choices between alternatives associated with different delays, costs or risk5–7. Disentangling the respective implications of ACh and DA in decision-making is of [...]

[...] source16,17. Exploration occurs when an animal actively gathers information about alternative choices with the aim of reducing the uncertainty level on the consequences of possible actions18–21. This typically happens in a learning setting when the statistics of an outcome given a specific action, or its uncertainty, are in the process of being estimated. Once the consequences of possible actions have been estimated, the animal can use this knowledge of the environment to exploit reward sources efficiently. However, when the outcome of an action is probabilistic, uncertainty remains as to what will be the outcome of an action every time it is performed. This known variability of the outcome of an action, as in a repeated lottery, is referred to as expected uncertainty [...]
1 Sorbonne Universités, UPMC University Paris 06, Institut de Biologie Paris Seine, UM 119, Paris, France. 2 CNRS, UMR 8246, Neuroscience Paris Seine, Paris, France. 3 INSERM, U1130, Neuroscience Paris Seine, Paris, France. 4 Institut Pasteur, CNRS UMR 3571, Unité NISC, Paris, France. Correspondence should be addressed to P.F. (phfaure@gmail.com).
Received 9 October 2015; accepted 9 December 2015; published online 18 January 2016; doi:10.1038/nn.4223
Figure 1 Decisions under uncertainty in a mouse bandit task using intracranial self-stimulations. (a) Illustration of the spatial multi-armed bandit task design. Three explicit square locations were placed in the open field (0.8-m diameter), forming an equilateral triangle (50-cm side). Mice received an intracranial self-stimulation each time they were detected in the area of one of the rewarding locations. Animals, which could not receive two consecutive stimulations at the same location, alternated between rewarding locations. (b) Trajectories of one mouse (5 min) before (left) and after (middle) learning in the CS and US (right). (c) Time to goal (average duration from the last location to the goal) in the US as a function of the reward probability of the goal. Inset, times to goal were identical for the three locations in the CS (F(2,18) = 0.53, P = 0.59, one-way ANOVA). Insert, individual curves. N = 19 mice. (d) Traveled distance between two consecutive locations. In the US, WT mice traveled more distance when going toward less probable ICSS reward. Light gray, individual curves. (e) Instantaneous speed: in the US, the maximal speed [...]
[Figure 1, panels e and f: instantaneous speed (cm s−1) as a function of time from ICSS (s); repartition (%) as a function of ICSS probability.]

RESULTS
Mice-adapted multi-armed bandit task based on ICSS
In uncertain environments, living beings have to decide when to exploit known resources and when to explore alternatives. This exploitation-exploration dilemma is often studied in the multi-armed bandit task16,18, in which humans choose between different slot machines to discover the richest option. To assess the implication of nAChRs in decision-making under uncertainty, we designed a spatial version of the bandit task adapted to mice. Studies of animal choices often rely on food restriction, even though the satiation level is known to affect decisions under uncertainty26.
[...] between rewarding locations by performing a sequence of choices. Mice mostly went directly to the next rewarding location, but sometimes wandered around in the open field before reaching the goal (Fig. 1b). At each location, mice had to choose which next rewarding location to go to (among the two alternatives) and how directly they should get there.

We compared the behavior of wild-type (WT) mice under two settings of ICSS delivery: a certain setting (CS) in which all locations were associated with a given ICSS, and an uncertain setting (US), in which each location was associated with a different probability of ICSS delivery (Fig. 1a). Although trajectories in the CS were stereotyped, reward uncertainty induced a markedly different behavioral pattern in the US (Fig. 1b). The time to goal was identical for the three locations in the CS (Fig. 1c), but was greater for locations associated with lower reward probabilities in the US (F(2,18) = 6.8, P = 0.002, one-way ANOVA; Fig. 1c). More precisely, the reward probability of the goal affected the traveled distance (F(2,18) = 7.3, P = 0.002; Fig. 1d), but not the traveling speed (F(2,18) = 0.48, P = 0.62; Fig. 1e) or the dwell times (Supplementary Fig. 1a). This contrasts with the effect of reward intensity, which affected the speed profiles between two rewarding locations (Supplementary Fig. 1b). Thus, in this setup, reward intensity affected the invigoration of goal-directed movements, whereas [...]

[...] the reward probability associated with each location. As expected, in the CS, mice treated each rewarding location the same way (Fig. 1f). In the US, however, mice visited the locations associated with higher ICSS probability more often (F(2,18) = 113, P < 0.001, one-way ANOVA; Fig. 1f). Because mice could not receive two consecutive ICSSs, the repartition on the rewarding locations (Fig. 1f) arose from a sequence of binary choices in three gambles (G1, G2, G3) between two respective payoffs (here, G1 = {100 versus 50%}, G2 = {50 versus 25%}, G3 = {100 versus 25%}; Fig. 2a,b). For each gamble, mice chose the optimal location (associated with the highest probability of reward; Fig. 2b) more than 50% of the time, but less than 100% of the time. When they had to choose between a certain (100%) and an uncertain (50%) ICSS, mice displayed a low preference (56%) for the optimal location, suggesting a positive inclination toward reward uncertainty (Fig. 2a)29,30.

A positive motivational value to expected uncertainty
In standard rodent decision tasks in which there is only a single choice, the relative influence of expected value and uncertainty on choices is difficult to dissect, as both parameters vary with reward probability. For binary outcomes (the choice is rewarded or not), the expected mean reward corresponds to the reward probability p,
Figure 2 [...] (0.5 in G1, 0.25 in G2, 0.75 in G3) and expected uncertainties (0.25 in G1, 0.0625 in G2, 0.1875 in G3) are distinct. Bottom, model of locomotion. The time to goal depends on both reward history (whether the mouse received a reward in the previous [...] Panel labels: expected reward and expected uncertainty for gambles G1 (100% vs 50%), G2 (50% vs 25%) and G3 (100% vs 25%); (c) time-to-goal model.
whereas expected uncertainty is related to reward variance, p(1 − p) (Fig. 2a). Expected uncertainty is zero for predictable outcomes (100% or 0% probability) and maximal at 50% probability (the most unpredictable outcome). In our setup, the difference in expected uncertainty and value between the outcomes was distinct for each of the three gambles (Fig. 2a), which provides enough constraints to differentiate between the influence of two co-varying parameters (reward mean and variance). We compared computational models of decision-making16,31 (Online Methods), each representing a different influence of expected reward and uncertainty on choices, to assess which model best explained the experimental data (Supplementary Fig. 2). In the epsilon-greedy model, animals always choose the best option, minus a fixed probability. In this model, the choices for the optimal reward are identical whatever the gamble is (Fig. 2b), which did not correspond to the experimental data. In the softmax model, choices depend on the difference between the expected rewards of the two alternatives. The softmax model formalizes that the larger the difference in rewards is, the higher the probability to select the best option will be. This model predicts that the proportions of optimal choices would be sorted in the following order {G2 < G1 < G3}, differing from what was found experimentally {G1 < G2 < G3}. Finally, in the uncertainty model, decision is biased toward actions with the most uncertain consequences by assigning a bonus value32 to their expected uncertainties19,21,24,29. This last model accurately reproduced the pattern of mice preferences (Fig. 2b) and best accounted for our experimental data (Supplementary Fig. 2), as shown by model comparison (likelihood penalized for the number of parameters, Bayesian Information Criterion (BIC); Online Methods). Furthermore, the two parameters of the uncertainty bonus model disentangle two determinants of decision-making: the inverse temperature parameter depicts the randomness in choices, whereas the uncertainty-seeking parameter represents the value given to expected uncertainty. The positive uncertainty bonus (1.01 ± 0.24, mean ± s.e.m.) explains the great attractiveness of the 50% choice in G1 by a powerful motivation induced by its expected uncertainty. We assessed the robustness of the data and of the model by fitting four sets of probabilities, with multiple different differences of expected reward and uncertainties (Supplementary Fig. 3), and compared alternative models (matching law33 and uncertainty-normalized temperature34; Supplementary Fig. 2). Overall, we found that expected uncertainty positively biased the choices in WT mice.

As stated above, two types of decisions are nested in the task: the sequence of choices (which goal?) and the locomotion (how to reach the goal?). To investigate the influence of uncertainty on the latter, we performed multiple linear regressions of time to goal. Comparison of linear models (BIC; Online Methods and Supplementary Fig. 2) revealed that the time to goal depended on the reward probability of the goal, but not on the alternative (the location not chosen in the gamble). These observations suggest a dual-stage process in which animals first choose which location to go to and then how to reach it. Furthermore, the dependence on reward history (TR = 0.49 ± 0.21, mean ± s.e.m.) suggests that when mice had just gotten rewarded, they traveled further in the open field (Fig. 2c). We also found that the time to goal was decreased by the expected reward (TE = 1.63 ± 0.16; Fig. 2c) and by the expected uncertainty (T = 1.56 ± 0.33). This suggests that expected uncertainty increased motivation to go straight toward the rewarding goal. Thus, model-based analyses suggest that, in the two decision problems (which location and how to get there), mice assign a positive motivational value to the expected uncertainty of the goal.
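The three choice models compared above can be illustrated with a minimal Python sketch. It is not the authors' fitting code. It assumes that an option's value is its expected reward p plus a bonus proportional to the expected uncertainty p(1 − p), and that choices between the two options of a gamble follow a two-option softmax. The names `beta` (inverse temperature) and `phi` (uncertainty bonus) and the values used for them are illustrative assumptions, not fitted parameters.

```python
import math

# Reward probabilities of the two options in each gamble (from the text).
GAMBLES = {"G1": (1.0, 0.5), "G2": (0.5, 0.25), "G3": (1.0, 0.25)}


def expected_uncertainty(p):
    """Variance of a binary (rewarded or not) outcome with probability p."""
    return p * (1.0 - p)


def p_optimal_epsilon_greedy(epsilon):
    """Best option chosen except with a fixed probability epsilon (same for all gambles)."""
    return 1.0 - epsilon


def p_optimal_softmax(p_best, p_other, beta, phi=0.0):
    """Probability of choosing the higher-probability option.

    Assumed value function: expected reward + phi * expected uncertainty.
    phi = 0 gives the plain softmax model; phi > 0 adds an uncertainty bonus.
    """
    v_best = p_best + phi * expected_uncertainty(p_best)
    v_other = p_other + phi * expected_uncertainty(p_other)
    return 1.0 / (1.0 + math.exp(-beta * (v_best - v_other)))


def predictions(beta, phi):
    """Predicted proportion of optimal choices in each gamble."""
    return {g: p_optimal_softmax(hi, lo, beta, phi) for g, (hi, lo) in GAMBLES.items()}


softmax_pred = predictions(beta=1.0, phi=0.0)  # beta = 1 is an arbitrary choice
bonus_pred = predictions(beta=1.0, phi=1.0)    # phi = 1 is an illustrative value

for name, pred in (("softmax", softmax_pred), ("uncertainty bonus", bonus_pred)):
    print(name, {g: round(100 * v, 1) for g, v in pred.items()})
```

Under these assumptions the plain softmax orders the gambles as G2 < G1 < G3, whereas a sufficiently large bonus reorders them as G1 < G2 < G3, the pattern reported for WT mice. The epsilon-greedy prediction does not depend on the gamble at all.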
Figure 3 [...] choices of the three rewarding locations plotted as a function of reward probability in the US for the WT (black), β2KO (red, n = 11) and β2VEC (blue, n = 12) mice. Insets, individual curves for the β2KO (top, red) and β2VEC (bottom, blue) mice. (c) Time to goal (in seconds) as a function of reward probability of the goal for the WT (black), β2KO (red) and β2VEC (blue) mice. Insets, individual curves for the β2KO (top, red) and β2VEC (bottom, blue) mice. (d) Examples of in vivo juxtacellular recordings of the firing pattern of DA neurons from anesthetized WT (black), β2KO (red) and β2VEC (blue) mice. (e) Cumulative distribution of percent of spikes in a burst (%SWB). Insets, mean frequency (left) and %SWB (right) of VTA DA neurons from [...] electrophysiological recording illustrating the effect of intravenous injection of nicotine on the [...]
VTA β2*-nAChRs are involved in motivation by uncertainty
In the ICSS bandit task, WT mice displayed a robust preference for uncertain outcomes. Thus, mice estimate expected uncertainty to direct their decisions and locomotion29,30. The suggested role of ACh in signaling expected uncertainty22 prompted us to investigate whether nAChRs are involved in uncertainty-driven motivation. We used mice in which the β2 subunit, the most abundant nicotinic subunit in the brain1,2, was deleted (β2KO mice), in our ICSS bandit task. In the CS, β2KO mice (β2KO and β2GFP (β2KO mice injected with a lentivirus expressing just eGFP); Online Methods) learned the task and responded to different ICSS current intensities similarly to WT mice (Supplementary Fig. 4), confirming the modest implication of nAChRs in decision-making with certain rewards35,36. In contrast, in the US (Fig. 3a), β2KO mice systematically chose the location associated with the highest uncertainty level (that is, 50% probability) to a lower extent than WT mice (T(28) = 5.4, P < 0.001, unpaired t test; Fig. 3b). Furthermore, the relationship between time to goal and reward probability of the goal (F(2,10) = 0.33, P = 0.72, one-way ANOVA; Fig. 3c and Supplementary Fig. 4) was abolished in β2KO animals. These results suggest a role for β2*-nAChRs in decision-making under uncertainty.

We next tested whether β2*-nAChRs could affect motivation by expected uncertainty by acting on VTA DA neurons, which are important for value-based decision-making9,12. Extracellular in vivo single-unit recordings in anesthetized animals (Fig. 3d) confirmed that, when compared with those of WT mice, DA neurons from β2KO mice displayed a decreased firing frequency (2.1 versus 3.2 Hz, T(74) = 2.4, P < 0.001, Welch t test), lacked bursting activity (U = 1,637, P = 0.002, Mann-Whitney test; Fig. 3e) and did not respond to a systemic injection of nicotine (104.6 ± 1.34% from baseline frequency, V = 103, P = 0.95, Wilcoxon test; Fig. 3f,g and Supplementary Fig. 5e,f)13,14. If β2*-nAChRs underlie uncertainty-driven motivation in the VTA, then restoring expression of these receptors in the VTA of β2KO mice should restore both the sensitivity to expected uncertainty and DA activity. We achieved selective re-expression of the β2 subunit in the VTA of β2KO mice (β2VEC) using a lentiviral vector13 strategy (Online Methods). Coronal sections revealed that viral re-expression was restricted to the VTA (Fig. 3h and Supplementary Fig. 5a–d). DA cells from β2VEC mice displayed a spontaneous firing frequency (T(156) = 1.6, P = 0.1, unpaired t test) and bursting activity (U = 3,288, P = 0.9, Mann-Whitney test) similar to those observed in WT mice, and responded to nicotine (120.2 ± 4.78% from baseline frequency,
Figure 4 [...] exploitative choices) in the three gambles, for the WT (black), β2KO (red) and β2VEC (blue) mice. Dots, individual data points. (b) Values of the model parameters (exploration and uncertainty-seeking) derived from the model-based analysis (uncertainty model) of the transition functions for the WT (black), β2KO (red) and β2VEC (blue) mice. The color code indicates the predicted transition in gamble 1 (100 versus 50% reward probability) as a function of the parameters of the model. (c,d) Time to goal as a function of reward probability of the goal and reward history (rewarded or non-rewarded on the previous trial) for β2KO (c) and β2VEC (d) mice. Experimental data (black dots with error bars) and model fit (stripes) are displayed as mean ± s.e.m. Data are merged from experiments with four sets of reward probabilities. (e) Regression coefficients from the best-fitting model of locomotion, corresponding to a constant (T0) and the dependencies on reward history (TR), expected reward (TE) and expected uncertainty (T), for the WT (black), β2KO (red) and β2VEC (blue) mice. Data are presented as mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001. n.s., not significant at P > 0.05.
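The locomotion model summarized in panel e regresses time to goal on a constant, reward history, expected reward and expected uncertainty. The Python sketch below shows one way such a regression could be set up, using ordinary least squares on synthetic, noise-free data. The coefficient values in `TRUE_COEF`, the trial table and all function names are made up for illustration. The authors' actual procedure (including the BIC comparison) is described in their Online Methods and is not reproduced here.

```python
def design_row(p_goal, rewarded_before):
    """Regressors: constant, reward history, expected reward p, expected uncertainty p(1 - p)."""
    return [1.0, float(rewarded_before), p_goal, p_goal * (1.0 - p_goal)]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) w = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up coefficients (in seconds), used only to generate synthetic data:
# constant, reward history, expected reward, expected uncertainty.
TRUE_COEF = [4.0, 0.5, -1.5, -1.5]

# Hypothetical trial table: every goal probability crossed with reward history.
trials = [(p, r) for p in (0.25, 0.5, 1.0) for r in (0, 1)]
rows = [design_row(p, r) for p, r in trials]
times = [sum(w * x for w, x in zip(TRUE_COEF, row)) for row in rows]

coef = fit_least_squares(rows, times)
print([round(c, 3) for c in coef])
```

Because three distinct reward probabilities are used, the expected-reward and expected-uncertainty columns are not collinear, so their separate contributions can be estimated. This mirrors the argument in the text that several gambles are needed to separate reward mean from variance.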
V = 960, P < 0.001, Wilcoxon test), suggesting that, as previously established13,14, physiological functions were also restored. Notably, β2VEC mice differed from β2KO animals, but not from WT mice (Table 1), when analyzing the uncertainty-related choices (Fig. 3b) and the times to goal (Fig. 3c and Supplementary Fig. 4), indicating a restoration of the WT phenotype following re-expression of β2 in the VTA.

We next used the model-based analysis to characterize the role of VTA β2*-nAChRs in decision-making. Transition functions of β2KO and WT mice differed in particular in G1 (100 versus 50%, T(28) = 3.54, P = 0.001, unpaired t test; Fig. 4a), suggesting an alteration of decisions under uncertainty. Indeed, the behavior of β2KO mice was best explained (Supplementary Fig. 6) either by the softmax model or the uncertainty model in which the sensitivity to uncertainty was null on average (T(11) = 0.8, P = 0.44, t test; Fig. 4b). Both models point toward the same interpretation: β2*-nAChRs are necessary for translating uncertainty signals into motivational value. Accordingly, uncertainty-seeking was significantly different in β2KO and WT mice (T(29) = 2.9, P = 0.007, unpaired t test). Notably, the model-based analysis supports the conclusion that β2*-nAChRs selectively re-expressed in the VTA restored the positive value of expected uncertainty (Table 1 and Supplementary Fig. 7). Moreover, the analysis [...] compared to when it was not (F(2,39) = 0.02, P = 0.98). Together with the transition model, where the temperature parameter was not significantly different between genotypes (F(2,39) = 1.6, P = 0.2, [...]
Figure 5 β2*-nAChRs affect decision-making under uncertainty in a dynamical foraging task. (a) Top, illustration of the task design. During each session, animals receive stimulations in two (of three) potential locations, with the two rewarding locations (indicated by an R in the colored circle) changing between sessions. Bottom, behavioral trajectories in the three 2-min sessions for the WT (black) and β2KO (red) mice. (b,c) Repartition (in %) on the three locations (color-coded [...]

[...] goal) and locomotion (how to reach it).
β2*-nAChRs and uncertainty-seeking in a dynamic environment
Having characterized the role of β2*-nAChRs in motivation by expected uncertainty at steady state, we next asked whether our results could be extended to unstable environments. We analyzed the behavior of WT and β2KO mice during the learning sessions of the CS (Supplementary Fig. 8a,b), when reward probabilities were not known yet, and modeled it with reinforcement-learning (RL) models16,29,31,37,38 (Online Methods and Supplementary Fig. 8c–f). In the standard RL model, animals learn the expected value of the three rewarding locations using reward prediction errors (the difference between actual reward and predicted value)31. In the model, animals use these values to select the next action using a softmax decision rule. We extended the standard RL model to uncertainty learning. Animals can use reward prediction errors to estimate reward uncertainty21,23,24,39: the larger the errors (positive or negative) are, the more uncertain the outcomes will be. This uncertainty RL model best explained the behavior of WT mice (Supplementary Fig. 8c,e). By contrast, the behavior of β2KO mice was best accounted for by a standard RL model, that is, without uncertainty bonus (Supplementary Fig. 8d,f).

To further test the importance of β2*-nAChRs for translating expected uncertainty into motivational value, we compared the behavior of WT and β2KO mice in a dynamic setting (DS) in which the locations delivering the ICSS reward changed over time (Online Methods). In the DS, mice underwent three consecutive sessions in which only two of the three locations delivered the ICSS. Overall, WT and β2KO mice adapted their strategies to these changes in reward contingencies (Fig. 5a). Starting from a random strategy, both WT and β2KO mice learned the position of the two rewarding locations in the first session (Fig. 5b,c). However, β2KO mice persevered in their earlier choices throughout the changes in outcomes (Fig. 5b,c), resulting in slightly fewer rewarded choices than for WT mice (T(22) = 2.7, P = 0.01, unpaired t test; Fig. 5d). Model comparison (RL models; Supplementary Fig. 9 and Online Methods) suggested that an uncertainty bonus model best explained the behavior of WT mice (Supplementary Fig. 9). This uncertainty model reproduced the choices of WT animals during the changes in rewarding outcomes (Fig. 5e), with a positive bonus given to uncertainty (2.18 ± 0.77). This is consistent with the results in the CS task (Fig. 4 and Supplementary Fig. 8). Moreover, model comparison (Supplementary Fig. 9) suggested that experimental data was not better explained by indirect effects arising from learning, that is, asymmetric (different for reward and omission38) or adaptive (uncertainty-dependent) learning rates37. In summary, our results and models support the idea that, in WT mice, expected uncertainty exerts a direct motivational effect. By contrast, the behavior of β2KO mice could be explained by either the standard RL model or the expected uncertainty model (Fig. 5f and Supplementary Fig. 8b,d,f). In this latter model, the uncertainty-seeking parameter in the β2KO mice was significantly lower than that of WT mice (T(22) = 2.4, P = 0.027, unpaired t test; Fig. 5g) and not significantly different from zero (T(22) = 0.6, P = 0.54). These results provide further evidence that β2*-nAChRs are involved in uncertainty-seeking.

Uncertainty-seeking in other nAChRs-related behaviors
Finally, using computational approaches, we assessed whether the role of VTA β2*-nAChRs in uncertainty-seeking might pervade other decisions about natural rewards, punishments and salient aspects of the environment. Paradoxically, it has been found that mice lacking the β2 subunit perform seemingly better than WT mice, displaying improved spatial learning40 and passive avoidance41. The spatial learning test consists of a maze with a reachable food reward at one of the four arms and an unreachable food at the opposite arm (Fig. 6a). We simulated this task with an RL model embedding uncertainty-seeking (Fig. 6b and Online Methods). The model fitted the behavior of both strains, as the time to reach the food was greater for WT (with an uncertainty bonus) than for β2KO mice (without bonus) in the early trials40 (Fig. 6c). This slowly decreasing time to reward progressively emerges in RL models embedding uncertainty-seeking (Fig. 6d), but cannot be easily explained in terms of differences in initial value (novelty-seeking31,32), learning rates or a combination of both (Supplementary Fig. 10). Hence, interest for the unreachable reward may arise in WT mice from uncertainty, integrated at the level of the VTA. We also assessed whether the same explanation
holds in the case of punishment. We simulated the passive avoidance task, where animals were in a box divided in two compartments (light and dark). β2KO mice avoided the dark compartment, which was associated with a foot shock, for a longer time than WT mice41 (Fig. 6e). Uncertainty-seeking can also explain this difference, as the foot shock induces a single negative prediction error, which results in uncertainty (Fig. 6f). Expected uncertainty may in that case motivate WT mice, but not β2KO mice, to explore the dark part of the box in spite of potential negative consequences. Finally, these models can be extended to neutral, but potentially uncertain, outcomes. The deficits of β2KO mice in locomotion in an open-field without rewards13,42 can be understood as a lack of uncertainty-seeking (Supplementary Fig. 11a–e). Exploration in the open-field is composed of action patterns related to information-seeking (scanning, rearing and sniffing42,43). The apparent lack of object recognition observed in β2KO mice40 can also be interpreted as a lack of curiosity for the objects, that is, an absence of uncertainty-seeking (Supplementary Fig. 11f,g), rather than a memory deficit. The uncertainty-seeking model not only generalizes our results to positive, aversive and neutral natural outcomes, but also provides a parsimonious interpretation for a wealth of behaviors associated with β2*-nAChRs13,40–42.

[...] an intrinsic reinforcement signal24 (or an intrinsic incentive) for which gathering information would be self-satisfactory, helping the animal to better predict its environment19,45. ACh is closely related to information processing1,2. We found that the cholinergic control of DA could underpin the motivational properties of information. This finding could explain the observed similarities when ACh or DA are pharmacologically manipulated during value-based decisions5. Several mechanisms may underlie functional ACh-DA interactions in the brain. Mesopontine ACh might directly signal expected uncertainty (the reward variance in our model), as proposed for forebrain ACh22. Alternatively, our data suggest a contribution of β2*-nAChRs to the spontaneous excitability of DA neurons, with anesthetized β2KO animals lacking bursting of DA neurons14. In this case, cholinergic signaling onto the VTA via β2*-nAChRs could serve as a permissive gate15, rendering DA neurons more responsive (that is, affecting the uncertainty-seeking parameter) to uncertainty signals generated elsewhere in the mesocorticolimbic loop23. A strong prediction of these interpretations would be that ACh is implicated in the encoding of expected uncertainty by DA neurons25. Nevertheless, we cannot totally exclude, with our lentiviral strategy, downstream adaptations in β2KO mice or an effect at the level of axon terminals, where β2*-nAChRs also influence the transfer function between DA [...]
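The uncertainty-learning extension of the RL model described in the Results (reward prediction errors update the expected value, larger errors mean more uncertainty, and a softmax rule selects the next location) can be sketched in Python as follows. This is a minimal illustration, not the authors' model. It assumes that uncertainty is tracked as a running average of squared prediction errors and that the bonus enters the value additively. The learning rates `alpha` and `alpha_u`, the inverse temperature `beta`, the bonus weight `phi` and the location names are all hypothetical.

```python
import math
import random


def update(value, uncertainty, reward, alpha=0.1, alpha_u=0.1):
    """One learning step driven by the reward prediction error (delta)."""
    delta = reward - value
    new_value = value + alpha * delta
    # Assumption: uncertainty is a running average of squared prediction errors.
    new_uncertainty = uncertainty + alpha_u * (delta ** 2 - uncertainty)
    return new_value, new_uncertainty


def choose(current, values, uncertainties, beta, phi, rng):
    """Softmax over the two other locations (no immediate repeat allowed)."""
    options = [k for k in values if k != current]
    scores = [beta * (values[k] + phi * uncertainties[k]) for k in options]
    p_first = 1.0 / (1.0 + math.exp(scores[1] - scores[0]))
    return options[0] if rng.random() < p_first else options[1]


def simulate(reward_prob, n_trials=3000, beta=3.0, phi=1.0, seed=0):
    """Simulate an agent visiting locations with the given reward probabilities."""
    rng = random.Random(seed)
    values = {k: 0.0 for k in reward_prob}
    uncertainties = {k: 0.0 for k in reward_prob}
    visits = {k: 0 for k in reward_prob}
    current = next(iter(reward_prob))
    for _ in range(n_trials):
        current = choose(current, values, uncertainties, beta, phi, rng)
        reward = 1.0 if rng.random() < reward_prob[current] else 0.0
        values[current], uncertainties[current] = update(
            values[current], uncertainties[current], reward)
        visits[current] += 1
    return values, uncertainties, visits


values, uncertainties, visits = simulate({"A": 1.0, "B": 0.5, "C": 0.25})
print(values, uncertainties, visits)
```

Setting `phi=0.0` recovers the standard RL model without uncertainty bonus, which is the variant the text reports as sufficient for β2KO mice.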
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Everitt, B.J. & Robbins, T.W. Central cholinergic systems and cognition. Annu. Rev. Psychol. 48, 649–684 (1997).
2. Dani, J.A. & Bertrand, D. Nicotinic acetylcholine receptors and nicotinic cholinergic mechanisms of the central nervous system. Annu. Rev. Pharmacol. Toxicol. 47, 699–729 (2007).
3. Guillem, K. et al. Nicotinic acetylcholine receptor β2 subunits in the medial prefrontal cortex control attention. Science 333, 888–891 (2011).
4. Rangel, A., Camerer, C. & Montague, P.R. A framework for studying the neurobiology of value-based decision making. Nat. Rev. Neurosci. 9, 545–556 (2008).
5. Fobbs, W.C. & Mizumori, S.J. Cost-benefit decision circuitry: proposed modulatory role for acetylcholine. Prog. Mol. Biol. Transl. Sci. 122, 233–261 (2014).
6. Kolokotroni, K.Z., Rodgers, R.J. & Harrison, A.A. Acute nicotine increases both impulsive choice and behavioral disinhibition in rats. Psychopharmacology (Berl.) 217, 455–473 (2011).
7. Mendez, I.A., Gilbert, R.J., Bizon, J.L. & Setlow, B. Effects of acute administration of nicotinic and muscarinic cholinergic agonists and antagonists on performance in different cost-benefit decision making tasks in rats. Psychopharmacology (Berl.) 224, 489–499 (2012).
8. McGrath, D.S. & Barrett, S.P. The comorbidity of tobacco smoking and gambling: a review of the literature. Drug Alcohol Rev. 28, 676–681 (2009).
9. Schultz, W. Multiple dopamine functions at different time courses. Annu. Rev. Neurosci. 30, 259–288 (2007).
10. Waelti, P., Dickinson, A. & Schultz, W. Dopamine responses comply with basic assumptions of formal learning theory. Nature 412, 43–48 (2001).
11. Montague, P.R., Dayan, P. & Sejnowski, T.J. A framework for mesencephalic [...]
26. Schuck-Paim, C., Pompilio, L. & Kacelnik, A. State-dependent decisions cause apparent violations of rationality in animal choice. PLoS Biol. 2, e402 (2004).
27. Carlezon, W.A. Jr. & Chartoff, E.H. Intracranial self-stimulation (ICSS) in rodents to study the neurobiology of motivation. Nat. Protoc. 2, 2987–2995 (2007).
28. Kobayashi, T., Nishijo, H., Fukuda, M., Bureš, J. & Ono, T. Task-dependent representations in rat hippocampal place neurons. J. Neurophysiol. 78, 597–613 (1997).
29. Funamizu, A., Ito, M., Doya, K., Kanzaki, R. & Takahashi, H. Uncertainty in action-value estimation affects both action choice and learning rate of the choice behaviors of rats. Eur. J. Neurosci. 35, 1180–1189 (2012).
30. Anselme, P., Robinson, M.J.F. & Berridge, K.C. Reward uncertainty enhances incentive salience attribution as sign-tracking. Behav. Brain Res. 238, 53–61 (2013).
31. Sutton, R.S. & Barto, A.G. Reinforcement Learning: an introduction (MIT Press, 1998).
32. Kakade, S. & Dayan, P. Dopamine: generalization and bonuses. Neural Netw. 15, 549–559 (2002).
33. Herrnstein, R.J. Relative and absolute strength of response as a function of frequency of reinforcement. J. Exp. Anal. Behav. 4, 267–272 (1961).
34. Ishii, S., Yoshida, W. & Yoshimoto, J. Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Netw. 15, 665–687 (2002).
35. Yeomans, J. & Baptista, M. Both nicotinic and muscarinic receptors in ventral tegmental area contribute to brain-stimulation reward. Pharmacol. Biochem. Behav. 57, 915–921 (1997).
36. Serreau, P., Chabout, J., Suarez, S.V., Naudé, J. & Granon, S. Beta2-containing neuronal nicotinic receptors as major actors in the flexible choice between conflicting motivations. Behav. Brain Res. 225, 151–159 (2011).
37. Krugel, L.K., Biele, G., Mohr, P.N., Li, S.-C. & Heekeren, H.R. Genetic variation in dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt [...]
ubiquitous mouse phosphoglycerate kinase (PGK) promoter. Further details can be found in ref. 13. β2KO mice aged 8 weeks were anesthetized using isoflurane. The mouse was introduced into a stereotaxic frame adapted for mice. Lentivirus (2 µl at 75 ng of p24 protein per µl) was injected bilaterally at: anteroposterior = 3.4 mm, mediolateral = 0.5 mm from bregma and dorsoventral = 4.4 mm from the surface for VTA injection. Mice were implanted with electrodes 4–5 weeks after viral injection. At the end of the behavioral experiments, lentiviral re-expression in the VTA was verified using fluorescence immunohistochemistry. As a control for β2VEC mice, another group of β2KO mice was injected with lentivirus expressing eGFP only. We did not observe any difference between β2KO (without lentiviral injections, n = 6) and β2-eGFP mice (n = 6) in either choices (P = 0.76, unpaired t test) or time-to-goal (P = 0.34, unpaired t test). We thus pooled the data from both groups to serve as control for β2VEC data.

In vivo electrophysiological recordings. Extracellular recording electrodes were constructed from borosilicate glass tubing (1.5 mm O.D. / 1.17 mm I.D.) using a vertical electrode puller (Narishige). The tip was broken and electrodes were filled with a 0.5% sodium acetate solution (wt/vol) and 1.5% neurobiotin (wt/vol), yielding impedances of 6–9 MΩ.

Animals were anesthetized with chloral hydrate (400 mg per kg of body weight, intraperitoneal, supplemented as required to maintain optimal anesthesia throughout the experiment), and placed in a stereotaxic apparatus (Kopf Instruments). The left saphenous vein was catheterized for intravenous administration of nicotine and the right saphenous vein was catheterized for intravenous administration of saline solution (NaCl 0.9%, wt/vol). The electrophysiological activity was sampled in the central region of the VTA (coordinates: 3.1–3.5 mm posterior to Bregma, 0.3–0.6 mm lateral to midline and 4–4.7 mm below the brain surface)50. Spontaneously active DAergic neurons were identified on the basis of previously established electrophysiological criteria: (1) a typical triphasic action potential with a marked negative deflection; (2) a characteristic long duration (>2.0 ms); (3) an action potential width from start to negative trough >1.1 ms; (4) a slow firing rate (between 1 and 10 Hz) with an irregular single spiking pattern and occasional short, slow bursting activity51. At least 5 min of spontaneous baseline electrophysiological activity was recorded before intravenous injection of nicotine (30 µg per kg). At the end of the recording period, the neurons were stimulated by application of positive current steps to electroporate neurobiotin into the neurons, allowing identification of DA neurons.

Analysis of electrophysiological data. DA cell firing was analyzed with respect to the average firing rate and the percentage of spikes within bursts (%SWB, number of spikes within bursts, divided by total number of spikes). Bursts were identified as discrete events consisting of a sequence of spikes such that: their onset is defined by two consecutive spikes within an interval <80 ms and they terminated with an

using Prolong Gold Antifade Reagent (Invitrogen, P36930). Microscopy was carried out with a fluorescent microscope, and images were captured using a camera and ImageJ imaging software.

In the case of electrophysiological recordings, an immunohistochemical identification of the recorded neurons was performed as described above, with the addition of 1:200 AMCA-conjugated Streptavidin (Jackson ImmunoResearch) in the solution. Labeling for both TH and neurobiotin in the VTA50 confirmed the neurochemical phenotype of the recorded neurons.

Electrode implantation and ICSS training. Mice were introduced into a stereotaxic frame and implanted unilaterally with bipolar stimulating electrodes for ICSS in the medial forebrain bundle27,28 (MFB, anteroposterior = 1.4 mm, mediolateral = 1.2 mm, from the bregma, and dorsoventral = 4.8 mm from the dura). After recovery from surgery (1 week), the efficacy of electrical stimulation was verified in an open field with an explicit square location (side = 1 cm) at its center. Each time a mouse was detected in the area (D = 3 cm) of the location, a 200-ms train of 20 0.5-ms biphasic square waves pulsed at 100 Hz was generated by a stimulator28. Mice self-stimulating at least 50 times in a 5-min session were kept for the behavioral sessions (3 mice were excluded at this stage, due to improper electrode implantation). In the certain setting (see below), ICSS intensity was adjusted so that mice self-stimulated between 50 and 150 times per session at the end of the training (ninth and tenth session). Current intensity was subsequently kept the same throughout the uncertainty setting.

Behavioral data acquisition. Decision-making and locomotor activity were recorded in a 1-m diameter circular open-field. Experiments were performed using a video camera, connected to a video-track system, out of sight of the experimenter. Custom software (LabVIEW, National Instruments) tracked the animal, recorded its trajectory (20 frames per s) for 5 min and sent TTL pulses to the ICSS stimulator when appropriate (see below).

Markovian decision problem by ICSS conditioning. We considered two complementary aspects of motivation: direction and locomotion of the mice. We thus developed a protocol that simultaneously records the sequential choices between differently rewarding locations (that is, associated with intracranial self-stimulation) and the locomotor activity of the mice in between these locations. After validation of ICSS behavior27, conditioning tasks took place in the 0.8-m diameter circular open-field. Three explicit square locations were placed in the open field, forming an equilateral triangle (side = 50 cm). Each time a mouse was detected in the area of one of the rewarding locations, a stimulation train was delivered. Animals received stimulations only when they alternated between rewarding locations. In separate experiments, the intensity or the probability of stimulation delivery differed for the three rewarding locations. Precise parameters (for example, reward probabilities) were pseudo-randomly assigned to each
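The alternation rule of the conditioning task (a stimulation is available only at a location different from the current one, and is delivered with a location-specific probability) can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the function name `run_session`, the reward probabilities in `REWARD_PROBS` and the uniformly random choice policy are hypothetical placeholders, not the parameters or the behavior reported here.

```python
import random


def run_session(probs, n_choices, rng):
    """Simulate sequential choices among rewarding locations.

    probs     -- reward probability of each location (hypothetical values)
    n_choices -- number of choices to simulate
    rng       -- random.Random instance, for reproducibility
    Returns a list of (chosen_location, rewarded) tuples.
    """
    location = rng.randrange(len(probs))  # arbitrary starting location
    history = []
    for _ in range(n_choices):
        # Alternation rule: the current location is never an option.
        options = [i for i in range(len(probs)) if i != location]
        # Placeholder policy: pick one of the two other locations at random.
        location = rng.choice(options)
        rewarded = rng.random() < probs[location]
        history.append((location, rewarded))
    return history


REWARD_PROBS = [0.25, 0.5, 1.0]  # hypothetical, for illustration only
history = run_session(REWARD_PROBS, 200, random.Random(0))
```

Each step is a binary choice between the two locations other than the current one, which is the "gamble" structure that a choice model would replace the random policy with.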
The foraging phase assessed the exploratory strategy in a dynamic setting (DS), which consisted of three consecutive 5-min sessions. In each session, two out of three locations delivered the ICSS reward, and the identity of the two rewarding locations changed every session.

Analysis of locomotion. Locomotor activity toward the rewarding locations was measured in terms of time-to-goal, speed profile, dwell time and traveled distance. Time-to-goal measures the duration between one choice and the next one. The speed profile corresponds to the instantaneous speed as a function of time (expressing it as a function of the distance between two locations did not give any additional information). We averaged the speed profiles on a 10-s interval (the same used for restricting the choices considered in the analysis), which was zero-padded if the reward location was attained before 10 s. The dwell time is defined as the duration between the moment of the detection in the last rewarding location and the moment when the animal's speed is greater than 10 cm s⁻¹. The traveled distance corresponds to the summation of the local distances between two points of the mouse's trajectory (20 frames per s) between the last and the next choice. A multiple linear regression was performed on the time-to-goal, in the different sets of probabilities of the US setting. We compared models with increasing number of explanatory variables. As potential explanatory variables, we included reward history (whether the animal just got rewarded or not, as a binary variable), the expected reward of the goal, the expected uncertainty of the goal, the expected reward of the alternative (that is, the location not chosen in the gamble), and the expected uncertainty of the alternative. We compared these linear models based on their summed squared errors, penalized for complexity (Bayesian information criterion): $\mathrm{BIC}(\mathrm{TTG}) = n \ln(\mathrm{SSE}/n) + k \ln(n)$, where n is the number of observations (time-to-goal, n is the same for all regressions), k the number of explanatory variables, and SSE the summed squared errors from the multiple linear regressions. Constant terms were omitted from the formula for simplicity, as the BICs of the linear regressions were only used for comparisons.

Computational models of decision-making. In the US, we investigated how well variants of decision-making models fit the transition function (that is, choices) of both genotypes. At the end of the US, since mice are trained and choice behavior is at steady-state, we only modeled decision-making, and used the settings of the task (that is, reward probabilities) as fixed parameters for the values of the options (see below). In the DS and in the learning phase of the US, we modeled both learning (see below) and decision-making, and we evaluated how well the models fit the animals' choices, which were not at steady-state. These models are thus based on the estimation of the expected payoffs (value) and uncertainties of the options, rather than on objective parameters of the task.

Decision-making models determined the probability $P_i$ of choosing the next state i, as a function (the choice rule) of a decision variable. Because mice could not return to the same rewarding location, they had to choose between

where $\beta$ is an inverse temperature parameter reflecting the sensitivity of choice to the difference between decision variables.

In standard reinforcement learning31, the value of an option is the expected (average) reward. In the US, where the choices are at steady-state, the expected reward is taken as the reward probability

$$V_i = E_i(\mathrm{ICSS}) = p_i(\mathrm{ICSS}) \tag{4}$$

In models embedding an exploration bonus, the value depends on both expected reward and uncertainty16,17,29. Uncertainty may refer to estimation uncertainty (due to incomplete knowledge or sampling of the outcome), to the expected uncertainty (or reward risk), related to the estimated variability of the outcome, or to unexpected uncertainty, that is, uncertainty greater than expected22,23,44. The expected uncertainty scheme is similar to the mean-variance approach used in neuroeconomic studies53 and it has also been proposed to drive exploration19,21,24,25,30. In the US, as mice are trained and choice behavior is at steady-state, we used this version of the model, where the decision variable is a compound of the true (that is, not estimated by a learning algorithm) mean and variance of the payoff

$$V_i = E_i(\mathrm{ICSS}) + \varphi\,\sigma_i^2(\mathrm{ICSS}) = p_i(\mathrm{ICSS}) + \varphi\,p_i(\mathrm{ICSS})\bigl(1 - p_i(\mathrm{ICSS})\bigr) \tag{5}$$

This compound value is then nested in the softmax choice rule. Note that expected uncertainty ($\sigma_i^2$) can also be estimated through learning (see equation (10)).

Finally, in the uncertainty-based temperature model (or local control of randomness34), uncertainty associated with all the possible actions at a state controls the randomness of choices (that is, the temperature parameter). In this strategy, the randomness of action selection does not depend on the variability of the possible outcomes. In the softmax model (equation (3)), when different choices yield comparable outcomes, the decision process is random even with large $\beta$, while a large difference in values results in greedy action selection even for small $\beta$. To circumvent this issue, it is possible to normalize the temperature parameter $\beta_i$ for each state i.

$$\beta_i = \frac{\beta_0}{E(V_j^2) - E(V_j)^2} \tag{6}$$

where $\beta_0$ is a constant (free) parameter, whereas $E(V_j^2) - E(V_j)^2$ represents the uncertainty (or variability) of the state i (over all the possible actions j) rather than reward uncertainty associated with a particular action.

Reinforcement learning models determined the evolution of the decision variables, which are in this case estimations of the task parameters. The values of
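A minimal sketch of the expected-uncertainty model may help make equation (5) concrete: the compound value (reward probability plus a bonus proportional to the Bernoulli variance) is passed through a softmax restricted to the two locations that differ from the current one. Equation (3) is not reproduced in this part of the text, so the standard softmax form is assumed here; the function names, the reward probabilities and the parameter values are hypothetical.

```python
import math


def compound_values(probs, phi):
    """Equation (5): V_i = p_i + phi * p_i * (1 - p_i)."""
    return [p + phi * p * (1.0 - p) for p in probs]


def softmax(values, beta):
    """Standard softmax with inverse temperature beta (assumed form of eq. (3))."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probs(probs, current, beta, phi):
    """Probability of choosing each location other than the current one."""
    values = compound_values(probs, phi)
    options = [i for i in range(len(probs)) if i != current]
    p = softmax([values[i] for i in options], beta)
    return dict(zip(options, p))


# Hypothetical reward probabilities; the animal sits at location 2.
probs = [0.25, 0.5, 1.0]
without_bonus = choice_probs(probs, current=2, beta=3.0, phi=0.0)
with_bonus = choice_probs(probs, current=2, beta=3.0, phi=1.0)
print(without_bonus, with_bonus)
```

Because the variance p(1 − p) peaks at p = 0.5, a positive bonus shifts choices toward the most uncertain option, whereas setting the bonus to zero recovers the standard value model of equation (4).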
We also used an extended version of reinforcement-learning model23,39 to evaluate the expected uncertainty of the rewarding locations. The rationale behind this model is that uncertain and unpredictable outcomes produce large prediction errors (positive and negative), by definition. Hence squared prediction errors (equation (7)) can be used to estimate unpredictability or uncertainty $\sigma_{i,t}^2$

$$\sigma_{i,t}^2 = \sigma_{i,t-1}^2 + \alpha_\varphi\,\xi_{i,t} \tag{10}$$

where $\alpha_\varphi$ is the learning rate for uncertainty, and $\xi_{i,t}$ is the uncertainty (or risk) prediction error of the option i at trial t, that is,

$$\xi_{i,t} = \delta_{i,t}^2 - \sigma_{i,t-1}^2 \tag{11}$$

The uncertainty prediction error corresponds to unexpected uncertainty (uncertainty larger than expected) and we tested whether exploration might be directed by an unexpected form of uncertainty, by assigning a bonus to this error term

$$V_{i,t}^{*} = V_{i,t} + \xi_{i,t} \tag{12}$$

Finally, uncertainty may exert an indirect effect through learning. It has been shown in humans that learning rate itself can increase with sudden changes in uncertainty37,54. We tested the following adaptive learning rate model37, where learning rate increases when there is a recent increase m in absolute prediction errors

$$\alpha_t = \begin{cases} \alpha_{t-1} + f(m_t)\,(1 - \alpha_{t-1}), & m_t > 0 \\ \alpha_{t-1} + f(m_t)\,\alpha_{t-1}, & m_t < 0 \end{cases} \tag{13}$$

where f(m) is a double sigmoid function $f(m_t) = \mathrm{sign}(m_t)\bigl(1 - e^{-(m_t/\lambda)^2}\bigr)$, where m is the slope of the (recent) smoothed absolute reward prediction errors, $m_t = 2\,\dfrac{\delta_t^{abs} - \delta_{t-1}^{abs}}{\delta_t^{abs} + \delta_{t-1}^{abs}}$. Smoothing of absolute prediction errors is achieved by $\delta_t^{abs} = \delta_{t-1}^{abs}(1 - \alpha_1) + |\delta_t|\,\alpha_1$. The free parameter $\lambda$ determines the degree to which uncertainty (absolute prediction errors) affects the learning rate, and the other free parameter, $\alpha_1$, determines the initial learning rate and the speed of $\delta_t^{abs}$ updating.

In the US, at steady-state, we fitted the free parameters of the four decision-making models (the matching law, ε-greedy, softmax and the uncertainty model; the matching law has no free parameter). In the learning phase of the US, we fitted the free parameters of these four models: standard RL, RL with uncertainty learning and expected uncertainty bonus, RL with adaptive (uncertainty-dependent) learning rate, and RL with uncertainty learning and unexpected uncertainty bonus. We fixed the initial conditions (V(0) = 1, and σ(0) = 0), because the mice underwent the certain setting just beforehand.

of the VTA. Hence, once the best model is determined, possible differences in the free parameters between genotypes or conditions point at the computational role of the β2 subunit-containing nAChRs expressed in the VTA in decision-making processes.

Extension of the uncertainty model to previous experiments on β2KO mice. We also aimed at extending our framework by modeling the results from previous studies focusing on the behavioral differences between WT and β2KO mice with reinforcement learning models embedding an uncertainty-based exploratory bonus (equations (5, 7, 8, 10 and 11)). In these experiments, uncertainty was not explicitly controlled but was nevertheless present, as in most decision tasks. We thus used the main difference found in the model-based analysis of our decision task, that is, a positive value given to uncertainty in WT, but not in β2KO, mice, and explored the values for uncertainty estimation to qualitatively match the data. All experiments were modeled as MDPs with a discretization of the relevant states for the animals.

In the open-field experiment13,42, we used the symbolic decomposition of the behavior proposed in ref. 42, by splitting the locomotion of the mice into active versus inactive states, and their positions into center and periphery states. The active state corresponds to high-speed navigation, while the inactive state corresponds to low-speed exploration, mainly composed of rearing, scanning and sniffing behaviors42,43. This double dichotomy gives rise to four states, which we modeled as an MDP with all transitions possible, except for the stay transitions (that is, of one state on itself) and the transitions between periphery-inactive (PI) and center-inactive (CI) states, which were not found in the data13,42. The duration of one state was 1 s. We modeled the difference between WT and β2KO mice by adding an exploratory bonus to the inactive states in WT mice only, which we deduced from the experimental (average) transition probability and the softmax decision rule with bonus as follows. In the center-active state, the probability of going to the center-inactive state is given by $P(\mathrm{CI} \mid \mathrm{CA}) = e^{\beta V_{CI}} / (e^{\beta V_{CI}} + e^{\beta V_{PA}})$, where $V_{CI}$ and $V_{PA}$ represent the values associated with the center-inactive and the periphery-active states, so we computed the relations between $V_{PA}$ and $V_{CI}$, and between $V_{PI}$ and $V_{CA}$, and fitted $\beta$, $V_{PI}$ and $V_{CI}$ to reproduce the data.

In the object recognition task40, two objects are placed in an open-field, and the time spent in the objects' area is measured as a function of the behavioral sessions. We modeled this task as an MDP using a discretization of space, consisting of 25 states corresponding to the open-field without objects, and two states corresponding to the objects. The duration of one state was 1 s. We used the uncertainty model (no reward being present in the task, we modeled the uncertainties but not the values) and we fitted the free parameters of the model and the initial uncertainties of the objects and of the open-field to reproduce the data.

In the spatial maze40, we modeled an idealized version of this conditioning task, consisting of four states, corresponding to the arms of the maze. One of them delivered a reachable food reward (R = 1 if reached), and was absorbing:
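The uncertainty-learning scheme of equations (10) and (11) can be sketched in a few lines. Equations (7) and (8) are not reproduced in this part of the text, so the sketch assumes the standard delta rule (prediction error = reward minus value; value updated by a learning rate times that error). The function names and the learning-rate values are hypothetical; the initial conditions V(0) = 1 and zero initial uncertainty follow the text.

```python
import random


def update(value, sigma2, reward, alpha, alpha_phi):
    """One trial of value and expected-uncertainty learning for one option."""
    delta = reward - value              # assumed form of equation (7)
    xi = delta ** 2 - sigma2            # equation (11): uncertainty prediction error
    value = value + alpha * delta       # assumed form of equation (8)
    sigma2 = sigma2 + alpha_phi * xi    # equation (10)
    return value, sigma2


def simulate(p_reward, n_trials, alpha=0.05, alpha_phi=0.05, seed=0):
    """Learn value and uncertainty of a single option rewarded with probability p_reward."""
    rng = random.Random(seed)
    value, sigma2 = 1.0, 0.0            # initial conditions stated in the text
    for _ in range(n_trials):
        reward = 1.0 if rng.random() < p_reward else 0.0
        value, sigma2 = update(value, sigma2, reward, alpha, alpha_phi)
    return value, sigma2


# A certain option keeps zero uncertainty; a 50% option should drift
# toward a value near 0.5 and an uncertainty near p(1 - p) = 0.25.
print(simulate(1.0, 500))
print(simulate(0.5, 2000))
```

In this sketch the learned uncertainty tracks the running average of squared prediction errors, so it approaches the outcome variance when the value estimate is close to the reward probability.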
where $\beta$ is the inverse temperature (sensitivity to value) and a threshold representing the basal locomotor activity of the animal. In this model, the agent evaluates the probability of going to the dark part, based on its single experience of a foot shock, which induced a single, negative, reward prediction error (equation (7)), resulting both in a decrease in value (equation (8)) and an increased uncertainty (equation (10)). The time-step for each evaluation was 1 s. We measured the time before the agents go to the dark part, as done in the experiment41. For each model experiment, standard errors were obtained following a bootstrap procedure, using the sample size of the original data.

Statistical analysis. No statistical methods were used to predetermine sample sizes. Our sample sizes are comparable to many studies using similar techniques and animal models. We used a pseudo-randomization procedure, in the sense that in the behavioral experiments, precise parameters (for example, reward probabilities) were pseudo-randomly assigned to each rewarding location for

50. Paxinos, G. & Franklin, K.B. The Mouse Brain in Stereotaxic Coordinates (Gulf Professional Publishing, 2004).
51. Grace, A.A. & Bunney, B.S. Intracellular and extracellular electrophysiology of nigral dopaminergic neurons--1. Identification and characterization. Neuroscience 10, 301–315 (1983).
52. Rokosik, S.L. & Napier, T.C. Intracranial self-stimulation as a positive reinforcer to study impulsivity in a probability discounting paradigm. J. Neurosci. Methods 198, 260–269 (2011).
53. D'Acremont, M. & Bossaerts, P. Neurobiological studies of risk assessment: a comparison of expected utility and mean-variance approaches. Cogn. Affect. Behav. Neurosci. 8, 363–374 (2008).
54. Behrens, T.E.J., Woolrich, M.W., Walton, M.E. & Rushworth, M.F.S. Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007).
55. Daw, N.D. Trial-by-trial data analysis using computational models. in Decision Making, Affect, and Learning: Attention and Performance XXIII (eds. Delgado, M.R., Phelps, E.A. & Robbins, T.W.) 3–38 (2011).
56. McClure, S.M., Daw, N.D. & Montague, P.R. A computational substrate for incentive salience. Trends Neurosci. 26, 423–428 (2003).