
Available online at www.sciencedirect.com

Journal of Memory and Language 59 (2008) 390-412

www.elsevier.com/locate/jml

Mixed-effects modeling with crossed random effects for subjects and items

R.H. Baayen a,*, D.J. Davidson b, D.M. Bates c

a University of Alberta, Edmonton, Department of Linguistics, Canada T6G 2E5
b Max Planck Institute for Psycholinguistics, P.O. Box 310, 6500 AH Nijmegen, The Netherlands
c University of Wisconsin, Madison, Department of Statistics, WI 53706-168, USA

Received 15 February 2007; revision received 13 December 2007
Available online 3 March 2008

Abstract

This paper provides an introduction to mixed-effects models for the analysis of repeated measurement data with subjects and items as crossed random effects. A worked-out example of how to use recent software for mixed-effects modeling is provided. Simulation studies illustrate the advantages offered by mixed-effects analyses compared to traditional analyses based on quasi-F tests, by-subjects analyses, combined by-subjects and by-items analyses, and random regression. Applications and possibilities across a range of domains of inquiry are discussed.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Mixed-effects models; Crossed random effects; Quasi-F; By-item; By-subject

Psycholinguists and other cognitive psychologists use convenience samples for their experiments, often based on participants within the local university community. When analyzing the data from these experiments, participants are treated as random variables, because the interest of most studies is not about experimental effects present only in the individuals who participated in the experiment, but rather in effects present in language users everywhere, either within the language studied, or human language users in general. The differences between individuals due to genetic, developmental, environmental, social, political, or chance factors are modeled jointly by means of a participant random effect.

A similar logic applies to linguistic materials. Psycholinguists construct materials for the tasks that they employ by a variety of means, but most importantly, most materials in a single experiment do not exhaust all possible syllables, words, or sentences that could be found in a given language, and most choices of language to investigate do not exhaust the possible languages that an experimenter could investigate. In fact, two core principles of the structure of language, the arbitrary (and hence statistical) association between sound and meaning and the unbounded combination of finite lexical items, guarantee that a great many language materials must be a sample, rather than an exhaustive list. The space of possible words, and the space of possible sentences, is simply too large to be modeled by any other means. Just as we model human participants as random variables, we have to model factors characterizing their speech as random variables as well.

* Corresponding author. Fax: +1 780 4920806.
E-mail addresses: baayen@ualberta.ca (R.H. Baayen), Doug.Davidson@fcdonders.ru.nl (D.J. Davidson), bates@stat.wisc.edu (D.M. Bates).

0749-596X/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.jml.2007.12.005
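The idea that both participants and items are samples from much larger populations can be made concrete with a small simulation. The sketch below is illustrative Python, not part of the paper's own analysis, and all parameter values are invented; it generates a fully crossed data set in which every subject responds to every item, with independent random intercepts for subjects and for items:

```python
import random

random.seed(1)

n_subj, n_item = 6, 4
grand_mean = 520.0                      # invented population mean RT (ms)
subj_sd, item_sd, noise_sd = 30.0, 25.0, 10.0

# Each subject and each item receives its own random adjustment to the intercept.
subj_int = [random.gauss(0, subj_sd) for _ in range(n_subj)]
item_int = [random.gauss(0, item_sd) for _ in range(n_item)]

# Crossed design: every subject sees every item, so subject and item
# effects are crossed rather than nested.
data = [(s, i, grand_mean + subj_int[s] + item_int[i] + random.gauss(0, noise_sd))
        for s in range(n_subj) for i in range(n_item)]

print(len(data))  # 24 rows: 6 subjects x 4 items
```

Because every subject-item pairing occurs, neither grouping factor is nested inside the other; this is the data structure the crossed-random-effects models discussed below are designed for.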

Clark (1973) illuminated this issue, sparked by the work of Coleman (1964), by showing how language researchers might generalize their results to the larger population of linguistic materials from which they sample by testing for statistical significance of experimental contrasts with participants and items analyses. Clark's oft-cited paper presented a technical solution to this modeling problem, based on statistical theory and computational methods available at the time (e.g., Winer, 1971). This solution involved computing a quasi-F statistic which, in the simplest-to-use form, could be approximated by the use of a combined minimum-F statistic derived from separate participants (F1) and items (F2) analyses. In the 30+ years since, statistical techniques have expanded the space of possible solutions to this problem, but these techniques have not yet been applied widely in the field of language and memory studies. The present paper discusses an alternative known as a mixed effects model approach, based on maximum likelihood methods that are now in common use in many areas of science, medicine, and engineering (see, e.g., Faraway, 2006; Fielding & Goldstein, 2006; Gilmour, Thompson, & Cullis, 1995; Goldstein, 1995; Pinheiro & Bates, 2000; Snijders & Bosker, 1999).

Software for mixed-effects models is now widely available, in specialized commercial packages such as MLwiN (MLwiN, 2007) and ASReml (Gilmour, Gogel, Cullis, Welham, & Thompson, 2002), in general commercial packages such as SAS and SPSS (the MIXED procedures), and in the open source statistical programming environment R (Bates, 2007). West, Welch, and Galecki (2007) provide a guide to mixed models for five different software packages.

In this paper, we introduce a relatively recent development in computational statistics, namely, the possibility to include subjects and items as crossed, independent, random effects, as opposed to hierarchical or multilevel models in which random effects are assumed to be nested. This distinction is sometimes absent in general treatments of these models, which tend to focus on nested models. The recent textbook by West et al. (2007), for instance, does not discuss models with crossed random effects, although it clearly distinguishes between nested and crossed random effects, and advises the reader to make use of the lmer() function in R, the software (developed by the third author) that we introduce in the present study, for the analysis of crossed data.

Traditional approaches to random effects modeling suffer multiple drawbacks which can be eliminated by adopting mixed effect linear models. These drawbacks include (a) deficiencies in statistical power related to the problems posed by repeated observations, (b) the lack of a flexible method of dealing with missing data, (c) disparate methods for treating continuous and categorical responses, as well as (d) unprincipled methods of modeling heteroskedasticity and non-spherical error variance (for either participants or items). Methods for estimating linear mixed effect models have addressed each of these concerns, and offer a better approach than univariate ANOVA or ordinary least squares regression.

In what follows, we first introduce the concepts and formalism of mixed effects modeling.

Mixed effects model concepts and formalism

The concepts involved in a linear mixed effects model will be introduced by tracing the data analysis path of a simple example. Assume an example data set with three participants s1, s2 and s3 who each saw three items w1, w2, w3 in a priming lexical decision task under both short and long SOA conditions. The design, the RTs and their constituent fixed and random effects components are shown in Table 1.

This table is divided into three sections. The leftmost section lists subjects, items, the two levels of the SOA factor, and the reaction times for each combination of subject, item and SOA. This section represents the data available to the analyst. The remaining sections of the table list the effects of SOA and the properties of the subjects and items that underlie the RTs. Of these remaining sections, the middle section lists the fixed effects: the intercept (which is the same for all observations) and the effect of SOA (a 19 ms processing advantage for the short SOA condition). The right section of the table lists the random effects in the model. The first column in this section lists by-item adjustments to the intercept, and the second column lists by-subject adjustments to the intercept. The third column lists by-subject adjustments to the effect of SOA. For instance, for the first subject the effect of a short SOA is attenuated by 11 ms. The final column lists the by-observation noise. Note that in this example we did not include by-item adjustments to SOA, even though we could have done so. In the terminology of mixed effects modeling, this data set is characterized by random intercepts for both subject and item, and by by-subject random slopes (but no by-item random slopes) for SOA.

Formally, this dataset is summarized in (1):

y_ij = X_ij β + S_i s_i + W_j w_j + ε_ij    (1)

The vector y_ij represents the responses of subject i to item j. In the present example, each of the vectors y_ij comprises two response latencies, one for the short and one for the long SOA. In (1), X_ij is the design matrix, consisting of an initial column of ones followed by columns representing factor contrasts and covariates. For the present example, the design matrix for each subject-item combination has the simple form

Table 1
Example data set with random intercepts for subject and item, and random slopes for subject

Subj  Item  SOA    RT    Int    SOA   ItemInt  SubInt  SubSOA    Res
s1    w1    Long   466  522.2     0    -28.3   -26.2      0      -2.0
s1    w2    Long   520  522.2     0     14.2   -26.2      0       9.8
s1    w3    Long   502  522.2     0     14.1   -26.2      0      -8.2
s1    w1    Short  475  522.2   -19    -28.3   -26.2     11.0    15.4
s1    w2    Short  494  522.2   -19     14.2   -26.2     11.0    -8.4
s1    w3    Short  490  522.2   -19     14.1   -26.2     11.0   -11.9
s2    w1    Long   516  522.2     0    -28.3    29.7      0      -7.4
s2    w2    Long   566  522.2     0     14.2    29.7      0      -0.1
s2    w3    Long   577  522.2     0     14.1    29.7      0      11.5
s2    w1    Short  491  522.2   -19    -28.3    29.7    -12.5    -1.5
s2    w2    Short  544  522.2   -19     14.2    29.7    -12.5     8.9
s2    w3    Short  526  522.2   -19     14.1    29.7    -12.5    -8.2
s3    w1    Long   484  522.2     0    -28.3    -3.5      0      -6.3
s3    w2    Long   529  522.2     0     14.2    -3.5      0      -3.5
s3    w3    Long   539  522.2     0     14.1    -3.5      0       6.0
s3    w1    Short  470  522.2   -19    -28.3    -3.5      1.5    -2.9
s3    w2    Short  511  522.2   -19     14.2    -3.5      1.5    -4.6
s3    w3    Short  528  522.2   -19     14.1    -3.5      1.5    13.2

The first four columns of this table constitute the data normally available to the researcher. The remaining columns parse the RTs into the contributions from the fixed effects (Int, SOA) and the random effects (ItemInt, SubInt, SubSOA, Res). Int: intercept; SOA: contrast effect for SOA; ItemInt: by-item adjustments to the intercept; SubInt: by-subject adjustments to the intercept; SubSOA: by-subject adjustments to the slope of the SOA contrast; Res: residual noise.
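The additive structure of Table 1 can be verified directly: up to rounding of the RTs to whole milliseconds, each reaction time is the sum of the intercept and the remaining entries in its row. The following Python check for the first subject is an illustration only, with the values and signs taken from Table 1 and Eqs. (7) and (8):

```python
# Columns: SOA contrast, ItemInt, SubInt, SubSOA, Res, observed RT (subject s1)
rows = [
    (  0.0, -28.3, -26.2,  0.0,  -2.0, 466),  # w1, long
    (  0.0,  14.2, -26.2,  0.0,   9.8, 520),  # w2, long
    (  0.0,  14.1, -26.2,  0.0,  -8.2, 502),  # w3, long
    (-19.0, -28.3, -26.2, 11.0,  15.4, 475),  # w1, short
    (-19.0,  14.2, -26.2, 11.0,  -8.4, 494),  # w2, short
    (-19.0,  14.1, -26.2, 11.0, -11.9, 490),  # w3, short
]
intercept = 522.2

for soa, item_int, subj_int, subj_soa, res, rt in rows:
    fitted = intercept + soa + item_int + subj_int + subj_soa + res
    # Observed RTs are rounded, so the reconstruction agrees to within 0.5 ms.
    assert abs(fitted - rt) < 0.5, (fitted, rt)
```

Running the same check for s2 and s3 works identically; only the by-subject and by-item adjustment values change.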

X_ij = ( 1 0 )
       ( 1 1 )    (2)

and is the same for all subjects i and items j. The design matrix is multiplied by the vector of population coefficients β. Here, this vector takes the form

β = ( 522.2 )
    ( -19.0 )    (3)

where 522.2 is the coefficient for the intercept, and -19 the contrast for the short as opposed to the long SOA. The result of this multiplication is a vector that again is identical for each combination of subject and item:

X_ij β = ( 522.2 )
         ( 503.2 )    (4)

It provides the group means for the long and short SOA. These group means constitute the model's best guess about the expected latencies for the population, i.e., for unseen subjects and unseen items.

The next two terms in Eq. (1) serve to make the model's predictions more precise for the subjects and items actually examined in the experiment. First consider the random effects structure for Subject. The S_i matrix (in this example) is a full copy of the X_ij matrix. It is multiplied with a vector specifying for subject i the adjustments that are required for this subject to the intercept and to the SOA contrast coefficient. For the first subject in Table 1,

S_1 s_1 = ( 1 0 ) ( -26.2 )   ( -26.2 )
          ( 1 1 ) (  11.0 ) = ( -15.2 )    (5)

which tells us, first, that for this subject the intercept has to be adjusted downwards by 26.2 ms for both the long and the short SOA (the subject is a fast responder) and second, that in the short SOA condition the effect of SOA for this subject is attenuated by 11.0 ms. Combined with the adjustment for the intercept that also applies to the short SOA condition, the net outcome for an arbitrary item in the short SOA condition is -15.2 ms for this subject.

Further precision is obtained by bringing the item random effect into the model. The W_j matrix is again a copy of the design matrix X_ij. In the present example, only the first column, the column for the intercept, is retained. This is because in this particular constructed data set the effect of SOA does not vary systematically with item. The vector w_j therefore contains one element only for each item j. This element specifies the adjustment made to the population intercept to calibrate the expected values for the specific processing costs associated with this individual item. For item 1 in our example, this adjustment is -28.3, indicating that compared to the population average, this particular item elicited

shorter latencies, for both SOA conditions, across all subjects.

W_j w_1 = ( 1 ) ( -28.3 ) = ( -28.3 )
          ( 1 )             ( -28.3 )    (6)

The model specification (1) has as its last term the vector of residual errors ε_ij, which in our running example has two elements for each combination of subject and item, one error for each SOA.

For subject 1, Eq. (1) formalizes the following vector of sums,

                                       ( 522.2 +  0 - 26.2 +  0 - 28.3 -  2.0 )
                                       ( 522.2 +  0 - 26.2 +  0 + 14.2 +  9.8 )
y_1 = X_1j β + S_1 s_1 + W_j w_j + ε_1j = ( 522.2 +  0 - 26.2 +  0 + 14.1 -  8.2 )    (7)
                                       ( 522.2 - 19 - 26.2 + 11 - 28.3 + 15.4 )
                                       ( 522.2 - 19 - 26.2 + 11 + 14.2 -  8.4 )
                                       ( 522.2 - 19 - 26.2 + 11 + 14.1 - 11.9 )

which we can rearrange in the form of a composite intercept, followed by a composite effect of SOA, followed by the residual error:

      ( (522.2 - 26.2 - 28.3) +     0      -  2.0 )
      ( (522.2 - 26.2 + 14.2) +     0      +  9.8 )
y_1 = ( (522.2 - 26.2 + 14.1) +     0      -  8.2 )    (8)
      ( (522.2 - 26.2 - 28.3) + (-19 + 11) + 15.4 )
      ( (522.2 - 26.2 + 14.2) + (-19 + 11) -  8.4 )
      ( (522.2 - 26.2 + 14.1) + (-19 + 11) - 11.9 )

In this equation for y_1 the presence of by-subject random slopes for SOA and the absence of by-item random slopes for SOA is clearly visible.

The subject matrix S and the item matrix W can be combined into a single matrix often written as Z, and the subject and item random effects s and w can likewise be combined into a single vector generally referenced as b, leading to the general formulation

y = Xβ + Zb + ε.    (9)

To complete the model specification, we need to be precise about the random effects structure of our data. Recall that a random variable is defined as a normal variate with zero mean and unknown standard deviation. Sample estimates (derived straightforwardly from Table 1) for the standard deviations of the four random effects in our example are σ̂_s(int) = 28.11 for the by-subject adjustments to the intercepts, σ̂_s(soa) = 9.65 for the by-subject adjustments to the contrast coefficient for SOA, σ̂_i = 24.50 for the by-item adjustments to the intercept, and σ̂ = 8.55 for the residual error.

Because the random slopes and intercept are pairwise tied to the same observational units, they may be correlated. For our data, ρ̂_s(int,soa) = -0.71. These four random effects parameters complete the specification of the quantitative structure of our dataset. We can now present the full formal specification of the corresponding mixed-effects model,

y = Xβ + Zb + ε,   ε ~ N(0, σ²I),   b ~ N(0, σ²Σ),   b ⊥ ε,    (10)

where Σ is the relative variance-covariance matrix for the random effects. The symbol ⊥ indicates independence of random variables and N denotes the multivariate normal (Gaussian) distribution. We say that matrix Σ is the relative variance-covariance matrix of the random effects in the sense that it is the variance of b relative to σ², the scalar variance of the per-observation noise term ε. The variance-covariance specification of the model is an important tool to capture non-independence (asphericity) between observations.

Hypotheses about the structure of the variance-covariance matrix can be tested by means of likelihood ratio tests. Thus, we can formally test whether a random effect for items is required and that the presence of the parameter σ_i in the model specification is actually justified. Similarly, we can inquire whether a parameter for the covariance of the by-subject slopes and intercepts contributes significantly to the model's goodness of fit. We note that in this approach it is an empirical question whether random effects for item or subject are required in the model.

When a mixed-effects model is fitted to a data set, its set of estimated parameters includes the coefficients for the

fixed effects on the one hand, and the standard deviations and correlations for the random effects on the other hand. The individual values of the adjustments made to intercepts and slopes are calculated once the random-effects parameters have been estimated. Formally, these adjustments, referenced as Best Linear Unbiased Predictors (or BLUPs), are not parameters of the model.

Data analysis

We illustrate mixed-effects modeling with R, an open-source language and environment for statistical computing (R Development Core Team, 2007), freely available at http://cran.r-project.org. The lme4 package (Bates, 2005; Bates & Sarkar, 2007) offers fast and reliable algorithms for parameter estimation (see also West et al., 2007:14) as well as tools for evaluating the model (using Markov chain Monte Carlo sampling, as explained below).

Input data for R should have the structure of the first block in Table 1, together with an initial header line specifying column names. The data for the first subject therefore should be structured as follows, using what is known as the long data format in R (and as the univariate data format in SPSS):

  Subj Item SOA   RT
1 s1   w1   short 475
2 s1   w2   short 494
3 s1   w3   short 490
4 s1   w1   long  466
5 s1   w2   long  520
6 s1   w3   long  502

We load the data, here simply an ASCII text file, into R with

> priming = read.table("ourdata.txt", header = TRUE)

SPSS data files (if brought into the long format within SPSS) can be loaded with read.spss, and csv tables (in long format) are loaded with read.csv. We fit the model of Eq. (10) to the data with

> priming.lmer = lmer(RT ~ SOA + (1|Item) + (1 + SOA|Subj), data = priming)

The dependent variable, RT, appears to the left of the tilde operator (~), which is read as 'depends on' or 'is a function of'. The main effect of SOA, our fixed effect, is specified to the right of the tilde. The random intercept for Item is specified with (1|Item), which is read as a random effect introducing adjustments to the intercept (denoted by 1) conditional on or grouped by Item. The random effects for Subject are specified as (1+SOA|Subj). This notation indicates, first of all, that we introduce by-subject adjustments to the intercept (again denoted by 1) as well as by-subject adjustments to SOA. In other words, this model includes by-subject and by-item random intercepts, and by-subject random slopes for SOA. This notation also indicates that the variances for the two levels of SOA are allowed to be different. In other words, it models potential by-subject heteroskedasticity with respect to SOA. Finally, this specification includes a parameter estimating the correlation ρ̂_s(int,soa) of the by-subject random effects for slope and intercept.

A summary of the model is obtained with

> summary(priming.lmer)
Linear mixed-effects model fit by REML
Formula: RT ~ SOA + (1|Item) + (1+SOA|Subj)
   Data: priming
  AIC   BIC logLik ML deviance REML deviance
150.0 155.4 -69.02       151.4        138.0
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Item     (Intercept) 613.73   24.774
 Subj     (Intercept) 803.07   28.338
          SOAshort    136.46   11.682   -1.000
 Residual             102.28   10.113
number of obs: 18, groups: Item, 3; Subj, 3
Fixed effects:
            Estimate Std. Error t value
(Intercept)  522.111     21.992  23.741
SOAshort     -18.889      8.259  -2.287

The summary first mentions that the model is fitted using restricted maximum likelihood estimation (REML), a modification of maximum likelihood estimation that is more precise for mixed-effects modeling. Maximum likelihood estimation seeks to find those parameter values that, given the data and our choice of model, make the model's predicted values most similar to the observed values. Discussion of the technical details of model fitting is beyond the scope of this paper. However, in the Appendix we provide some indication of the kind of issues involved.

The summary proceeds with repeating the model specification, and then lists various measures of goodness of fit. The remainder of the summary contains two subtables, one for the random effects, and one for the fixed effects.

The subtable for the fixed effects shows that the estimates for the intercept and the contrast coefficient for SOA are right on target: 522.11 for the intercept (compare 522.2 in Table 1), and -18.89 (compare -19.0). For each coefficient, its standard error and t-statistic are listed.
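The t values in this subtable are simply each estimate divided by its standard error. A quick Python check, with the values copied from the summary above, reproduces the t column (this illustrates the arithmetic only, not the model fit itself):

```python
# Fixed-effects subtable of the summary: (estimate, standard error).
fixed = {
    "(Intercept)": (522.111, 21.992),
    "SOAshort": (-18.889, 8.259),
}

# t value = estimate / standard error
t_values = {name: est / se for name, (est, se) in fixed.items()}

for name, t in t_values.items():
    print(name, round(t, 3))  # (Intercept) 23.741, SOAshort -2.287
```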

Table 2
Comparison of sample estimates and model estimates for the data of Table 1

Parameter       Sample   Model
σ̂_i              24.50   24.774
σ̂_s(int)         28.11   28.338
σ̂_s(soa)          9.65   11.681
σ̂                 8.55   10.113
ρ̂_s(int,soa)     -0.71   -1.00

Turning to the subtable of random effects, we observe that the first column lists the main grouping factors: Item, Subj and the observation noise (Residual). The second column specifies whether the random effect concerns the intercept or a slope. The third column reports the variances, and the fourth column the square roots of these variances, i.e., the corresponding standard deviations. The sample standard deviations calculated above on the basis of Table 1 compare well with the model estimates, as shown in Table 2.

The high correlation of the intercept and slope for the subject random effects (-1.00) indicates that the model has been overparameterized. We first simplify the model by removing the correlation parameter and by assuming homoskedasticity for the subjects with respect to the SOA conditions, as follows:

> priming.lmer1 = lmer(RT ~ SOA + (1|Item) + (1|Subj) + (1|SOA:Subj), data = priming)
> print(priming.lmer1, corr = FALSE)
Random effects:
 Groups   Name        Variance Std.Dev.
 SOA:Subj (Intercept)  34.039   5.8343
 Subj     (Intercept) 489.487  22.1243
 Item     (Intercept) 625.623  25.0125
 Residual             119.715  10.9414
number of obs: 18, groups: SOA:Subj, 6; Subj, 3; Item, 3
Fixed effects:
            Estimate Std. Error t value
(Intercept)  522.111     19.909   26.23
SOAshort     -18.889      7.021   -2.69

(Here and in the examples to follow, we abbreviated the R output.) The variance for the by-subject adjustments for SOA is small, and potentially redundant, so we further simplify to a model with only random intercepts for subject:

> priming.lmer2 = lmer(RT ~ SOA + (1|Item) + (1|Subj), data = priming)

In order to verify that this most simple model is justified, we carry out a likelihood ratio test (see, e.g., Pinheiro & Bates, 2000, p. 83) that compares the most specific model priming.lmer2 (which sets ρ to the specific value of zero and assumes homoskedasticity) with the more general model priming.lmer (which does not restrict ρ a priori and explicitly models heteroskedasticity). The likelihood of the more general model (L_g) should be greater than the likelihood of the more specific model (L_s), and hence the likelihood ratio test statistic 2 log(L_g/L_s) > 0. If g is the number of parameters for the general model, and s the number of parameters for the restricted model, then the asymptotic distribution of the likelihood ratio test statistic, under the null hypothesis that the restricted model is sufficient, follows a chi-squared distribution with g - s degrees of freedom. In R, the likelihood ratio test is carried out with the anova function:

> anova(priming.lmer, priming.lmer2)
              Df     AIC     BIC  logLik  Chisq Chi Df Pr(>Chisq)
priming.lmer2  4 162.353 165.914 -77.176
priming.lmer   6 163.397 168.740 -75.699 2.9555      2     0.2282

The value listed under Chisq equals twice the difference between the log-likelihoods (listed under logLik) of priming.lmer and priming.lmer2. The degrees of freedom for the chi-squared distribution, 2, is the difference between the numbers of parameters in the two models (listed under Df). It is clear that the removal of the parameter for the correlation together with the parameter for by-subject random slopes for SOA is justified (χ²(2) = 2.96, p = 0.228). The summary of the simplified model

> print(priming.lmer2, corr = FALSE)
Random effects:
 Groups   Name        Variance Std.Dev.
 Item     (Intercept) 607.72   24.652
 Subj     (Intercept) 499.22   22.343
 Residual             137.35   11.720
number of obs: 18, groups: Item, 3; Subj, 3
Fixed effects:
            Estimate Std. Error t value
(Intercept)  522.111     19.602  26.636
SOAshort     -18.889      5.525  -3.419

lists only random intercepts for subject and item, as in mixed-eects models reasonably easily and quickly,
desired. even for complex models t to very large observational
The reader may have noted that summaries for model data sets. However, the temptation to perform hypothe-
objects tted with lmer list standard errors and t-statis- sis tests using t-distributions or F-distributions based on
tics for the xed eects, but no p-values. This is not with- certain approximations of the degrees of freedom in
out reason. these distributions persists. An exact calculation can be
With many statistical modeling techniques we can derived for certain models with a comparatively simple
derive exact distributions for certain statistics calculated structure applied to exactly balanced data sets, such as
from the data and use these distributions to perform occur in text books. In real-world studies the data often
hypothesis tests on the parameters, or to create con- end up unbalanced, especially in observational studies
dence intervals or condence regions for the values of but even in designed experiments where missing data
these parameters. The general class of linear models t can and do occur, and the models can be quite compli-
by ordinary least squares is the prime example of such cated. The simple formulas for the degrees of freedom
a well-behaved class of statistical models for which we for inferences based on t or F-distributions do not apply
can derive exact results, subject to certain assumptions in such cases. In fact, the pivotal quantities for such
on the distribution of the responses (normal, constant hypothesis tests do not even have t or F-distributions
variance and independent disturbance terms). This gen- in such cases so trying to determine the correct value
eral paradigm provides many of the standard techniques of the degrees of freedom to apply is meaningless. There
of modern applied statistics including t-tests and analy- are many approximations in use for hypothesis tests in
sis of variance decompositions, as well as condence mixed modelsthe MIXED procedure in SAS oers 6 dif-
intervals based on t-distributions. It is tempting to ferent calculations of degrees of freedom in certain tests,
believe that all statistical techniques should provide such each leading to dierent p-values, but none of them is
neatly packaged results, but they dont. correct.
Inferences regarding the xed-eects parameters are It is not even obvious how to count the number of
more complicated in a linear mixed-eects model than parameters in a mixed-eects model. Suppose we have
in a linear model with xed eects only. In a model with 1000 subjects, each exposed to 200 items chosen from
only xed eects we estimate these parameters and one a pool of 10000 potential items. If we model the eect
other parameter which is the variance of the noise that of subject and item as independent random eects we
infects each observation and that we assume to be inde- add two variance components to the model. At the esti-
pendent and identically distributed (i.i.d.) with a normal mated parameter values we can evaluate 1000 predictors
(Gaussian) distribution. The initial work by William of the random eects for subject and 10000 predictors of
Gossett (who wrote under the pseudonym of Student) the random eects for item. Did we only add two param-
on the eect of estimating the variance of the distur- eters to the model when we incorporated these 11000
bances on the estimates of precision of the sample mean, random eects? Or should we say that we added several
leading to the t-distribution, and later generalizations by thousand parameters that are adjusted to help explain
Sir Ronald Fisher, providing the analysis of variance, the observed variation in the data? It is overly simplistic
were turning points in 20th century statistics. to say that thousands of random eects amount to only
When mixed-eects models were rst examined, in two parameters. However, because of the shrinkage
that days when the computing tools were considerably eect in the evaluation of the random eects, each ran-
less sophisticated than at present, many approximations dom eect does not represent an independent parameter.
were used, based on analogy to xed-eects analysis of Fortunately, we can avoid this issue of counting
variance. For example, variance components were often parameters or, more generally, the issue of approximat-
estimated by calculating certain mean squares and ing degrees of freedom. Recall that the original purpose
equating the observed mean square to the corresponding of the t and F-distributions is to take into account the
expected mean square. There is no underlying objective, imprecision in the estimate of the variance of the ran-
such as the log-likelihood or the log-restricted-likeli- dom disturbances when formulating inferences regard-
hood, that is being optimized by such estimates. They ing the xed-eects parameters. We can approach this
are simply assumed to be desirable because of the anal- problem in the more general context with Markov chain
ogy to the results in the analysis of variance. Further- Monte Carlo (MCMC) simulations. In MCMC simulations
more, the theoretical derivations and corresponding we sample from conditional distributions of parameter
calculations become formidable in the presence of multi- subsets in a cycle, thus allowing the variation in one
ple factors, such as both subject and item, associated parameter subset, such as the variance of the random
with random eects or in the presence of unbalanced disturbances or the variances and covariances of ran-
data. dom eects, to be reected in the variation of other
Fortunately, it is now possible to evaluate the maximum likelihood or the REML estimates of the parameters for such models, and to use MCMC sampling to assess the uncertainty about parameter subsets, such as the fixed effects. This is what the t and F-distributions accomplish in the case of models with fixed-effects only. Crucially, the MCMC technique applies to more general models and to data sets with arbitrary structure.

Informally, we can conceive of Markov chain Monte Carlo (MCMC) sampling from the posterior distribution of the parameters (see, e.g., Andrieu, de Freitas, Doucet, & Jordan, 2003, for a general introduction to MCMC) as a random walk in parameter space. Each mixed-effects model is associated with a parameter vector, which can be divided into three subsets:

1. the variance, σ², of the per-observation noise term,
2. the parameters that determine the variance-covariance matrix of the random effects, and
3. the random effects b̂ and the fixed effects β̂.

Conditional on the other two subsets and on the data, we can sample directly from the posterior distribution of the remaining subset. For the first subset we sample from a chi-squared distribution conditional on the current residuals. The prior for the variances and covariances of the random effects is chosen so that for the second subset we sample from a Wishart distribution. Finally, conditional on the first two subsets and on the data, the sampling for the third subset is from a multivariate normal distribution. The details are less important than the fact that these are well-accepted non-informative priors for these parameters. Starting from the REML estimates of the parameters in the model, we cycle through these steps many times to generate a sample from the posterior distribution of the parameters. The mcmcsamp function produces such a sample, for which we plot the estimated densities on a log scale.

> mcmc = mcmcsamp(priming.lmer2, n = 50000)
> densityplot(mcmc, plot.points = FALSE)

The resulting plot is shown in Fig. 1. We can see that the posterior density of the fixed-effects parameters is reasonably symmetric and close to a normal (Gaussian) distribution.

[Fig. 1: five density panels, labeled log(Subj.(In)), log(Item.(In)), log(sigma^2), SOAshort, and (Intercept).]

Fig. 1. Empirical density estimates for the Markov chain Monte Carlo sample for the posterior distribution of the parameters in the model for the priming data with random intercepts only (priming.lmer2). From top to bottom: log σ² (subject intercepts), log σ² (item intercepts), log σ² (residual), β (SOAshort), β (Intercept).
That the posterior of the fixed-effects parameters is approximately normal is generally the case for such parameters. After we have checked this, we can evaluate p-values from the sample with an ancillary function defined in the languageR package, which takes a fitted model as input and generates by default 10,000 samples from the posterior distribution:

> mcmc = pvals.fnc(priming.lmer2, nsim = 10000)
> mcmc$fixed
            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)   522.11   521.80     435.45    616.641 0.0012   0.0000
SOAshort      -18.89   -18.81     -32.09     -6.533 0.0088   0.0035

We obtain p-values for only the first two parameters (the fixed effects). The first two columns show that the model estimates and the mean estimate across MCMC samples are highly similar, as expected. The next two columns show the upper and lower 95% highest posterior density intervals (see below). The final two columns show p-values based on the posterior distribution (pMCMC) and on the t-distribution, respectively. The degrees of freedom used for the t-distribution by pvals.fnc() is an upper bound: the number of observations minus the number of fixed-effects parameters. As a consequence, p-values calculated with these degrees of freedom will be anti-conservative for small samples.[1]

[1] For data sets characteristic for studies of memory and language, which typically comprise many hundreds or thousands of observations, the particular value of the number of degrees of freedom is not much of an issue. Whereas the difference between 12 and 15 degrees of freedom may have important consequences for the evaluation of significance associated with a t statistic obtained for a small data set, the difference between 612 and 615 degrees of freedom has no noticeable consequences. For such large numbers of degrees of freedom, the t distribution has converged, for all practical purposes, to the standard normal distribution. For large data sets, significance at the 5% level in a two-tailed test for the fixed-effects coefficients can therefore be gauged informally by checking the summary for whether the absolute value of the t-statistic exceeds 2.

The distributions of the log-transformed variance parameters are also reasonably symmetric, although some rightward skewing is visible in Fig. 1. Without the log transformation, this skewing would be much more pronounced: The untransformed distributions would not be approximated well by a normal distribution with mean equal to the estimate and standard deviation equal to a standard error. That the distribution of the variance parameters is not symmetric should not come as a surprise. The use of a χ² distribution for a variance estimate is taught in most introductory statistics courses. As Box and Tiao (1992) emphasize, the logarithm of the variance is a more natural scale on which to assume symmetry.

For each of the panels in Fig. 1 we calculate a Bayesian highest posterior density (HPD) confidence interval. For each parameter the HPD interval is constructed from the empirical cumulative distribution function of the sample as the shortest interval for which the difference in the empirical cumulative distribution function values of the endpoints is the nominal probability. In other words, the intervals are calculated to have 95% probability content. There are many such intervals. The HPD intervals are the shortest intervals with the given probability content. Because they are created from a sample, these intervals are not deterministic: taking another sample gives slightly different values. The HPD intervals for the fixed effects in the present example are listed in the output of pvals.fnc(), as illustrated above. The standard 95% confidence intervals for the fixed-effects parameters, according to β̂i ± t(α/2, ν)·s(β̂i), with the upper bound for the degrees of freedom (18 − 2 = 16), are narrower:

> coefs <- summary(priming.lmer1)@coefs
> coefs[, 1] + qt(0.975, 16) * outer(coefs[, 2], c(-1, 1))
                 [,1]       [,2]
(Intercept) 479.90683 564.315397
SOAshort    -33.77293  -4.004845

For small data sets such as the example data considered here, they give rise to less conservative inferences that may be incorrect and should be avoided.

The HPD intervals for the random effects can be obtained from the mcmc object obtained with pvals.fnc() as follows:

> mcmc$random
          MCMCmean HPD95lower HPD95upper
sigma        12.76      7.947      21.55
Item.(In)    27.54      6.379     140.96
Subj.(In)    32.62      9.820     133.47

It is worth noting that the variances for the random-effects parameters may get close to zero but will never actually be zero. Generally it would not make sense to test a hypothesis of the form H0: σ² = 0 versus HA: σ² > 0 for these parameters. Neither inverting the HPD interval nor using the empirical cumulative distribution function from the MCMC sample evaluated at zero works, because the value 0 cannot occur in the MCMC sample.
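The HPD construction described above, the shortest interval whose empirical probability content equals the nominal level, can be sketched in a few lines of base R. This is our own illustration of the idea, not the implementation used by pvals.fnc() in languageR:

```r
# Shortest interval containing `prob` of the sampled values: slide a
# window of fixed coverage over the sorted sample, keep the narrowest.
hpd.interval <- function(x, prob = 0.95) {
  x <- sort(x)
  n <- length(x)
  k <- ceiling(prob * n)              # number of points the interval must cover
  starts <- seq_len(n - k + 1)
  widths <- x[starts + k - 1] - x[starts]
  i <- which.min(widths)
  c(lower = x[i], upper = x[i + k - 1])
}

# For a symmetric posterior, the HPD interval is close to the
# familiar equal-tailed interval:
set.seed(1)
hpd.interval(rnorm(10000))
```

For a right-skewed sample, such as an untransformed variance, the HPD interval shifts toward the mode relative to the equal-tailed interval, which is precisely why it is the appropriate summary here.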
Using the estimate of the variance (or the standard deviation) and a standard error to create a z statistic is, in our opinion, nonsense, because we know that the distribution of the parameter estimates is not symmetric and does not converge to a normal distribution. We therefore recommend likelihood ratio tests for evaluating whether including a random-effects parameter is justified. As illustrated above, we fit a model with and without the variance component and compare the quality of the fits. The likelihood ratio is a reasonable test statistic for the comparison, but we note that the asymptotic reference distribution of a χ² does not apply because the parameter value being tested is on the boundary. Therefore, the p-value computed using the χ² reference distribution is conservative for variance parameters. For correlation parameters, which can be both positive and negative, this caveat does not apply.

Key advantages of mixed-effects modeling

An important new possibility offered by mixed-effects modeling is to bring effects that unfold during the course of an experiment into account, and to consider other potentially relevant covariates as well.

There are several kinds of longitudinal effects that one may wish to consider. First, there are effects of learning or fatigue. In chronometric experiments, for instance, some subjects start out with very short response latencies, but as the experiment progresses, they find that they cannot keep up their fast initial pace, and their latencies progressively increase. Other subjects start out cautiously, and progressively tune in to the task and respond more and more quickly. By means of counterbalancing, adverse effects of learning and fatigue can be neutralized, in the sense that the risk of confounding these effects with critical predictors is reduced. However, the effects themselves are not brought into the statistical model, and consequently experimental noise remains in the data, rendering more difficult the detection of significance for the predictors of interest when subsets of subjects are exposed to the same lists of items.

Second, in chronometric paradigms, the response to a target trial is heavily influenced by how the preceding trials were processed. In lexical decision, for instance, the reaction time to the preceding word in the experiment is one of the best predictors for the target latency, with effect sizes that may exceed that of the word frequency effect. Often, this predictivity extends from the immediately preceding trial to several additional preceding trials. This major source of experimental noise should be brought under statistical control, at the risk of failing to detect otherwise significant effects.

Third, qualitative properties of preceding trials should be brought under statistical control as well. Here, one can think of whether the response to the preceding trial in a lexical decision task was correct or incorrect, whether the preceding item was a word or a nonword, a noun or a verb, and so on.

Fourth, in tasks using long-distance priming, longitudinal effects are manipulated on purpose. Yet the statistical methodology of the past decades allowed priming effects to be evaluated only after averaging over subjects or items. However, the details of how a specific prime was processed by a specific subject may be revealing about how that subject processes the associated target presented later in the experiment.

Because mixed-effects models do not require prior averaging, they offer the possibility of bringing all these kinds of longitudinal effects straightforwardly into the statistical model. In what follows, we illustrate this advantage for a long-distance priming experiment reported in de Vaan, Schreuder, and Baayen (2007). Their lexical decision experiment used long-term priming (with 39 trials intervening between prime and target) to probe budding frequency effects for morphologically complex neologisms. Neologisms were preceded by two kinds of prime: the neologism itself (identity priming) or its base word (base priming). The data are available in the languageR package in the CRAN archives (http://cran.r-project.org, see Baayen, 2008, for further documentation on this package) under the name primingHeidPrevRT. After attaching this data set, we fit an initial model with Subject and Word as random effects and priming Condition as fixed-effect factor.

> attach(primingHeidPrevRT)
> print(lmer(RT ~ Condition + (1|Word) + (1|Subject)), corr = FALSE)
Random effects:
 Groups   Name        Variance  Std.Dev.
 Word     (Intercept) 0.0034119 0.058412
 Subject  (Intercept) 0.0408438 0.202098
 Residual             0.0440838 0.209962

number of obs: 832, groups: Word, 40; Subject, 26

Fixed effects:
              Estimate Std. Error t value
(Intercept)    6.60297    0.04215  156.66
Conditionheid  0.03127    0.01467    2.13

The positive contrast coefficient for Condition and the t > 2 in the summary suggest that long-distance identity priming would lead to significantly longer response latencies compared to base priming.
However, this counterintuitive inhibitory priming effect is no longer significant when the decision latency at the preceding trial (RTmin1) is brought into the model,

> print(lmer(RT ~ log(RTmin1) + Condition + (1|Word) + (1|Subject)), corr = FALSE)
Random effects:
 Groups   Name        Variance  Std.Dev.
 Word     (Intercept) 0.0034623 0.058841
 Subject  (Intercept) 0.0334773 0.182968
 Residual             0.0436753 0.208986

number of obs: 832, groups: Word, 40; Subject, 26

Fixed effects:
              Estimate Std. Error t value
(Intercept)    5.80465    0.22298  26.032
log(RTmin1)    0.12125    0.03337   3.633
Conditionheid  0.02785    0.01463   1.903

The latency at the preceding trial has a large effect size, with a 400 ms difference between the smallest and largest predictor values; the corresponding difference for the frequency effect was only 50 ms.

The contrast coefficient for Condition changes sign when accuracy and response latency to the prime itself, 40 trials back in the experiment, are taken into account.

> print(lmer(RT ~ log(RTmin1) + ResponseToPrime * RTtoPrime + Condition + (1|Word) + (1|Subject)), corr = FALSE)
Random effects:
 Groups   Name        Variance  Std.Dev.
 Word     (Intercept) 0.0013963 0.037367
 Subject  (Intercept) 0.0235948 0.153606
 Residual             0.0422885 0.205642

number of obs: 832, groups: Word, 40; Subject, 26

Fixed effects:
                                    Estimate Std. Error t value
(Intercept)                          4.32436    0.31520  13.720
log(RTmin1)                          0.11834    0.03251   3.640
ResponseToPrimeincorrect             1.45482    0.40525   3.590
RTtoPrime                            0.22764    0.03594   6.334
Conditionheid                       -0.02657    0.01618  -1.642
ResponseToPrimeincorrect:RTtoPrime  -0.20250    0.06056  -3.344

The table of coefficients reveals that if the prime had elicited a nonword response and the target a word response, response latencies to the target were slowed by some 100 ms, compared to when the prime elicited a word response. For such trials, the response latency to the prime was not predictive for the target. By contrast, the reaction times to primes that were accepted as words were significantly correlated with the reaction time to the corresponding targets.

After addition of log Base Frequency as a covariate and trimming of atypical outliers,

> priming.lmer = lmer(RT ~ log(RTmin1) + ResponseToPrime * RTtoPrime + BaseFrequency + Condition + (1|Word) + (1|Subject))
> print(update(priming.lmer, subset = abs(scale(resid(priming.lmer))) < 2.5), cor = FALSE)
Random effects:
 Groups   Name        Variance   Std.Dev.
 Word     (Intercept) 0.00049959 0.022351
 Subject  (Intercept) 0.02400262 0.154928
 Residual             0.03340644 0.182774

number of obs: 815, groups: Word, 40; Subject, 26

Fixed effects:
                                    Estimate Std. Error t value
(Intercept)                         4.388722   0.287621  15.259
log(RTmin1)                         0.103738   0.029344   3.535
ResponseToPrimeincorrect            1.560777   0.358609   4.352
RTtoPrime                           0.236411   0.032183   7.346
BaseFrequency                      -0.009157   0.003590  -2.551
Conditionheid                      -0.038306   0.014497  -2.642
ResponseToPrimeincorrect:RTtoPrime -0.216665   0.053628  -4.040

we observe significant facilitation from long-distance identity priming. For a follow-up experiment using self-paced reading of continuous text, latencies were likewise codetermined by the reading latencies to the words preceding in the discourse, as well as by the reading latency for the prime. Traditional averaging procedures applied to these data would either report a null effect (for self-paced reading) or would lead to a completely wrong interpretation of the data (lexical decision). Mixed-effects modeling allows us to avoid these pitfalls, and makes it possible to obtain substantially improved insight into the structure of one's experimental data.

Some common designs

Having illustrated the important analytical advantages offered by mixed-effects modeling with crossed random effects for subjects and items, we now turn to consider how mixed-effects modeling compares to traditional analysis of variance and random regression. Raaijmakers, Schrijnemakers, and Gremmen (1999) discuss two common factorial experimental designs and their analyses. In this section, we first report simulation studies using their designs, and compare the performance of current standards with the performance of mixed-effects models. Simulations were run in R (version 2.4.0) (R Development Core Team, 2007) using the lme4 package of Bates and Sarkar (2007) (see also Bates, 2005).
The code for the simulations is available in the languageR package in the CRAN archives (http://cran.r-project.org, see Baayen, 2008). We then illustrate the robustness of mixed-effects modeling to missing data for a split-plot design, and then pit mixed-effects regression against random regression, as proposed by Lorch and Myers (1990).

A design traditionally requiring quasi-F ratios

A constructed dataset discussed by Raaijmakers et al. (1999) comprises 64 observations with 8 subjects and 8 items. Items are nested under treatment: 4 items are presented with a short SOA, and 4 with a long SOA. Subjects are crossed with item. A quasi-F test, the test recommended by Raaijmakers et al. (1999), based on the mean squares decomposition shown in Table 3, shows that the effect of SOA is not significant (F(1.025, 9.346) = 1.702, p = 0.224). It is noteworthy that the model fits 64 data points with the help of 72 parameters, 6 of which are inestimable.

Table 3
Mean squares decomposition for the data exemplifying the use of quasi-F ratios in Raaijmakers et al. (1999)

              Df   Sum Sq  Mean Sq
SOA            1   8032.6   8032.6
Item           6  22174.5   3695.7
Subject        7  26251.6   3750.2
SOA*Subject    7   7586.7   1083.8
Item*Subject  42   4208.8    100.2
Residuals      0      0.0

The present data set is available in the languageR package as quasif. We fit a mixed-effects model to the data with

> quasif = lmer(RT ~ SOA + (1|Item) + (1 + SOA|Subject), data = quasif)

and inspect the estimated parameters with

> summary(quasif)
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Item     (Intercept) 448.29   21.173
 Subject  (Intercept) 861.99   29.360
          SOAshort    502.65   22.420   0.813
 Residual             100.31   10.016

number of obs: 64, groups: Item, 8; Subject, 8

Fixed effects:
            Estimate Std. Error t value
(Intercept)   540.91      14.93   36.23
SOAshort       22.41      17.12    1.31

The model summary lists four random effects: random intercepts for participants and for items, by-participant random slopes for SOA, and the residual error. Each random effect is paired with an estimate of the standard deviation that characterizes the spread of the random effects for the slopes and intercepts. Because the by-participant BLUPs for slopes and intercepts are paired observations, the model specification that we used here allows for these two random variables to be correlated. The estimate of this correlation (r = 0.813) is the final parameter of the present mixed-effects model.

The small t-value for the contrast coefficient for SOA shows that this predictor is not significant. This is clear as well from the summary of the fixed effects produced by pvals.fnc (available in the languageR package), which lists the estimates, their MCMC means, the corresponding HPD intervals, the two-tailed MCMC probability, and the two-tailed probability derived from the t-test using, as mentioned above, the upper bound for the degrees of freedom.

> pvals.fnc(quasif, nsim = 10000)
            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)   540.91   540.85     498.58     583.50 0.0001   0.0000
SOAshort       22.41    22.38     -32.88      76.29 0.3638   0.1956

The p-value for the t-test obtained with the mixed-effects model is slightly smaller than that produced by the quasi-F test. However, for the present small data set the MCMC p-value is to be used, as the p-value with the above-mentioned upper bound for the degrees of freedom is anticonservative. To see this, consider Table 4, which summarizes Type I error rate and power across simulated data sets, 1000 with and 1000 without an effect of SOA. The number of simulation runs is kept small on purpose: These simulations are provided to illustrate only main trends in power and error rate.

For each simulated data set, five analyses were conducted: a mixed-effects analysis with the anticonservative p-value based on the t-test and the appropriate p-value based on 10,000 MCMC samples generated from the posterior distribution of the parameters of the fitted mixed-effects model, a quasi-F test, a by-participant analysis, a by-item analysis, and an analysis that accepted the effect of SOA to be significant only if both the F1 and the F2 test were significant (F1 + F2, compare Forster & Dickinson, 1976). The anticonservatism of the t-test is clearly visible in Table 4.

The only procedures with nominal Type I error rates are the quasi-F test and the mixed-effects model with MCMC sampling. For data sets with few observations, the quasi-F test emerges as a good choice with somewhat greater power.
Table 4
Proportions (for 1000 simulation runs) of significant treatment effects for mixed-effects models (lmer), quasi-F tests, by-participant and by-item analyses, and the combined F1 and F2 test, for simulated models with and without a treatment effect for a data set with 8 subjects and 8 items

                          lmer: p(t)  lmer: p(MCMC)  quasi-F  By-subject  By-item  F1+F2
Without treatment effect
  α = 0.05                   0.088        0.032       0.055      0.310     0.081   0.079
  α = 0.01                   0.031        0.000       0.005      0.158     0.014   0.009
With treatment effect
  α = 0.05                                0.16        0.23
  α = 0.01                                0.04        0.09

Markov chain Monte Carlo estimates of significance are denoted by MCMC. Power is tabulated only for models with nominal Type I error rates. Too high Type I error rates are shown in bold.

Table 5
Proportions (for 1000 simulation runs) of significant treatment effects for mixed-effects models (lmer), quasi-F tests, by-participant and by-item analyses, and the combined F1 and F2 test, for simulated models with and without a treatment effect for 20 subjects and 40 items

                          lmer: p(t)  lmer: p(MCMC)  quasi-F  By-subject  By-item  F1+F2
Without treatment effect
  α = 0.05                   0.055        0.027       0.052      0.238     0.102   0.099
  α = 0.01                   0.013        0.001       0.009      0.120     0.036   0.036
With treatment effect
  α = 0.05                   0.823        0.681       0.809
  α = 0.01                   0.618        0.392       0.587

Power is tabulated only for models with nominal Type I error rates. Too high Type I error rates are shown in bold.
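The severe anticonservatism of the by-subject analysis in Table 4 has a simple cause: item variability is shared across participants but is treated as if it were replicated noise. A compact base-R simulation, our own sketch with assumed variance components rather than the languageR simulation code, reproduces the effect:

```r
# No true SOA effect: RTs contain only subject and item random effects
# plus noise. Items are nested under SOA (4 items per level), as above.
set.seed(1)
nsub <- 8; nitem <- 8
soa  <- rep(c("long", "short"), each = nitem / 2)
one.run <- function() {
  item <- rnorm(nitem, sd = 20)    # item effects, shared by all subjects
  subj <- rnorm(nsub,  sd = 30)
  rt   <- outer(subj, item, "+") + rnorm(nsub * nitem, sd = 10)
  # by-subject (F1) analysis: average over the items within each SOA level
  d <- rowMeans(rt[, soa == "short"]) - rowMeans(rt[, soa == "long"])
  t.test(d)$p.value
}
p <- replicate(1000, one.run())
mean(p < 0.05)   # far above the nominal 0.05: spurious "SOA effects"
```

Replacing the shared item effects by fresh draws for every participant, so that each participant responds to different items, brings the error rate back to the nominal level, which is exactly the source of variation that the quasi-F test and the mixed-effects model take into account.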

Most psycholinguistic experiments yield much larger numbers of data points than in the present example. Table 5 summarizes a second series of simulations in which we increased the number of subjects to 20 and the number of items to 40. As expected, the Type I error rate for the mixed-effects models evaluated with tests based on p-values using the t-test is now in accordance with the nominal levels, and power is perhaps slightly larger than the power of the quasi-F test. Evaluation using MCMC sampling is conservative for this specific fully balanced example. Depending on the costs of a Type I error, the greater power of the t-test may offset its slight anti-conservatism. In our experience, the difference between the two p-values becomes very small for data sets with thousands instead of hundreds of observations. In analyses where MCMC-based evaluation and t-based evaluation yield a very similar verdict across coefficients, exceptional disagreement, with MCMC sampling suggesting clear non-significance and the t-test suggesting significance, is a diagnostic of an unstable and suspect parameter. This is often confirmed by inspection of the parameter's posterior density.

It should be kept in mind that real-life experiments are characterized by missing data. Whereas the quasi-F test is known to be vulnerable to missing data, mixed-effects models are robust in this respect. For instance, in 1000 simulation runs (without an effect of SOA) in which 20% of the datapoints are randomly deleted before the analyses are performed, the quasi-F test emerges as slightly conservative (Type I error rate: 0.045 for α = 0.05, 0.006 for α = 0.01), whereas the mixed-effects model using the t test is on target (Type I error rate: 0.052 for α = 0.05, 0.010 for α = 0.01). Power is slightly greater for the mixed analysis evaluating probability using the t-test (α = 0.05: 0.84 versus 0.81 for the quasi-F test; α = 0.01: 0.57 versus 0.54). See also, e.g., Pinheiro and Bates (2000).

A Latin Square design

Another design discussed by Raaijmakers and colleagues is the Latin Square. They discuss a second constructed data set, with 12 words divided over 3 lists with 4 words each. These lists were rotated over participants, such that a given participant was exposed to a list for only one of three SOA conditions. There were 3 groups of 4 participants; each group of participants was exposed to unique combinations of list and SOA. Raaijmakers and colleagues recommend a by-participant analysis that proceeds on the basis of means obtained by averaging over the words in the lists. An analysis of variance is performed on the resulting data set, which lists, for each participant, three means, one for each SOA condition. This gives rise to the ANOVA decomposition shown in Table 6.
Table 6
Mean squares decomposition for the data with a Latin Square design in Raaijmakers et al. (1999)

               Df  Sum Sq  Mean Sq
Group           2    1696      848
SOA             2      46       23
List            2    3116     1558
Group*Subject   9   47305     5256
SOA*List        2      40       20
Residuals      18     527       29

The F test compares the mean squares for SOA with the mean squares of the interaction of SOA by List, and indicates that the effect of SOA is not statistically significant (F(2, 2) = 1.15, p = 0.465). As the interaction of SOA by List is not significant, Raaijmakers et al. (1999) pool the interaction with the residual error. This results in a pooled error term with 20 degrees of freedom, an F-value of 0.896, and a slightly reduced p-value of 0.42.

A mixed-effects analysis of the same data set (available as latinsquare in the languageR package) obviates the need for prior averaging. We fit a sequence of models, decreasing the complexity of the random-effects structure step by step.

> latinsquare.lmer1 = lmer2(RT ~ SOA + (1|Word) + (1|Subject) + (1|Group) + (1 + SOA|List), data = latinsquare)
> latinsquare.lmer2 = lmer2(RT ~ SOA + (1|Word) + (1|Subject) + (1|Group) + (1|List), data = latinsquare)
> latinsquare.lmer3 = lmer2(RT ~ SOA + (1|Word) + (1|Subject) + (1|Group), data = latinsquare)
> latinsquare.lmer4 = lmer2(RT ~ SOA + (1|Word) + (1|Subject), data = latinsquare)
> latinsquare.lmer5 = lmer2(RT ~ SOA + (1|Subject), data = latinsquare)
> anova(latinsquare.lmer1, latinsquare.lmer2, latinsquare.lmer3, latinsquare.lmer4, latinsquare.lmer5)
                    Df     AIC     BIC  logLik     Chisq Chi Df Pr(>Chisq)
latinsquare.lmer5.p  4 1423.41 1435.29 -707.70
latinsquare.lmer4.p  5 1186.82 1201.67 -588.41    238.59      1     <2e-16
latinsquare.lmer3.p  6 1188.82 1206.64 -588.41      0.00      1      1.000
latinsquare.lmer2.p  7 1190.82 1211.61 -588.41 1.379e-06      1      0.999
latinsquare.lmer1.p 12 1201.11 1236.75 -588.55      0.00      5      1.000

The likelihood ratio tests show that the model with Subject and Word as random effects has the right level of complexity for this data set.

> summary(latinsquare.lmer4)
Random effects:
 Groups   Name        Variance Std.Dev.
 Word     (Intercept)  754.542 27.4689
 Subject  (Intercept) 1476.820 38.4294
 Residual               96.566  9.8268

number of obs: 144, groups: Word, 12; Subject, 12

Fixed effects:
            Estimate Std. Error t value
(Intercept) 533.9583    13.7098   38.95
SOAmedium     2.1250     2.0059    1.06
SOAshort     -0.4583     2.0059   -0.23

The summary of this model lists the three random effects and the corresponding parameters: the variances (and standard deviations) for the random intercepts for subjects and items, and for the residual error. The fixed-effects part of the model provides estimates for the intercept and for the contrasts for medium and short SOA compared to the reference level, long SOA.

> pvals.fnc(latinsquare.lmer4, nsim = 10000)
            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept) 533.9583 534.0570    503.249    561.828 0.0001   0.0000
SOAmedium     2.1250   2.1258     -1.925      5.956 0.2934   0.2912
SOAshort     -0.4583  -0.4086     -4.331      3.589 0.8446   0.8196

Inspection of the corresponding p-values shows that the p-value based on the t-test and that based on MCMC sampling are very similar, and the same holds for the p-value produced by the F-test for the factor SOA (F(2, 141) = 0.944, p = 0.386) and the corresponding p-value calculated from the MCMC samples (p = 0.391). The mixed-effects analysis has slightly superior power
compared to the F1 analysis proposed by Raaijmakers et al. (1999), as illustrated in Table 7, which lists Type I error rate and power for 1000 simulation runs without and with an effect of SOA. Simulated datasets were constructed using the parameters given by latinsquare.lmer4. The upper half of Table 7 shows power and Type I error rate for the situation in which the F1 analysis includes the interaction of SOA by List; the lower half reports the case in which this interaction is pooled with the residual error. Even for the most powerful test suggested by Raaijmakers et al. (1999), the mixed-effects analysis emerges with slightly better power, while maintaining the nominal Type I error rate.

Table 7
Proportions (out of 1000 simulation runs) of significant F-tests for a Latin Square design with mixed-effects models (lmer) and a by-subject analysis (F1)

                    Without SOA                        With SOA
                 lmer: p(F)  lmer: p(MCMC)    F1    lmer: p(F)  lmer: p(MCMC)    F1
α = 0.05 With       0.055        0.053      0.052      0.262        0.257      0.092
α = 0.01 With       0.011        0.011      0.010      0.082        0.080      0.020
α = 0.05 Without    0.038        0.036      0.043      0.249        0.239      0.215
α = 0.01 Without    0.010        0.009      0.006      0.094        0.091      0.065

The upper part ("With") reports simulations in which the F1 analysis includes the interaction of List by SOA; the lower part ("Without") reports simulations in which for the F1 analysis this interaction is absent.

Further pooling of non-explanatory parameters in the F1 approach may be expected to lead to further convergence of power. The key point that we emphasize here is that the mixed-effects approach obtains this power without prior averaging. As a consequence, it is only the mixed-effects approach that affords the possibility of bringing predictors for longitudinal effects and inter-trial dependencies into the model. Likewise, the possibility of bringing covariates gauging properties of the individual words into the model is restricted to the mixed-effects analysis.

A split-plot design

Another design often encountered in psycholinguistic studies is the split-plot design. Priming studies often make use of a counterbalancing procedure with two sets of materials. Words are primed by a related prime in List A and by an unrelated prime in List B, and vice versa. Different subjects are tested on each list. This is a split-plot design, in the sense that the factor List is between subjects and the factor Priming within subjects. The following example presents an analysis of an artificial dataset (dat, available as splitplot in the languageR package) with 20 subjects and 40 items. A series of likelihood ratio tests on a sequence of models with decreasingly complex random-effects structure shows that a model with random intercepts for subject and item suffices.

Table 8
Type I error rate and power for mixed-effects modeling of 1000 simulated data sets with a split-plot design, for the full data set and a data set with 20% missing data

           Type I error rate                Power
           Full            Missing         Full            Missing
α = 0.05   0.046 (0.046)   0.035 (0.031)   0.999 (0.999)   0.995 (0.993)
α = 0.01   0.013 (0.011)   0.009 (0.007)   0.993 (0.993)   0.985 (0.982)

Evaluations based on Markov chain Monte Carlo sampling are listed in parentheses.
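For concreteness, data with this split-plot structure can be generated along the following lines. This is a base-R sketch using the variance components and effect sizes reported for this artificial example; the counterbalancing scheme and sign conventions shown here are assumptions of ours, and splitplot in languageR need not have been built exactly this way:

```r
# Hypothetical data-generating sketch for a split-plot priming design:
# List is between subjects, priming is within subjects and counterbalanced.
set.seed(10)
nsub <- 20; nitem <- 40
subj <- rep(1:nsub, each = nitem)       # every subject sees every item
item <- rep(1:nitem, times = nsub)
listB <- as.integer(subj > nsub / 2)    # half the subjects get List B
unprimed <- as.integer((item %% 2) == listB)  # assumed counterbalancing
dat.sim <- data.frame(
  subjects = factor(subj), items = factor(item),
  list = factor(listB, labels = c("listA", "listB")),
  priming = factor(unprimed, labels = c("primed", "unprimed")))
dat.sim$RT <- 400 + 30 * unprimed + 18.5 * listB +   # fixed effects
  rnorm(nsub, sd = 50)[subj] +          # subject intercepts, sd 50
  rnorm(nitem, sd = 20)[item] +         # item intercepts, sd 20
  rnorm(nsub * nitem, sd = 80)          # residual noise, sd 80
```

A data frame generated this way has the same layout as dat in the analysis that follows: 800 observations, with each item occurring primed in one list and unprimed in the other.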

> dat.lmer1 = lmer(RT ~ list * priming + (1 + priming|subjects) + (1 + list|items), data = dat)
> dat.lmer2 = lmer(RT ~ list * priming + (1 + priming|subjects) + (1|items), data = dat)
> dat.lmer3 = lmer(RT ~ list * priming + (1|subjects) + (1|items), data = dat)
> dat.lmer4 = lmer(RT ~ list * priming + (1|subjects), data = dat)
> anova(dat.lmer1, dat.lmer2, dat.lmer3, dat.lmer4)
            Df    AIC    BIC   logLik   Chisq Chi Df Pr(>Chisq)
dat.lmer4.p  5 9429.0 9452.4  -4709.5
dat.lmer3.p  6 9415.0 9443.1  -4701.5 15.9912      1  6.364e-05
dat.lmer2.p  8 9418.8 9456.3  -4701.4  0.1190      2     0.9423
dat.lmer1.p 10 9419.5 9466.3  -4699.7  3.3912      2     0.1835
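The χ² statistics in an anova table of this kind are simply twice the differences between successive log-likelihoods. For the comparison of the models with and without by-item intercepts, the arithmetic can be checked in base R:

```r
# Likelihood ratio test of dat.lmer3 against dat.lmer4, computed from
# the log-likelihoods in the anova table (rounded to one decimal there):
chisq <- 2 * (-4701.5 - (-4709.5))        # = 16, i.e. 15.9912 before rounding
pchisq(chisq, df = 1, lower.tail = FALSE) # ~ 6e-05, cf. the reported 6.364e-05
```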

> print(dat.lmer3, corr = FALSE)
Random effects:
 Groups   Name        Variance Std.Dev.
 items    (Intercept)  447.15  21.146
 subjects (Intercept) 2123.82  46.085
 Residual             6729.24  82.032

Number of obs: 800, groups: items, 40; subjects, 20

Fixed effects:
                          Estimate Std. Error t value
(Intercept)                362.658     16.382  22.137
listlistB                   18.243     23.168   0.787
primingunprimed             31.975     10.583   3.021
listlistB:primingunprimed    6.318     17.704   0.357

The estimates are close to the parameters that generated the simulated data: σi = 20, σs = 50, σ = 80, βint = 400, βpriming = 30, βlist = 18.5, βlist:priming = 0.

Table 8 lists power and Type I error rate with respect to the priming effect for 1000 simulation runs with a mixed-effect model, run once with the full data set, and once with 20% of the data points randomly deleted, using the same parameters that generated the above data set. It is clear that with the low level of by-observation noise, the presence of a priming effect is almost always detected. Power decreases only slightly for the case with missing data. Even though power is at ceiling, the Type I error rate is in accordance with the nominal levels. Note the similarity between evaluation of significance based on the (anticonservative) t-test and evaluation based on Markov chain Monte Carlo sampling. This example illustrates the robustness of mixed-effects models with respect to missing data: The present results were obtained without any data pruning and without any form of imputation.

A multiple regression design

Multiple regression designs with subjects and items, and with predictors that are tied to the items (e.g., frequency and length for items that are words) have traditionally been analyzed in two ways. One approach aggregates over subjects to obtain item means, and then proceeds with standard ordinary least squares regression. We refer to this as by-item regression. Another approach, advocated by Lorch and Myers (1990), is to fit separate regression models to the data sets elicited from the individual participants. The significance of a given predictor is assessed by means of a one-sample t-test applied to the coefficients of this predictor in the individual regression models. We refer to this procedure as by-participant regression. It is also known under the name of random regression. (From our perspective, these procedures combine precise and imprecise information on an equal footing.) Some studies report both by-item and by-participant regression models (e.g., Alegre & Gordon, 1999).

The by-participant regression is widely regarded as superior to the by-item regression. However, the by-participant regression does not take item-variability into account. To see this, compare an experiment in which each participant responds to the same set of words to an experiment in which each participant responds to a different set of words. When the same lexical predictors are used in both experiments, the by-participant analysis proceeds in exactly the same way for both. But whereas this approach is correct for the second experiment, it ignores a systematic source of variation in the case of the first experiment.

A simulation study illustrates that ignoring item variability that is actually present in the data may lead to unacceptably high Type I error rates. In this simulation study, we considered three predictors, X, Y and Z, tied to 20 items, each of which was presented to 10 participants. In one set of simulation runs, these predictors had beta weights 2, 6 and 4. In a second set of simulation runs, the beta weight for Z was set to zero. We were interested in the power and Type I error rates for Z for by-participant and for by-item regression, and for two different mixed-effects models. The first mixed-effects model that we considered included crossed random effects for par-

ticipant and item with random intercepts only. This model reflected exactly the structure implemented in the simulated data. A second mixed-effects model ignored the item structure in the data, and included only participant as a random effect. This model is the mixed-effects analogue to the by-participant regression.

Table 9 reports the proportions of simulation runs (on a total of 1000 runs) in which the coefficients of the regression model were reported as significantly different from zero at the 5% and 1% significance levels. The upper part of Table 9 reports the proportions for simulated data in which an effect of Z was absent, with βZ = 0. The lower part of the table lists the corresponding proportions for simulations in which Z was present (βZ = 4). The bolded numbers in the upper part of the table highlight the very high Type I error rates for models that ignore by-item variability that is actually present in the data. The only models that come close to the nominal Type I error rates are the mixed-effects model with crossed random effects for subject and item, and the by-item regression. The lower half of Table 9 shows that of these three models, the power of the mixed-effects model is consistently greater than that of the by-item regression. (The greater power of the by-subject models, shown in grey, is irrelevant given their unacceptably high Type I error rates.)

Table 9
Proportion of simulation runs (out of 1000) in which the coefficients for the intercept and the predictors X, Y and Z are reported as significantly different from zero according to four multiple regression models
[table body not reproduced in this copy]
lmer: mixed-effect regression with crossed random effects for subject and item; lmerS: mixed-effect model with subject as random effect; Subj: by-subject regression; Item: by-item regression.

Of the two mixed-effects models, it is only the model with crossed random effects that provides correct estimates of the standard deviations characterizing the random effects, as shown in Table 10. When the item random effect is ignored (lmerS), the standard deviation of the residual error is overestimated substantially, and the standard deviation for the subject random effect is slightly underestimated.

Table 10
Actual and estimated standard deviations for simulated regression data

        Item    Subject   Residual
Data    40      80        50
lmer    39.35   77.22     49.84
lmerS           76.74     62.05

We note that for real datasets, mixed-effects regression offers the possibility to include not only item-bound predictors, but also predictors tied to the subjects, as well as predictors capturing inter-trial dependencies and longitudinal effects.

Further issues

Some authors, e.g., Quené and Van den Bergh (2004), have argued that in experiments with subjects and items, items should be analyzed as nested under subjects. The nesting of items under participants creates a hierarchical mixed-effects model. Nesting is argued to be justified on the grounds that items may vary in familiarity across participants. For instance, if items are words, then lexical familiarity is known to vary considerably across occupations (see, e.g., Gardner, Rothkopf, Lapan, & Lafferty, 1987). Technically, however, nesting amounts to the strong assumption that there need not be any commonality at all for a given item across participants. This strong assumption is justified only when the predictors in the regression model are treatments administered to items that otherwise do not vary on dimensions that might in any way affect the outcome of the experiment. For many linguistic items, predictors are intrinsically bound to the items. For instance, when items are words, predictors such as word frequency and word length are not treatments administered to items. Instead, these predictors gauge aspects of a word's lexical properties. Furthermore, for many current studies it is unlikely that they fully exhaust all properties that co-determine lexical processing. In these circumstances, it is highly likely that there is a non-negligible residue of item-bound properties that are not brought into the model formulation. Hence a random effect for word should be considered seriously. Fortunately, mixed-effects models allow the researcher to explicitly test whether a random effect for Item is required, by means of a likelihood ratio test comparing a model with and without a random effect for item. In our experience, such tests almost invariably show that a random effect for item is required, and the resulting models provide a tighter fit to the data.

Mixed-effects regression with crossed random effects for participants and items has further advantages to offer. One advantage is shrinkage estimates for the

BLUPs (the subject and item specific adjustments to intercepts and slopes), which allow enhanced prediction for these items and subjects (see, e.g., Baayen, 2008, for further discussion). Another important advantage is the possibility to include simultaneously predictors that are tied to the items (e.g., frequency, length) and predictors that are tied to participants (e.g., handedness, age, gender). Mixed-effects models have also been extended to generalized linear models and can hence be used efficiently to model binary response data such as accuracy in lexical decision (see Jaeger, this volume).

To conclude, we briefly address the question of the extent to which an effect observed to be significant in a mixed-effects analysis generalizes across both subjects and items (see Forster, this issue). The traditional interpretation of the F1 (by-subject) and F2 (by-item) analyses is that significance in the F1 analysis would indicate that the effect is significant for all subjects, and that the F2 analysis would indicate that the effect holds for all items. We believe this interpretation is incorrect. In fact, even if we replace the F1+F2 procedure by a mixed-effects model, the inference that the effect would generalize across all subjects and items remains incorrect. The fixed-effect coefficients in a mixed-effect model are estimates of the intercept, slopes (for numeric predictors) or contrasts (for factors) in the population for the average, unknown subject and the average, unknown item. Individual subjects and items may have intercepts and slopes that diverge considerably from the population means. For ease of exposition, we distinguish three possible states of affairs for what in the traditional terminology would be described as an Effect by Item interaction.

First, it is conceivable that the BLUPs for a given fixed-effect coefficient, when added to that coefficient, never change its sign. In this situation, the effect indeed generalizes across all subjects (or items) sampled in the experiment. Other things being equal, the partial effect of the predictor quantified by this coefficient will be highly significant.

Second, situations arise in which adding the BLUPs to a fixed coefficient results in a majority of by-subject (or by-item) coefficients that have the same sign as the population estimate, in combination with a relatively small minority of by-subject (or by-item) coefficients with the opposite sign. The partial effect represented by the population coefficient will still be significant, but there will be less reason for surprise. The effect generalizes to a majority, but not to all, subjects or items. Nevertheless, we can be confident about the magnitude and sign of the effect on average, for unknown subjects or items, if the subjects and items are representative of the population from which they are sampled.

Third, the by-subject (or by-item) coefficients obtained by taking the BLUPs into account may result in a set of coefficients with roughly equal numbers of coefficients that are positive and coefficients that are negative. In this situation, the main effect (for a numeric predictor or a binary contrast) will not be significant, in contradistinction to the significance of the random effect for the slopes or contrasts at issue. In this situation, there is a real and potentially important effect, but averaged across subjects or items, it cancels out to zero.

In the field of memory and language, experiments that do not yield a significant main effect are generally considered to have failed. However, an experiment resulting in this third state of affairs may constitute a positive step forward for our understanding of language and language processing. Consider, by way of example, a pharmaceutical company developing a new medicine, and suppose this medicine has adverse side effects for some, but highly beneficial effects for other patients, patients for which it is an effective life-saver. The company could decide not to market the medicine because there is no main effect. However, they can actually make substantial profit by bringing it on the market with warnings for adverse side effects and proper distributional controls.

Returning to our own field, we know that no two brains are the same, and that different brains have different developmental histories. Although in the initial stages of research the available technology may only reveal the most robust main effects, the more our research advances, the more likely it will become that we will be able to observe systematic individual differences. Ultimately, we will need to bring these individual differences into our theories. Mixed-effect models have been developed to capture individual differences in a principled way, while at the same time allowing generalizations across populations. Instead of discarding individual differences across subjects and items as an uninteresting and disappointing nuisance, we should embrace them. It is not to the advantage of scientific progress if systematic variation is systematically ignored.

Hierarchical models in developmental and educational psychology

Thus far, we have focussed on designs with crossed random effects for subjects and items. In educational and developmental research, designs with nested random effects are often used, such as the natural hierarchy formed by students nested within a classroom (Goldstein, 1987). Such designs can also be handled by mixed-effects models, which are then often referred to as hierarchical linear models or multilevel models.

Studies in educational settings are often focused on learning over time, and techniques developed for this type of data often attempt to characterize how individuals' performance or knowledge changes over time, termed the analysis of growth curves (Goldstein, 1987, 1995; Goldstein et al., 1993; Nutall, Goldstein, Prosser,

& Rasbash, 1989; Willet, Singer, & Martin, 1998). Examples of this include the assessment of different teaching techniques on students' performance (Aitkin, Anderson, & Hinde, 1981), and the comparison of the effectiveness of different schools (Aitkin & Longford, 1986). Goldstein et al. (1993) used multilevel techniques to study the differences between schools and students when adjusting for pre-existing differences when students entered classes. For a methodological discussion of the use of these models, see the collection of articles in the Summer 1995 special issue of the Journal of Educational and Behavioral Statistics on hierarchical linear models, e.g., Kreft (1995). Singer (1998) provides a practical introduction to multilevel models including demonstration code, and Collins (2006) provides a recent overview of issues in longitudinal data analysis involving these models. Finally, Fielding and Goldstein (2006) provide a comprehensive overview of multilevel and cross-classified models applied to education research, including a brief software review. West et al. (2007) provide a comprehensive software review for nested mixed-effects models.

These types of models are also applicable to psycholinguistic research, especially in studies of developmental change. Individual speakers from a language community are often members of a hierarchy, e.g., language:dialect:family:speaker, and many studies focus on learning or language acquisition, and thus analysis of change or development is important. Huttenlocher, Haight, Bryk, and Seltzer (1991) used multilevel models to assess the influence of parental or caregiver speech on vocabulary growth, for example. Boyle and Willms (2001) provide an introduction to the use of multilevel models to study developmental change, with an emphasis on growth curve modeling and discrete outcomes. Raudenbush (2001) reviews techniques for analyzing longitudinal designs in which repeated measures are used. Recently, Misangyi, LePine, Algina, and Goeddeke (2006) compared repeated measures regression to multivariate ANOVA (MANOVA) and multilevel analysis in research designs typical for organizational and behavioral research, and concluded that multilevel analysis can provide results equivalent to MANOVA, and that in cases where specific assumptions about variance-covariance structures can be made, or where missing values are present, multilevel modeling is a better, and in some cases a necessary, analysis strategy (see also Kreft & de Leeuw, 1998 and Snijders & Bosker, 1999).

Finally, a vast body of work in educational psychology concerns test construction and the selection of test items (Lord & Novick, 1968). Although it is beyond the scope of this article to review this work, it should be noted that work within generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) has been concerned with the problem of crossed subject and item factors using random effects models (Schroeder & Hakstian, 1990). For a recent application of the software considered here to item response theory, see Doran, Bates, Bliese, and Dowling (2007), and for the application of hierarchical models to joint response type and response time measures, see Fox, Entink, and van der Linden (2007).

Mixed-effects models in neuroimaging

In neuroimaging, two-level or mixed effects models are now a standard analysis technique (Friston et al., 2002a, 2002b; Worsley et al., 2002), and are used in conjunction with Gaussian Random Field theory to make inferences about activity patterns in very large data sets (voxels from fMRI scans). These techniques are formally comparable to the techniques that are advocated in this paper (Friston, Stephan, Lund, Morcom, & Kiebel, 2005). Interestingly, however, the treatment of stimuli as random effects had not been widely addressed in the imaging and physiological community until recently (Bedny, Aguirre, & Thompson-Schill, 2007).

In imaging studies that compare experimental conditions, for example, statistical parameter maps (SPM; Friston et al., 1995) are calculated based on successively recorded time series for the different experimental conditions. A hypothesized hemodynamic response function is convolved with a function that encodes the experimental design matrix, and this forms a regressor for each of the time series in each voxel. Significant parameters for the regressors are taken as evidence of activity in the voxels that exhibit greater or less activity than is expected under the null hypothesis of no activity difference between conditions. The logic behind these tests is that a rejection of the null hypothesis for a region is evidence for a difference in activity in that region.

Neuroimaging designs are often similar to cognitive psychology designs, but the dimension of the response variable is much larger and the nature of the response has different statistical properties. However, this is not crucial for the application of mixed effects models. In fact, it shows the technique can scale to problems that involve very large datasets.

A prototypical case of a fixed effects analysis in fMRI would test whether an image contrast is statistically significant within a single subject over trials. This would be analogous to a psychophysics experiment using only a few participants, or a patient case study. For random effects analysis the parameters calculated from the single participants are used in a mixed model to test whether a contrast is significant over participants, in order to test whether the contrast reflects a difference in the population from which the participants were sampled. This is analogous to how cognitive psychology experimenters treat mean RTs, for example. A common analysis strategy is to calculate a single parameter for each participant

in an RT study, and then analyze this data in (what in the neuroimaging community is called) a random effects analysis.

The estimation methods used to calculate the statistical parameters of these models include Maximum Likelihood or Restricted Maximum Likelihood, just as in the application of the multilevel models used in education research described earlier. One reason that these techniques are used is to account for correlation between successive measurements in the imaging time series. These corrections are similar to corrections familiar to psychologists for non-sphericity (Greenhouse & Geisser, 1958).

Similar analysis concerns are present within electrophysiology. In the past, journal policy in psychophysiological research has dealt with the problems posed by repeated measures experimental designs by suggesting that researchers adopt statistical procedures that take into account the correlated data obtained from these designs (Jennings, 1987; Vasey & Thayer, 1987). Mixed effects models are less commonly applied in psychophysiological research, as the most common techniques are the traditional univariate ANOVA with adjustments or multivariate ANOVA (Dien & Santuzzi, 2004), but some researchers have advocated them to deal with repeated measures data. For example, Bagiella, Sloan, and Heitjan (2000) suggest that mixed effects models have advantages over more traditional techniques for EEG data analysis.

The current practice of psychophysiologists and neuroimaging researchers typically ignores the issue of whether linguistic materials should be modeled with fixed or random effect models. Thus, while there are techniques available for modeling stimuli as random effects, it is not yet current practice in neuroimaging and psychophysiology to do so. This represents a tremendous opportunity for methodological development in language-related imaging experiments, as psycholinguists have considerable experience in modeling stimulus characteristics.

Cognitive psychologists and neuroscientists might reasonably assume that the language-as-a-fixed-effect debate is only a concern when linguistic materials are used, given that most discussion to date has taken place in the context of linguistically-motivated experiments. This assumption is too narrow, however, because naturalistic stimuli from many domains are drawn from populations.

Consider a researcher interested in the electrophysiology of face perception. She designs an experiment to test whether an ERP component such as the N170 in response to faces has a different amplitude in one of two face conditions, normal and scrambled form. She obtains a set of images from a database, arranges them according to her experimental design, and proceeds to present each picture in a face-detection EEG experiment, analogous to the way that a psycholinguist would present words and non-words to a participant in a lexical decision experiment. The images presented in this experiment would be a sample of all possible human faces. It is not controversial that human participants are to be modeled as a random variable in psychological experiments. Pictures of human faces are images of a random variable, presented as stimuli. Thus, it should be no source of controversy that naturalistic face stimuli are also a random variable, and should be modeled as a random effect, just like participants. For the sake of consistency, if human participants, faces, and speech are to be considered random variables, then objects, artifacts, and scenes might just as well be considered random variables (also pointed out by Raaijmakers, 2003).

Any naturalistic stimulus which is a member of a population of stimuli that has not been exhaustively sampled should be considered a random variable for the purposes of an experiment. Note that random in this sense means STOCHASTIC, a variable subject to probabilistic variation, rather than randomly sampled. A random sample is one method to draw samples from a population and assign them to experimental conditions. However, stimuli may have stochastic characteristics whether or not they are randomly sampled. Participants have stochastic characteristics as well, whether they are randomly sampled or not. Therefore, the present debate about the best way to model random effects of stimuli is wider than has previously been appreciated, and should be seen as part of the debate over the use of naturalistic stimuli in sensory neurophysiology as well (Felsen & Yang, 2005; Rust & Movshon, 2005).

Concluding remarks

We have described the advantages that mixed-effects models with crossed random effects for subject and item offer to the analysis of experimental data.

The most important advantage of mixed-effects models is that they allow the researcher to simultaneously consider all factors that potentially contribute to the understanding of the structure of the data. These factors comprise not only standard fixed-effects factors typically manipulated in psycholinguistic experiments, but also covariates bound to the items (e.g., frequency, complexity) and the subjects (e.g., age, sex). Furthermore, local dependencies between the successive trials in an experiment can be brought into the model, and the effects of prior exposure to related or identical stimuli (as in long-distance priming) can be taken into account as well. (For applications in eye-movement research, see Kliegl, 2007, and Kliegl, Risse, & Laubrock, 2007.) Mixed-effects models may offer substantially enhanced insight into how subjects are performing in the course of an experiment, for instance, whether they are adjust-

ing their behavior as the experiment proceeds to optimize performance. Procedures requiring prior averaging across subjects or items, or procedures that are limited to strictly factorial designs, cannot provide the researcher with the analytical depth typically provided by a mixed-effects analysis.

For data with not too small numbers of observations, mixed-effects models may provide modest enhanced power, as illustrated for a Latin Square design in the present study. For regression and analysis of covariance, mixed-effects modeling protects against inflated significance for data sets with significant by-item random effects structure. Other advantages of mixed-effects modeling that we have mentioned only in passing are the principled way in which non-independence (asphericity) is handled through the variance-covariance structure of the model, and the provision of shrinkage estimates for the by-subject and by-item adjustments to intercepts and slopes, which allows enhanced precision in prediction.

An important property of mixed-effects modeling is that it is possible to fit models to large, unbalanced data sets. This allows researchers not only to investigate data elicited under controlled experimental conditions, but also to study naturalistic data, such as corpora of eye-movement data. Markov chain Monte Carlo sampling from the posterior distribution of the parameters is an efficient technique to evaluate fitted models with respect to the stability of their parameters, and to distinguish robust parameters (with narrow highest posterior density intervals) from superfluous parameters (with very broad density intervals).

Mixed-effects modeling is a highly active research field. Well-established algorithms and techniques for parameter estimation are now widely available. One question that is still hotly debated is the appropriate number of degrees of freedom for the fixed-effects factors. Different software packages make use of, or even offer, different choices. We have emphasized the importance of Markov chain Monte Carlo sampling as a fast and efficient way (compared to, e.g., the bootstrap) to evaluate a model's parameters. In our experience, p-values based on MCMC sampling and p-values based on the upper bound of the degrees of freedom tend to be very similar for all but the smallest samples.

An important goal driving the development of the lme4 package in R, the software that we have introduced and advocated here, is to make it possible to deal realistically with the parameters of models fit to large, unbalanced data sets. Bates (2007a) provides an example of a data set with about 1.7 million observations, 55,000 subjects (distinct students at a major university over a 5-year period) and 7,900 items (instructors). The data are unbalanced and the subject and item factors are partially crossed. Fitting a simple model with random effects for subject and for item took only about an hour on a fast server computer with substantial memory. Thanks to the possibility of handling very large data sets, we anticipate mixed-effects modeling to become increasingly important for improved modeling of spatial and temporal dependencies in neuroimaging studies, as well as for the study of naturalistic corpus-based data in chronometric tasks and eye-movement research. In short, mixed-effects modeling is emerging not only as a useful but also as an actually useable tool for coming to a comprehensive understanding of the quantitative structure of highly complex data sets.

A note on parameter estimation

While the mathematical details of model fitting with mixed-effects models are beyond the scope of the present paper (see Bates, 2007, for an introduction), we note here that fitting the model involves finding the right balance between the complexity of the model and faithfulness to the data. Model complexity is determined primarily by the parameters that we invest in the random effects structure, basically the parameters that define the relative variance-covariance matrix Σ in Eq. (10). Interestingly, the profiled deviance function, which is negative twice the log-likelihood of model (10) evaluated at Σ, β̂ and σ̂² for a given set of parameters, can be estimated without having to solve for β̂ or b̂. The profiled deviance function has two components, one that measures model complexity and one that measures fidelity of the fitted values to the observed data. This is illustrated in Fig. 2.

Each panel has the relative standard deviation of the item random effect (i.e., σi/σ) on the horizontal axis, and the relative standard deviation of the subject random effect (σs/σ) on the vertical axis. First consider the rightmost panel. As we allow these two relative standard deviations to increase, the fidelity to the data increases and the deviance (the logarithm of the penalized residual sum of squares) decreases. In the contour plot, darker shades of grey represent greater fidelity and decreased deviance, and it is easy to see that a better fit is obtained for higher values of the item and subject relative standard deviations. However, increasing these relative standard deviations leads to a model that is more complex.² This is shown in the middle panel, which plots the contours of the model complexity, the logarithm of the determinant of a matrix derived from the random effects matrix Z. Darker shades of grey are now found in the lower left corner, instead of in the upper right corner. The left panel of Fig. 2 shows the compromise between model complexity and fidelity to the data in the form of the deviance function that is minimized at the maxi-

² The relation between model complexity and the magnitudes of the item and subject relative standard deviations is most easily appreciated by considering the limiting case in which both relative standard deviations are zero. These two parameters can now be removed from the symbolic specification of the model. This reduction in the number of parameters is the familiar index of model simplification.

[Fig. 2: three contour-plot panels, labeled deviance, model complexity, and fidelity.]

Fig. 2. Contours of the profiled deviance as a function of the relative standard deviations of the item random effects (x-axis) and the subject random effects (y-axis). The leftmost panel shows the deviance, the function that is minimized at the maximum likelihood estimates, the middle panel shows the component of the deviance that measures model complexity, and the rightmost panel shows the component of the deviance that measures fidelity of the fitted values to the observed data.
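The trade-off depicted in Fig. 2 can be sketched numerically. The following base-R toy example (our own illustration; all object names are ours, and it uses a single grouping factor with random intercepts only rather than the crossed design discussed above) computes the profiled deviance as the sum of a complexity term, the log-determinant of Λ'Z'ZΛ + I, and a fidelity term based on the penalized residual sum of squares, and minimizes it over the relative standard deviation θ = σs/σ:

```r
set.seed(1)
J <- 20; n_per <- 20; n <- J * n_per
subj <- rep(seq_len(J), each = n_per)
# simulate: grand mean 400, subject sd 50, residual sd 80
y <- 400 + rnorm(J, sd = 50)[subj] + rnorm(n, sd = 80)
Z <- model.matrix(~ 0 + factor(subj))   # random-effects design matrix
X <- matrix(1, n, 1)                    # fixed effect: intercept only

# Profiled deviance for theta = sigma_subj / sigma_resid:
# complexity = log det(theta^2 Z'Z + I); fidelity = penalized RSS,
# obtained by jointly solving for the random and fixed effects
profiled_deviance <- function(theta) {
  ZL <- Z * theta
  A  <- rbind(cbind(ZL, X), cbind(diag(J), matrix(0, J, 1)))
  yy <- c(y, rep(0, J))
  ub <- qr.coef(qr(A), yy)              # penalized least squares solution
  pwrss <- sum((yy - A %*% ub)^2)       # penalized residual sum of squares
  ldL2  <- as.numeric(determinant(crossprod(ZL) + diag(J))$modulus)
  ldL2 + n * (1 + log(2 * pi * pwrss / n))
}

opt <- optimize(profiled_deviance, interval = c(1e-6, 5))
theta_hat <- opt$minimum                # estimate of sigma_subj / sigma_resid
```

For this simulated data set, theta_hat lands near the generating value 50/80 ≈ 0.6; multiplying it by the residual standard deviation recovered from pwrss/n yields the subject standard deviation itself. An optimization of essentially this kind, over all relative variance-covariance parameters simultaneously, is what lmer carries out internally (Bates, 2007).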

mum likelihood estimates. The + symbols in each panel denote Psychology and Psychiatry and Allied Disciplines, 42,
the values of the deviance components at the maximum likeli- 141162.
hood estimates. Box, G. E. P., & Tiao, G. C. (1992). Bayesian inference in
statistical analysis. New York: Wiley.
