MULTILEVEL ANALYSIS
An introduction to basic and advanced multilevel modeling

Tom A. B. Snijders and Roel J. Bosker

SAGE Publications
London · Thousand Oaks · New Delhi

© Tom A. B. Snijders and Roel J. Bosker 1999
First published 1999. Reprinted 2000, 2002, 2003.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, transmitted or utilized in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without permission in writing from the Publishers.

SAGE Publications Ltd, 6 Bonhill Street, London EC2A 4PU
SAGE Publications Inc, 2455 Teller Road, Thousand Oaks, California 91320
SAGE Publications India Pvt Ltd, 32, M-Block Market, Greater Kailash - I, New Delhi 110 048

British Library Cataloguing in Publication data
A catalogue record for this book is available from the British Library
ISBN 0-7619-5880-4
ISBN 0-7619-5890-8 (pbk)
Library of Congress catalog record available

Printed in Great Britain by The Cromwell Press Ltd, Trowbridge, Wiltshire

Contents

Preface

1 Introduction
  1.1 Multilevel analysis
    1.1.1 Probability models
  1.2 This book
    1.2.1 Prerequisites
    1.2.2 Notation

2 Multilevel Theories, Multi-stage Sampling, and Multilevel Models
  2.1 Dependence as a nuisance
  2.2 Dependence as an interesting phenomenon
  2.3 Macro-level, micro-level, and cross-level relations

3 Statistical Treatment of Clustered Data
  3.1 Aggregation
  3.2 Disaggregation
  3.3 The intraclass correlation
    3.3.1 Within-group and between-group variance
    3.3.2 Testing for group differences
  3.4 Design effects in two-stage samples
  3.5 Reliability of aggregated variables
  3.6 Within- and between-group relations
    3.6.1 Regressions
    3.6.2 Correlations
    3.6.3 Estimation of within- and between-group correlations
  3.7 Combination of within-group evidence

4 The Random Intercept Model
  4.1 A regression model: fixed effects only
  4.2 Variable intercepts: fixed or random parameters?
    4.2.1 When to use random coefficient models?
  4.3 Definition of the random intercept model
  4.4 More explanatory variables
  4.5 Within- and between-group regressions
  4.6 Parameter estimation
  4.7 'Estimating' random group effects: posterior means
    4.7.1 Posterior confidence intervals
  4.8 Three-level random intercept models

5 The Hierarchical Linear Model
  5.1 Random slopes
    5.1.1 Heteroscedasticity
    5.1.2 Don't force τ01 to be 0!
    5.1.3 Interpretation of random slope variances
  5.2 Explanation of random intercepts and slopes
    5.2.1 Cross-level interaction effects
    5.2.2 A general formulation of fixed and random parts
  5.3 Specification of random slope models
    5.3.1 Centering variables with random slopes?
  5.4 Estimation
  5.5 Three and more levels

6 Testing and Model Specification
  6.1 Tests for fixed parameters
    6.1.1 Multi-parameter tests for fixed effects
  6.2 Deviance tests
    6.2.1 Halved p-values for variance parameters
  6.3 Other tests for parameters in the random part
  6.4 Model specification
    6.4.1 Working upward from level one
    6.4.2 Joint consideration of level-one and level-two variables
    6.4.3 Concluding remarks about model specification

7 How Much Does the Model Explain?
  7.1 Explained variance
    7.1.1 Negative values of R²?
    7.1.2 Definitions of proportions of explained variance in two-level models
    7.1.3 Explained variance in three-level models
    7.1.4 Explained variance in models with random slopes
  7.2 Components of variance
    7.2.1 Random intercept models
    7.2.2 Random slope models

8 Heteroscedasticity
  8.1 Heteroscedasticity at level one
    8.1.1 Linear variance functions
    8.1.2 Quadratic variance functions
  8.2 Heteroscedasticity at level two

9 Assumptions of the Hierarchical Linear Model
  9.1 Assumptions of the hierarchical linear model
  9.2 Following the logic of the hierarchical linear model
    9.2.1 Include contextual effects
    9.2.2 Check whether variables have random effects
    9.2.3 Explained variance
  9.3 Specification of the fixed part
  9.4 Specification of the random part
    9.4.1 Testing for heteroscedasticity
    9.4.2 What to do in case of heteroscedasticity
  9.5 Inspection of level-one residuals
  9.6 Residuals and influence at level two
    9.6.1 Empirical Bayes residuals
    9.6.2 Influence of level-two units
  9.7 More general distributional assumptions

10 Designing Multilevel Studies
  10.1 Some introductory notes on power
  10.2 Estimating a population mean
  10.3 Measurement of subjects
  10.4 Estimating association between variables
    10.4.1 Cross-level interaction effects
  10.5 Exploring the variance structure
    10.5.1 The intraclass correlation
    10.5.2 Variance parameters

11 Crossed Random Coefficients
  11.1 A two-level model with a crossed random factor
    11.1.1 Random slopes of dummy variables
  11.2 Crossed random effects in three-level models
  11.3 Correlated random coefficients of crossed factors
    11.3.1 Random slopes in a crossed design
    11.3.2 Multiple roles
    11.3.3 Social networks

12 Longitudinal Data
  12.1 Fixed occasions
    12.1.1 The compound symmetry model
    12.1.2 Random slopes
    12.1.3 The fully multivariate model
    12.1.4 Multivariate regression analysis
    12.1.5 Explained variance
  12.2 Variable occasion designs
    12.2.1 Populations of curves
    12.2.2 Random functions
    12.2.3 Explaining the functions
    12.2.4 Changing covariates
  12.3 Autocorrelated residuals

13 Multivariate Multilevel Models
  13.1 The multivariate random intercept model
  13.2 Multivariate random slope models

14 Discrete Dependent Variables
  14.1 Hierarchical generalized linear models
  14.2 Introduction to multilevel logistic regression
    14.2.1 Heterogeneous proportions
    14.2.2 The logit function: Log-odds
    14.2.3 The empty model
    14.2.4 The random intercept model
    14.2.5 Estimation
    14.2.6 Aggregation
    14.2.7 Testing the random intercept
  14.3 Further topics about multilevel logistic regression
    14.3.1 Random slope model
    14.3.2 Representation as a threshold model
    14.3.3 Residual intraclass correlation coefficient
    14.3.4 Explained variance
    14.3.5 Consequences of adding effects to the model
    14.3.6 Bibliographic remarks
  14.4 Ordered categorical variables
  14.5 Multilevel Poisson regression

15 Software
  15.1 Special software for multilevel modeling
    15.1.1 HLM
    15.1.2 MLn / MLwiN
    15.1.3 VARCL
    15.1.4 MIXREG, MIXOR, MIXNO, MIXPREG
  15.2 Modules in general purpose software packages
    15.2.1 SAS, procedure MIXED
    15.2.2 SPSS, command VARCOMP
    15.2.3 BMDP-V modules
    15.2.4 Stata
  15.3 Other multilevel software
    15.3.1 PinT
    15.3.2 Mplus
    15.3.3 MLA
    15.3.4 BUGS

References
Index

Preface

This book grew out of our teaching and consultation activities in the domain of multilevel analysis. It is intended as well for the absolute beginner in this field as for those who have already mastered the fundamentals and are now entering more complicated areas of application. The reader is referred to Section 1.2 for an overview of this book and for some reading guidelines.

We are grateful to various people from whom we got reactions on earlier parts of this manuscript and also to the students who were exposed to it and helped us realize what was unclear. We received useful comments and benefited from discussions about parts of the manuscript with, among others, Joerg Blasius, Marijtje van Duijn, Wolfgang Langer, Ralf Maslowski, and Ian Plewis. Moreover we would like to thank Hennie Brandsma, Mieke Brekelmans, Jan van Damme, Hetty Dekkers, Miranda Lubbers, Lyset Rekers-Mombarg and Jan Maarten Wit, Carolina de Weerth, Beate Volker, Ger van der Werf, and the Zentral Archiv (Cologne), who kindly permitted us to use data from their respective research projects as illustrative material for this book. We would also like to thank Annelies Verstappen-Remmers for her unfailing secretarial assistance.

Tom Snijders
Roel Bosker
1999

1 Introduction

1.1 Multilevel analysis

Multilevel analysis is a methodology for the analysis of data with complex patterns of variability, with a focus on nested sources of variability: e.g., pupils in classes, employees in firms, suspects tried by judges in courts, animals in litters, longitudinal measurements of subjects, etc. In the analysis of such data, it usually is illuminating to take account of the variability associated with each level of nesting. There is variability, e.g., between pupils but also between classes, and one may draw wrong conclusions if either of these sources of variability is ignored. Multilevel analysis is an approach to the analysis of such data including the statistical techniques as well as the methodology of how to use these.
The name of multilevel analysis is used mainly in the social sciences (in the wide sense: sociology, education, psychology, economics, criminology, etc.), but also in other fields such as the bio-medical sciences. Our focus will be on the social sciences.

In its present form, multilevel analysis is a stream which has two tributaries: contextual analysis and mixed effects models.

Contextual analysis is a development in the social sciences which has focused on the effects of the social context on individual behavior. Some landmarks before 1980 were the paper by Robinson (1950) who discussed the ecological fallacy (which refers to confusion between aggregate and individual effects), the paper by Davis, Spaeth, and Huson (1961) about the distinction between within-group and between-group regression, the volume edited by Dogan and Rokkan (1969), and the paper by Burstein, Linn, and Capell (1978) about treating regression intercepts and slopes on one level as outcomes on the higher level.

Mixed effects models are statistical models in the analysis of variance and in regression analysis where it is assumed that some of the coefficients are fixed and others are random. This subject is too vast even to mention some landmarks. The standard reference book on random effects models and mixed effects models is Searle, Casella, and McCulloch (1992), who give an extensive historical overview in their Chapter 2. The name 'mixed model' seems to have been used first by Eisenhart (1947).

Contextual modeling until about 1980 focused on the definition of appropriate variables to be used in ordinary least squares regression analysis. The main focus in the development of statistical procedures for mixed models was until the 1980s on random effects (i.e., random differences between classes in some classification system) more than on random coefficients (i.e., random effects of numerical variables).
Multilevel analysis as we now know it was formed by these two streams coming together. It was realized that in contextual modeling, the individual and the context are distinct sources of variability, which should both be modeled as random influences. On the other hand, statistical methods and algorithms were developed that allowed the practical use of regression-type models with nested random coefficients. There was a cascade of statistical papers: Aitkin, Anderson, and Hinde (1981), Laird and Ware (1982), Mason, Wong, and Entwisle (1983), Goldstein (1986), Aitkin and Longford (1986), Raudenbush and Bryk (1986), De Leeuw and Kreft (1986), and Longford (1987) proposed and developed techniques for calculating estimates for mixed models with nested coefficients. These techniques, together with the programs implementing them which were developed by a number of these researchers or under their supervision, allowed the practical use of models of which until that moment only special cases were accessible for practical use. By 1986 the basis of multilevel analysis was established; many further elaborations have been developed since then, and the methodology has proved to be quite fruitful for applications. On the organizational side, the 'Multilevel Models Project' in London stimulates developments by its Newsletter and its web site http://www.ioe.ac.uk/multilevel/ with the mirror web sites http://www.medent.umontreal.ca/multilevel/ and http://www.edfac.unimelb.edu.au/multilevel/.

In the biomedical sciences mixed models were proposed especially for longitudinal data; in economics mainly for panel data (Swamy, 1971), the most common longitudinal data in economics. One of the issues treated in the economic literature was the pooling of cross-sectional and time series data (e.g., Maddala, 1971, and Hausman and Taylor, 1981), which is closely related to the difference between within-group and between-group regressions.
Overviews are given by Chow (1984) and Baltagi (1995). A more elaborate history of multilevel analysis is presented in the bibliographical sections of Longford (1993a) and in Kreft and de Leeuw (1998). For an extensive bibliography, see Hüttner and van den Eeden (1996).

1.1.1 Probability models

The main statistical model of multilevel analysis is the hierarchical linear model, an extension of the multiple linear regression model to a model that includes nested random coefficients. This model is explained in Chapter 5 and forms the basis of most of this book.

There are several ways to argue why it makes sense to use a probability model for data analysis. In sampling theory a distinction is made between design-based inference and model-based inference (see, e.g., Särndal, Swensson, and Wretman, 1992). The former means that the researcher draws a probability sample from some finite population, and wishes to make inferences from the sample to this finite population. The probability model then follows from how the sample is drawn by the researcher. Model-based inference means that the researcher postulates a probability model, usually aiming at inference to some large and sometimes hypothetical population like all English primary school pupils in the 1990s or all human adults living in a present-day industrialized culture. If the probability model is adequate then so are the inferences based on it, but checking this adequacy is possible only to a limited extent.

It is possible to apply model-based inference to data collected by investigating some entire research population, like all twelve-year-old pupils in Amsterdam at a given moment. Sometimes the question is posed why one should use a probability model if no sample is drawn but an entire population is observed.
Using a probability model that assumes statistical variability, even though an entire research population was investigated, can be justified by realizing that conclusions are sought which apply not only to the investigated research population but to a wider population. The investigated research population is supposed to be representative for this wider population: for pupils also in earlier or later years, in other towns, maybe in other countries. Applicability to such a wider population is not automatic, but has to be carefully argued by considering whether indeed the research population may be considered to be representative for the larger (often vaguely outlined) population. The inference then is not primarily about a given delimited set of individuals but about social, behavioral, biological, etc., mechanisms and processes. The random effects, or residuals, playing a role in such probability models can be regarded as the resultants of the factors that are not included in the explanatory variables used. They reflect the approximating nature of the model used. The model-based inference will be adequate to the extent that the assumptions of the probability model are an adequate reflection of the effects that are not explicitly included by means of observed variables.

As we shall see in Chapters 3, 4, and 6, the basic idea of multilevel analysis is that data sets with a nesting structure that includes unexplained variability at each level of nesting, such as pupils in classes or employees in firms, are usually not adequately represented by the probability model of multiple linear regression analysis, but are often adequately represented by the hierarchical linear model. Thus, the use of the hierarchical linear model in multilevel analysis is in the tradition of model-based inference.

1.2 This book

This book is meant as an introductory textbook and as a reference book for practical users of multilevel analysis.
We have tried to include all the main points that come up when applying multilevel analysis. Some of the data sets used in the examples, and corresponding commands to run the examples in the computer packages MLn/MLwiN and HLM (see Chapter 15), are available at the web site http://stat.gamma.rug.nl/snijders/multilevel.htm.

After this introductory chapter, the book proceeds with a conceptual chapter about multilevel questions and a chapter about ways for treating multilevel data that are not based on the hierarchical linear model. Chapters 4 to 6 treat the basic conceptual ideas of the hierarchical linear model, and how to work with it in practice. Chapter 4 introduces the random intercept model as the primary example of the hierarchical linear model. This is extended in Chapter 5 to random slope models. Chapters 4 and 5 focus on understanding the hierarchical linear model and its parameters, paying only very limited attention to procedures and algorithms for parameter estimation (estimation being work that most researchers delegate to the computer). Testing parameters and specifying a multilevel model is the topic of Chapter 6.

An introductory course on multilevel analysis could cover Chapters 1 to 6 and Section 7.1, with selected material from other chapters. A minimal course would focus on Chapters 4 to 6. The later chapters are about topics which are more specialized or more advanced, but important in the practice of multilevel analysis.

The text of this book is not based on a particular computer program for multilevel analysis. The last chapter, 15, gives a brief review of programs that can be used for multilevel analysis and makes the link (to the extent that this is still necessary) between the terminology used in these programs and the terminology of the book.
Chapters 7 (about the explanatory power of the model) and 9 (about model assumptions) are important for the interpretation of results of statistical analyses using the hierarchical linear model. Chapter 10 helps the researcher in setting up a multilevel study, and in choosing sample sizes at the various levels.

Chapters 8, and 11 to 14, treat various extensions of the basic hierarchical linear model that are useful in practical research. The topic of Chapter 8, heteroscedasticity (non-constant residual variances), may seem rather specialized. Modeling heteroscedasticity, however, can be very useful. It also allows model checks and model modifications that are used in Chapter 9. Chapter 11 treats crossed random coefficients, a model ingredient which strictly speaking is outside the domain of multilevel models, but which is practically important and can be implemented in currently available multilevel software. Chapter 12 is about longitudinal data, with a fixed occasion design (i.e., repeated measures data) as well as those with a variable occasion design. This chapter indicates how the flexibility of the multilevel model gives important opportunities for data analysis (e.g., for incomplete multivariate or longitudinal data) that were unavailable earlier. Chapter 13 is about multilevel analysis for multivariate dependent variables. Chapter 14 describes possibilities of multilevel modeling for dichotomous, ordinal, and frequency data.

If additional textbooks are sought, one could consider Hox (1994) and Kreft and de Leeuw (1998), good introductions; Bryk and Raudenbush (1992), an elaborate treatment of the hierarchical linear model; and Longford (1993a) and Goldstein (1995) for more of the mathematical background.

1.2.1 Prerequisites

For reading this textbook, it is required that you have a good working knowledge of statistics.
It is assumed that you know the concepts of probability, random variable, probability distribution, population, sample, statistical independence, expectation (= population mean), variance, covariance, correlation, standard deviation, and standard error. Further it is assumed that you know the basics of hypothesis testing and multiple regression analysis, and that you can understand formulae of the kind that occur in the explanation of regression analysis. Matrix notation is used only in a few more advanced sections. These sections can be skipped without loss of understanding of other parts of the book.

1.2.2 Notation

The main notational conventions are as follows. Abstract variables and random variables are denoted by italicized capital letters, like X or Y. Outcomes of random variables and other fixed values are denoted by italicized small letters, like x or y. Thus we speak about the variable X, but in formulae where the value of this variable is considered as a fixed, non-random value, it will be denoted x. There are some exceptions to this, e.g., in Chapter 2, and the use of the letter N for the number of groups (level-two units) in the data.

The letter E is used to denote the expected value, or population average, of a random variable. Thus, EY and E(Y) denote the expected value of Y. For example, if P_n is the fraction of tails obtained in n coin flips, and the coin is fair, then the expected value is E P_n = 1/2.

Statistical parameters are indicated by Greek letters. Examples are μ, σ², and β. The following Greek letters are used: alpha, beta, gamma, delta, eta, theta, lambda, pi, rho, sigma, tau, phi, chi, omega, capital Sigma, capital Tau.

2 Multilevel Theories, Multi-stage Sampling, and Multilevel Models

In many cases simple random sampling is not a very cost-efficient strategy, and multi-stage samples may be more efficient instead. In that case the clustering of the data is, in the phase of data analysis, a nuisance which should be taken into consideration.
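The coin-flip illustration of an expected value can be checked by simulation. The following short Python sketch is our addition, not part of the book; all names in it are our own. It approximates E P_n = 1/2 by averaging the fraction of tails over many replications of n = 100 flips.

```python
import random

def fraction_tails(n, rng):
    """Fraction of tails in n flips of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n)) / n

rng = random.Random(42)

# Law of large numbers: the average of P_n over many independent
# replications approaches the expected value E P_n = 1/2.
mean_fraction = sum(fraction_tails(100, rng) for _ in range(2000)) / 2000
print(round(mean_fraction, 2))  # prints 0.5
```

Any single replication of P_100 varies around 0.5 with standard deviation 0.05; averaging 2000 replications shrinks that variability by a factor of about 45, which is why the printed average matches the expectation so closely.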
In many other situations, however, multi-stage samples are employed because one is interested in relations between variables at different layers in a hierarchical system. In this case the dependency of observations within groups is of focal interest, because it reflects that groups differ in certain respects. In either case, the use of single-level statistical models is no longer valid. The fallacies to which their use can lead are described in the next chapter.

2.1 Dependence as a nuisance

From textbooks on statistics it is learned that, as a standard situation, observations should be sampled independently from each other. The standard sampling design to which statistical models are linked accordingly is simple random sampling with replacement from an infinite population: the result of one selection is independent of the result of any other selection, and the chances of selecting a certain single unit are constant (and known) across all units in the population. Textbooks on sampling, however, make clear that there are more cost-efficient sampling designs, based on the idea that probabilities of selection should be known but do not have to be constant. One of those cost-efficient sampling designs is the multi-stage sample: the population of interest consists of subpopulations, and selection takes place via those subpopulations. If there is only one subpopulation level, the design is a two-stage sample. Pupils, for instance, are grouped in schools, so the population of pupils consists of subpopulations of schools that contain pupils. Other examples are: families in neighborhoods, teeth in jawbones, animals in litters, employees in firms, children in families, etc.
In a random two-stage sample, a random sample of the primary units (schools, neighborhoods, jawbones, litters, firms, families) is taken in the first stage, and then the secondary units (pupils, families, teeth, animals, employees, children) are sampled at random from the selected primary units in the second stage. A common mistake in research is to ignore the fact that the sampling scheme was a two-stage one, and to pretend that the secondary units were selected independently. The mistake in this case would be that the researcher overlooks the fact that the secondary units were not sampled independently from each other: having selected a primary unit (a school, for example) increases the chances of selection of secondary units (pupils, for example) from that primary unit. Stated otherwise: the multi-stage sampling design leads to dependent observations.

The multi-stage sampling design can be graphically depicted as in Figure 2.1. In this figure we see a population that consists of 10 subpopulations, each containing 10 micro-units. A sample of 25 units is taken by randomly selecting 5 out of 10 subpopulations and, within these, again at random of course, 5 out of 10 micro-units.

[Figure 2.1 Multi-stage sampling. Legend: filled circle = selected unit; open circle = not selected unit.]

Multi-stage samples are preferred in practice, because the costs of interviewing or testing persons are reduced enormously if these persons are geographically or organizationally grouped. It is cheaper to travel to 100 neighborhoods and interview ten persons per neighborhood on their political preferences than to travel to 1,000 neighborhoods and interview one person per neighborhood. In the next chapters we will see how we can make adjustments to deal with these dependencies.
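The two-stage selection shown in Figure 2.1 can be sketched in a few lines of Python. This is our illustration, not the book's; the sizes follow the figure: 5 out of 10 subpopulations are drawn, and 5 out of 10 micro-units within each of them.

```python
import random

rng = random.Random(1)

# The population of Figure 2.1: 10 subpopulations, each with 10 micro-units,
# identified here by pairs (subpopulation, micro-unit).
population = {g: list(range(10)) for g in range(10)}

# Stage 1: randomly select 5 of the 10 primary units (e.g., schools).
primary = rng.sample(sorted(population), 5)

# Stage 2: within each selected primary unit, randomly select
# 5 of its 10 secondary units (e.g., pupils).
sample = [(g, i) for g in primary for i in rng.sample(population[g], 5)]

print(len(sample))                  # 25 micro-units in total ...
print(len({g for g, _ in sample}))  # ... but drawn from only 5 subpopulations
```

Each micro-unit has selection probability 1/2 × 1/2 = 1/4, but the selections are not independent: two pupils from the same school either both survive stage one or are both excluded by it, which is exactly the dependence described above.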
2.2 Dependence as an interesting phenomenon

Questions that we seek to answer may be: do employees in multinationals earn more than employees in other firms? Or: is there a relation between the performance of pupils and the experience of their teacher? Or: is the sentence differential between black and white suspects different between judges, and if so, can we find characteristics of judges to which this sentence differential is related? In this case a variable is defined at the primary unit level (firms, teachers, judges) as well as on the secondary unit level (employees, pupils, cases). From here on we will refer to primary units as macro-level units (or macro-units for short) and to the secondary units as micro-level units (or micro-units for short). Moreover, for the time being, we will restrict ourselves to the two-level case, and thus to two-stage samples only. In Table 2.1 a summary of the terminology is given. Examples of macro-units and the micro-units nested within them are presented in Table 2.2.

Table 2.1 Summary of terms to describe units at either level in the two-level case.

  macro-units        micro-units
  primary units      secondary units
                     elementary units
  level-two units    level-one units

Table 2.2 Some examples of units at the macro and micro level.

  macro-level      micro-level
  schools          teachers
  classes          pupils
  neighborhoods    families
  firms            employees
  jawbones         teeth
  families         children
  litters          animals
  doctors          patients
  subjects         measurements
  interviewers     respondents
  judges           suspects

Most of the examples presented in the table have been dealt with in the text already. It is important to note that what is defined as a macro-unit and a micro-unit, respectively, depends on the theory at hand. Teachers are nested within schools, if we study organizational effects on teacher burn-out: then teachers are the micro-units and schools the macro-units.
But when studying teacher effects on student achievement, teachers are the macro-units and students the micro-units. The same goes, mutatis mutandis, for neighborhoods and families (e.g., when studying the effects of housing conditions on marital problems), and for families and children (e.g., when studying effects of income on educational performance of siblings). In all these instances the dependency of the observations on the micro-units within the macro-units is of focal interest. If we stick to the example of schools and pupils, then the dependency (e.g., in mathematics achievement of pupils within a school) may stem from:

1. pupils within a school sharing the same school environment;
2. pupils within a school sharing the same teachers;
3. pupils within a school affecting each other by direct communication or shared group norms;
4. pupils within a school coming from the same neighborhood.

The more the achievement levels of pupils within a school are alike (as compared to pupils from other schools), the more likely it is that causes for the achievement have to do with the organizational unit (in this case: the school). Absence of dependency in this case implies absence of institutional effects on individual performance.

A special kind of nesting is defined by longitudinal data, represented in Table 2.2 as 'measurements within subjects'. The measurement occasions here are the micro-units and the subjects the macro-units. The dependence of the different measurements for a given subject is of primary importance in longitudinal data, but the following section about relations between variables defined at either level is not directly intended for the nesting structure defined by longitudinal data. Because of the special nature of this nesting structure, a separate chapter (Chapter 12) is devoted to it.
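The degree to which pupils within the same school are alike is quantified by the intraclass correlation coefficient, which is treated formally in Chapter 3. As a preview, the following Python sketch (ours, not the book's; the function and parameter names are our own) simulates pupils nested in schools and applies the standard one-way ANOVA estimator for balanced groups, (MSB − MSW) / (MSB + (n − 1) MSW).

```python
import random

def icc_anova(groups):
    """One-way ANOVA estimate of the intraclass correlation,
    assuming n observations in each of N groups (balanced design)."""
    N, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (N * n)
    means = [sum(g) / n for g in groups]
    # Between-group and within-group mean squares.
    msb = n * sum((m - grand) ** 2 for m in means) / (N - 1)
    msw = sum((y - m) ** 2
              for g, m in zip(groups, means) for y in g) / (N * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

rng = random.Random(7)

# 100 schools of 20 pupils each: a shared school effect with variance 1
# plus pupil-level noise with variance 4 gives a true intraclass
# correlation of 1 / (1 + 4) = 0.2.
schools = []
for _ in range(100):
    u = rng.gauss(0, 1)
    schools.append([u + rng.gauss(0, 2) for _ in range(20)])

print(icc_anova(schools))  # close to the true value 0.2
```

A value near 0 would indicate that schools are interchangeable collections of pupils; the larger the value, the stronger the case that something at the school level (environment, teachers, norms, neighborhood) shapes achievement.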
2.3 Macro-level, micro-level, and cross-level relations

For the study of hierarchical, or multilevel, systems having two distinct layers, Tacq (1986) distinguished between three kinds of propositions: about micro-units (e.g., 'employees have on average 4 effective working hours per day'; 'boys lag behind girls in reading comprehension'), about macro-units (e.g., 'schools have on average a budget of $20,000 to spend on resources'; 'in neighborhoods with bad housing conditions crime rates are above average'), or about macro-micro relations (e.g., 'if firms have a salary bonus system, employees will have increased productivity'; 'a child suffering from a broken family situation will affect classroom climate').

Multilevel statistical models are always needed if a multi-stage sampling design has been employed. The use of such a sampling design is quite obvious if we are interested in macro-micro relations, less obvious (but often necessary from a cost-effectiveness point of view) if micro-level propositions are our primary concern, and hardly obvious (but sometimes still applicable) if macro-level propositions are what we are focusing on. These three instances will be dealt with in the sequel of this chapter. To facilitate comprehension, following Tacq (1986) we use figures with the following conventions:

  a dotted line indicates that there are two levels; below the line is the micro-level, above the line is the macro-level;
  macro-level variables are denoted by capitals;
  micro-level variables are denoted by lower case letters;
  arrows denote presumed causal relations.

Multilevel propositions

Multilevel propositions can be represented as in Figure 2.2.
[Figure 2.2 The structure of a multilevel proposition.]

In this example we are interested in the effect of the macro-level variable Z (e.g., teacher efficacy) on the micro-level variable y (e.g., pupil motivation), controlling for the micro-level variable x (e.g., pupil aptitude).

Micro-level propositions

Micro-level propositions are of the form indicated in Figure 2.3.

[Figure 2.3 The structure of a micro-level proposition.]

In this case the line indicates that there is a macro-level that is not referred to in the hypothesis that is put to the test, but that is used in the sampling design in the first stage. In assessing the strength of the relation between occupational status and income, for instance, respondents may have been selected for face-to-face interviews per zip-code area. This then may cause dependency (as a nuisance) in the data.

Macro-level propositions

Macro-level propositions are of the form of Figure 2.4.

[Figure 2.4 The structure of a macro-level proposition.]

The line separating the macro-level from the micro-level seems to be superfluous here. When investigating the relation between long-range strategic planning policy of firms and their profits, there is no multilevel situation, and a simple random sample may have been taken. When either or both variables are not directly observable, however, and have to be measured at the micro-level (e.g., organizational climate measured as the average satisfaction of employees), then a two-stage sample is needed nevertheless. This is the case a fortiori for variables defined as aggregates of micro-level variables (e.g., the crime rate in a neighborhood).

Macro-micro relations

The most common situation in social research is that macro-level variables are supposed to have a relation with micro-level variables. There are three obvious instances of macro-to-micro relations, all of which are typical examples of the multilevel situation.
Figure 2.5 The structure of macro-micro propositions.

The first case is the macro-to-micro proposition. The more explicit the religious norms in social networks, for example, the more conservative the views that individuals have on contraception. The second proposition is a special case of this. It refers to the case where there is a relation between Z and y, given that the effect of x on y is taken into account. The example given may be modified to: 'for individuals of a given educational level'. The last case in the figure is the macro-micro interaction, also known as the cross-level interaction: the relation between x and y is dependent on Z. Or, stated otherwise: the relation between Z and y is dependent on x. The effect of aptitude on achievement, for instance, may be small in case of ability grouping of pupils within classrooms but large in ungrouped classrooms.

Next to these three situations there is the so-called emergent, or micro-macro, proposition (Figure 2.6).

Figure 2.6 The structure of a micro-macro proposition.

In this case, a micro-level variable x affects a macro-level variable Z (student achievement may affect a teacher's stress experience). There are of course combinations of the various examples given. Figure 2.7 contains a causal chain that explains through which micro-variables there is an association between the macro-level variables W and Z (cf. Coleman, 1990).

Figure 2.7 A causal macro-micro-micro-macro chain.

An example of this chain: why do the qualities of a football coach affect his social prestige? The reason is that good coaches are capable of motivating their players, thus leading the players to good performance, thus to winning games, and this of course leads to more social prestige for the coach. Another instance of a complex multilevel proposition is the contextual effects proposition.
An example: 'low socio-economic status pupils achieve less in classrooms with a low average aptitude'. This is also a cross-level interaction effect, but the macro-level variable, average aptitude in the classroom, now is an aggregate of a micro-level variable. In the next chapters the statistical tools to handle multilevel structures will be introduced for outcome variables defined at the micro-level.

3 Statistical Treatment of Clustered Data

Before proceeding in the next chapters to explain ways for the statistical modeling of data with a multilevel structure, we focus attention in this chapter on the question: what will happen if we ignore the multilevel structure of the data? Are there any instances where one may proceed with single-level statistical models although the data stem from a multi-stage sampling design? What kind of errors may occur when this is done?

Next to this, we present some statistical methods for multilevel data that do not use the hierarchical linear model (which receives ample treatment in following chapters). First, we describe the intraclass correlation coefficient, a basic measure for the degree of dependency in clustered observations. Second, some simple statistics (mean, standard error of the mean, variance, correlation, reliability of aggregates) are treated for two-stage sampling designs. The relations are spelled out between within-group, between-group, and total regressions; and similarly for correlations. Finally, we mention some simple methods for combining evidence within groups into an overall test.

3.1 Aggregation

A common procedure in social research with two-level data is to aggregate the micro-level data to the macro-level.
The simplest way to do this is to work with the averages for each macro-unit.

There is nothing wrong with aggregation in cases where the researcher is only interested in macro-level propositions, although it should be borne in mind that the reliability of an aggregated variable depends, among others, on the number of micro-level units in a macro-level unit (see later in this chapter), and thus will be larger for the larger macro-units than for the smaller ones. In cases where the researcher is interested in macro-micro or micro-level propositions, however, aggregation may result in gross errors.

The first potential error is the 'shift of meaning' (cf. Hüttner, 1981). A variable that is aggregated to the macro level refers to the macro-units, not directly to the micro-units. The firm average of a rating of employees on their working conditions, e.g., may be used as an index for 'organizational climate'. This variable refers to the firm, not directly to the employees.

The second potential error with aggregation is the ecological fallacy (Robinson, 1950). A correlation between macro-level variables cannot be used to make assertions about micro-level relations. The percentage of black inhabitants in a neighborhood could be related to average political views in the neighborhood, e.g., the higher the percentage of blacks in a neighborhood, the higher might be the proportion of people with extreme right-wing political views. This, however, does not give us any clue about the micro-level relation between race and political conviction. (The shift of meaning plays a role here, too. The percentage of black inhabitants is a variable that means something for the neighborhood, and this meaning is distinct from the meaning of ethnicity as an individual-level variable.) The ecological and other related fallacies are extensively discussed by Alker (1969).
The third potential error is the neglect of the original data structure, especially when some kind of analysis of covariance is to be used. Suppose one is interested in assessing between-school differences in pupil achievement after correcting for intake differences, and that Figure 3.1 depicts the true situation. The figure depicts the situation for five groups, for each of which we have five observations. The groups are indicated by the shapes ○, ×, +, □, and •. The five group means are indicated by *.

Figure 3.1 Micro-level versus macro-level adjustments. (X, Y) values for five groups indicated by •, ○, +, ×, □; group averages by *.

Now suppose the question is: do the differences between the groups on the variable Y, after adjusting for differences on the variable X, have a substantial size? The micro-level approach, which adjusts for the within-group regression of Y on X, will lead to the regression line that has a positive slope. In this picture, the micro-units from the group that have the ○-symbol are all above the line, whereas the micro-units from the •-group are all under the regression line. The micro-level regression approach thus will lead us to conclude that the five groups do differ, given that an adjustment for X has been made.

Now suppose that we would aggregate the data, and regress the average Ȳ on the average X̄. The averages are depicted by *. This situation is represented in the graph by the regression line with a negative slope. The averages of all groups are almost perfectly on the regression line (the observed average Ȳ can almost perfectly be predicted from the observed average X̄), thus leading us to the conclusion that there are almost no differences between the five groups after adjusting for the average X̄. Although the situation depicted in the graph is an idealized example, it clearly shows that working with aggregate data 'is dangerous at best, and disastrous at worst' (Aitkin and Longford, 1986, p. 42).
When analysing multilevel data, without aggregation, the problem described in this paragraph can be dealt with by distinguishing between the within-group and the between-group regressions. This is worked out in Sections 3.6, 4.5, and 9.2.1.

The last objection against aggregation is that it prevents us from examining potential cross-level interaction effects of a specified micro-level variable with an as yet unspecified macro-level variable. Having aggregated the data to the macro-level, one cannot examine relations like: is the sentence differential between black and white suspects different between judges, when allowance is made for differences in seriousness of crimes? Or, to give another example: is the effect of aptitude on achievement, present in the case of whole-class instruction, smaller or even absent in case of ability grouping of pupils within classrooms?

3.2 Disaggregation

Now suppose that we treat our data at the micro level. There are two situations:
1. we also have a measure of a variable at the macro level, next to the measures at the micro level;
2. we only have measures of micro-level variables.

In situation (1), disaggregation leads to 'the miraculous multiplication of the number of units'. To illustrate what is meant: suppose a researcher is interested in the question whether older judges give more lenient sentences than younger judges. A two-stage sample is taken: in the first stage ten judges are sampled, and in the second stage per judge ten trials are sampled (in total there are thus 10 × 10 = 100 trials). One might disaggregate the data to the level of the trials and estimate the relation between the experience of the judge and the length of the sentence, without taking into account that some trials involve the same judge. This is like pretending that there are 100 independent observations, whereas in actual fact there are only 10 independent observations (the 10 judges).
This shows that disaggregation and treating the data as if they are independent implies that the sample size is dramatically exaggerated. For the study of between-group differences, disaggregation often leads to serious risks of committing type I errors (asserting on the basis of the observations that there is a difference between older and younger judges whereas in the population of judges there is no such relation). On the other hand, for studying within-group differences, disaggregation often leads to unnecessarily conservative tests (i.e., too low type I error probabilities); this is elaborated in Moerbeek et al. (1997).

If only measures are taken at the micro level, analysing the data at the micro level is a correct way to proceed, as long as one takes into account that observations within a macro-unit may be correlated. In sampling theory, this phenomenon is known as the design effect for two-stage samples. If one wants to estimate the average management capability of young managers, while in the first stage a limited number of organizations (say 10) are selected and within each organization five managers are sampled, one runs the risk of making an error if (as is usually the case) there are systematic differences between organizations. In general, two-stage sampling leads to the situation that the 'effective' sample size that should be used to calculate standard errors is less than the total number of cases, the latter being given here by the 50 managers. The formula will be presented in one of the next sections.

Starting with Robinson's (1950) paper about the ecological fallacy, many papers have been written about the possibilities and dangers of cross-level inference, i.e., methods to conclude something about relations between micro-units on the basis of relations between data at the aggregate level, or to conclude something about relations between macro-units on the basis of relations between disaggregated data.
A concise discussion and many references are given by Pedhazur (1982, Chapter 13) and by Aitkin and Longford (1986). Our conclusion is that if the macro-units have any meaningful relation with the phenomenon under study, analysing only aggregated or only disaggregated data is apt to lead to misleading and erroneous conclusions. A multilevel approach, in which within-group and between-group relations are combined, is more difficult but much more productive. This approach requires, however, to specify assumptions about the way in which macro- and micro-effects are put together. The present chapter presents some multilevel procedures that are based on only a minimum of such assumptions (e.g., the additive model of equation (3.1)). The further chapters of this book are based on a more elaborate model, the so-called hierarchical linear model, since about 1990 the most widely accepted basis for multilevel analysis.

3.3 The intraclass correlation

The degree of resemblance between micro-units belonging to the same macro-unit can be expressed by the intraclass correlation coefficient. The term 'class' is conventionally used here and refers to the macro-units in the classification system under consideration. There are, however, several definitions of this coefficient, depending on the assumptions about the sampling design. In this section we assume a two-stage sampling design, and infinite populations at either level. The macro-units will also be referred to as groups.

A relevant model here is the random effects ANOVA model.¹ Indicating by Y_ij the outcome value observed for micro-unit i within macro-unit j, this model can be expressed as

    Y_ij = μ + U_j + R_ij ,    (3.1)

where μ is the population grand mean, U_j is the specific effect of macro-unit j, and R_ij is the residual effect for micro-unit i within this macro-unit. In other words, macro-unit j has the true mean μ
+ U_j, and each measurement of a micro-unit within this macro-unit deviates from this true mean by some value, called R_ij. Units differ randomly from one another, which is reflected by the fact that U_j is a random variable and explains the name 'random effects model'. Some units have a high true mean, corresponding to a high value of U_j, others have a true mean close to average, still others a low true mean. It is assumed that all variables are independent, the group effects U_j having population mean 0 and population variance τ² (the population between-group variance), and the residuals having mean 0 and variance σ² (the population within-group variance). For example, if micro-units are pupils and macro-units are schools, then the within-group variance is the variance within the schools about their true means, while the between-group variance is the variance between the school true means. The total variance of Y_ij is then equal to the sum of these two variances,

    var(Y_ij) = τ² + σ² .

The number of micro-units within the j'th macro-unit is denoted by n_j. The number of macro-units is N, and the total sample size is M = Σ_j n_j. In this situation, the intraclass correlation coefficient ρ_I can be defined as

    ρ_I = (population variance between macro-units) / (total variance) = τ² / (τ² + σ²) .    (3.2)

It is the proportion of variance that is accounted for by the group level. This parameter is called a correlation coefficient, because it is equal to the correlation between values of two randomly drawn micro-units in the same, randomly drawn, macro-unit.

It is important to note that the population variance between macro-units is not directly reflected by the observed variance between the means of the macro-units (the observed between macro-units variance). The reason is that in a two-stage sample, variation between micro-units will also show up as extra observed variance between macro-units.
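This last point can be illustrated by a minimal simulation sketch (not from the book). It generates data from model (3.1) and shows that the observed variance of the group means overshoots the population between-group variance τ² by roughly σ²/n; all parameter values below are arbitrary choices for the illustration.

```python
import random

# Simulate the random effects ANOVA model Y_ij = mu + U_j + R_ij and
# compare the population between-group variance tau2 with the observed
# variance of the group means. Parameter values are illustrative only.
random.seed(1)
mu, tau2, sigma2 = 50.0, 20.0, 80.0   # assumed population parameters
N, n = 2000, 10                       # many groups, 10 micro-units each

group_means = []
for j in range(N):
    U_j = random.gauss(0.0, tau2 ** 0.5)   # group effect, variance tau2
    scores = [mu + U_j + random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    group_means.append(sum(scores) / n)

m = sum(group_means) / N
observed_between = sum((y - m) ** 2 for y in group_means) / (N - 1)

# Typically close to tau2 + sigma2/n = 28.0, clearly above tau2 = 20.0.
print(round(observed_between, 1))
```

With many groups the observed variance of the means settles near τ² + σ²/n rather than τ², which is exactly the bias that the adjustment discussed below corrects.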
It is indicated below how the observed variance between cluster means must be adjusted to yield a good estimator for the population variance between macro-units.

¹ This model is also known in the statistical literature as the one-way random effects ANOVA model and as Eisenhart's Type II ANOVA model. In multilevel modeling it is known as the empty model, which is treated further in Section 4.3.

Example 3.1 Random data.
Suppose we have a series of 100 observations as in the random digits Table 3.1.

Table 3.1 Data grouped into macro-units (random digits from Glass and Stanley, 1970, p. 511).

    j | Scores Y_ij for micro-units (random digits) | Average Ȳ_j
    [10 rows of 10 random digits with their row averages]

The core part of the table contains the random digits. Now suppose that each row in the table is a macro-unit, so that for each macro-unit we have observations on 10 micro-units. The averages of the scores for each macro-unit are in the last column. There seem to be large differences between the randomly constructed macro-units, if we look at the variance in the macro-unit averages (which is 105.7). The total observed variance between the 100 micro-units is 814.0. Suppose the macro-units were schools, the micro-units pupils, and the random digits test scores. According to these two observed variances we might conclude that the schools differ considerably with respect to their average test scores. We know in this case, however, that in 'reality' the macro-units differ only by chance.

The following subsections show how the intraclass correlation can be estimated and tested. For a review of various inference procedures for the intraclass correlation we refer to Donner (1986).
An extensive overview of many methods for estimating and testing the within-group and between-group variances is given by Searle, Casella, and McCulloch (1992).

3.3.1 Within-group and between-group variance

We continue referring to the macro-units as groups. To disentangle the information contained in the data about the population between-group variance and the population within-group variance, we consider the observed variance between groups and the observed variance within groups. These are defined in the following way. The mean of macro-unit j is denoted

    Ȳ_j = (1/n_j) Σ_i Y_ij ,

and the overall mean by Ȳ. The observed variance within group j is given by

    S_j² = (1/(n_j − 1)) Σ_i (Y_ij − Ȳ_j)² .

This number will vary from group to group. To have one parameter that expresses the within-group variability for all groups jointly, one uses the observed within-group variance, or pooled within-group variance. This is a weighted average of the variances within the various macro-units, defined as

    S²_within = (1/(M − N)) Σ_j (n_j − 1) S_j² .    (3.3)

If model (3.1) holds, the expected value of the observed within-group variance is exactly equal to the population within-group variance:

    E S²_within = σ² .    (3.4)

The situation for the between-group variance is a bit more complicated. For equal group sizes n_j, the observed between-group variance is defined as the variance between the group means,

    S²_between = (1/(N − 1)) Σ_j (Ȳ_j − Ȳ)² .    (3.5)

For unequal group sizes, the contributions of the various groups need to be weighted. The following formula uses weights that are useful for estimating the population between-group variance:

    S²_between = (1/(ñ(N − 1))) Σ_j n_j (Ȳ_j − Ȳ)² .    (3.6)

In this formula, ñ is defined by

    ñ = (1/(N − 1)) { M − (Σ_j n_j²)/M } = n̄ − s²(n_j)/(N n̄) ,    (3.7)

where n̄ = M/N is the mean sample size and

    s²(n_j) = (1/(N − 1)) Σ_j (n_j − n̄)²

is the variance of the sample sizes. If all n_j have the same value, then ñ also has this value.
In this case, S²_between is just the variance of the group means, given by (3.5).

It can be shown that the total observed variance is a combination of the within-group and the between-group variances, expressed as follows:

    observed total variance = (1/(M − 1)) Σ_j Σ_i (Y_ij − Ȳ)²
                            = ((M − N)/(M − 1)) S²_within + (ñ(N − 1)/(M − 1)) S²_between .    (3.8)

The complications with respect to the between-group variance arise from the fact that the micro-level residuals R_ij also contribute, although to a minor extent, to the observed between-group variance. Statistical theory shows that

    expected observed variance between = true variance between + expected sampling error variance.

More specifically, the formula (Hays (1988) for the case with constant n_j, and Searle, Casella, and McCulloch (1992, Section 3.6) for the general case) is

    E S²_between = τ² + σ²/ñ ,    (3.9)

which holds provided that model (3.1) is valid. The second term in this formula becomes small when ñ becomes large. Thus, for large group sizes, the expected observed between variance is practically equal to the true between variance. For small group sizes, however, it tends to be larger than the true between variance, due to the random differences that also exist between the group means.

In practice, we do not know the population values of the between and within macro-unit variances. These have to be estimated from the data. One way of estimating these parameters is based on formulae (3.4) and (3.9). From the first it follows that the population within-group variance, σ², can be estimated unbiasedly by the observed within-group variance:

    σ̂² = S²_within .    (3.10)

From the combination of the last two formulae it follows that the population between-group variance, τ², can be estimated unbiasedly by taking the observed between-group variance and subtracting the contribution that the within-group variance makes, on average, according to (3.9), to the observed between-group variance:

    τ̂² = S²_between − S²_within/ñ .    (3.11)

(Another expression is given in (3.14).) This expression can take negative values.
This happens when the differences between group means are smaller than would be expected on the basis of the within-group variability, even if the true between-group variance τ² were 0. In such a case, it is natural to estimate τ² as being 0.

It can be concluded that the split between observed within-group variance and observed between-group variance does not correspond precisely to the split between the within-group and between-group variances in the population: the observed between-group variance reflects the population between-group variance plus a bit of the population within-group variance.

The intraclass correlation is estimated according to formula (3.2) by

    ρ̂_I = τ̂² / (τ̂² + σ̂²) .    (3.12)

(Formula (3.15) gives another, equivalent, expression.) The standard error of this estimator in the case where all group sizes are constant, n_j = n, is given by

    s.e.(ρ̂_I) = (1 − ρ_I)(1 + (n − 1)ρ_I) √( 2 / (n(n − 1)(N − 1)) ) .    (3.13)

This formula was given by Donner (1986, equation (6.1)), who also gives the (quite complicated) formula for the standard error for the case of variable group sizes.

The estimators given above are so-called analysis of variance (ANOVA) estimators. They have the advantage that they can be represented by explicit formulae. Other much used estimators are those produced by the maximum likelihood (ML) and residual maximum likelihood (REML) methods (cf. Section 4.8). For equal group sizes, the ANOVA estimators are the same as the REML estimators (Searle, Casella, and McCulloch, 1992). For unequal group sizes, the ML and REML estimators are slightly more efficient than the ANOVA estimators. Multilevel software can be used to calculate the ML and REML estimates.

Example 3.2 Within- and between-group variability for random data.
For the random digits table of the earlier example the observed between variance is S²_between = 105.7. The observed variance within the macro-units can be computed from formula (3.8).
The observed total variance is known to be 814.0 and the observed between variance is given by 105.7. Solving (3.8) for the observed within variance yields S²_within = (99/90) × (814.0 − (10/11) × 105.7) = 789.7. The estimated true variance within the macro-units then also is σ̂² = 789.7. The estimate for the true between macro-units variance is computed from (3.11) as τ̂² = 105.7 − (789.7/10) = 26.7. Finally, the estimate of the intraclass correlation is ρ̂_I = 26.7/(789.7 + 26.7) = 0.03. Its standard error, computed from (3.13), is 0.06.

3.3.2 Testing for group differences

The intraclass correlation as defined by (3.2) can be zero or positive. A statistical test can be performed to investigate whether a positive value for this coefficient could be attributed to chance. If it may be assumed that the within-group deviations R_ij are normally distributed, one can use an exact test for the hypothesis that the intraclass correlation is 0, which is the same as the null hypothesis that there are no group differences, or that the true between-group variance is 0. This is just the F-test for a group effect in the one-way analysis of variance (ANOVA), which can be found in any textbook on ANOVA. The test statistic can be written as

    F = ñ S²_between / S²_within ,

and it has an F distribution with N − 1 and M − N degrees of freedom if the null hypothesis holds.

Example 3.3 The F-test for the random data set.
For the data of Table 3.1, F = (10 × 105.7)/789.7 = 1.34 with 9 and 90 degrees of freedom. This value is far from significant (p > 0.10). Thus, there is no evidence of true between-group differences.

Statistical computer packages usually give the F statistic and the within-group variance, S²_within. From this output, the estimated population between-group variance can be calculated by

    τ̂² = S²_within (F − 1)/ñ    (3.14)

and the estimated intraclass correlation coefficient by

    ρ̂_I = (F − 1)/(F − 1 + ñ) ,    (3.15)

where ñ is given by (3.7). If F < 1, it is natural to replace both of these expressions by 0.
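The computations of Examples 3.2 and 3.3 can be reproduced from the reported summary statistics alone; the following sketch (ours, not from the book) works through formulas (3.8) and (3.10)-(3.13) and the F statistic.

```python
import math

# Summary statistics reported for the random-digit data of Table 3.1:
# M = 100 observations in N = 10 groups of n = 10 each.
M, N, n = 100, 10, 10
S2_total, S2_between = 814.0, 105.7

# Solve formula (3.8) for the pooled within-group variance.
S2_within = (M - 1) / (M - N) * (S2_total - n * (N - 1) / (M - 1) * S2_between)

sigma2_hat = S2_within                          # (3.10)
tau2_hat = S2_between - S2_within / n           # (3.11)
rho_hat = tau2_hat / (tau2_hat + sigma2_hat)    # (3.12)

# Standard error (3.13), valid for constant group sizes.
se_rho = (1 - rho_hat) * (1 + (n - 1) * rho_hat) \
    * math.sqrt(2 / (n * (n - 1) * (N - 1)))

# F statistic for the test of no group differences (Section 3.3.2).
F = n * S2_between / S2_within

# Reproduces 789.7, 26.7, 0.03, 0.06, and 1.34 from the examples.
print(round(S2_within, 1), round(tau2_hat, 1),
      round(rho_hat, 2), round(se_rho, 2), round(F, 2))
```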
These formulae show that a high value of the F statistic will lead to large estimates for the between-group variance as well as the intraclass correlation, but that the group sizes, as expressed by ñ, moderate the relation between the test statistic and the parameter estimates.

If there are covariates, it often is relevant to test whether there are group differences in addition to those accounted for by the effect of the covariates. This is achieved by the usual F-test for the group effect in an analysis of covariance (ANCOVA). Such a test is relevant because it is possible that the ANOVA F-test does not demonstrate any group effects, but that such effects do emerge when controlling for the covariates (or vice versa). Another check on whether the groups make a difference can be carried out by testing the group-by-covariate interaction effect. These tests can be found in textbooks on ANOVA and ANCOVA, and they are contained in the well-known general purpose statistical computer programs.

So, to test whether a given nesting structure in a data set calls for multilevel analysis, one can use standard techniques from the analysis of variance. In addition to testing for the main group effect, it is also advisable to test for group-by-covariate interactions. If there is neither evidence for main effects nor for interaction effects involving the group structure, then the researcher may leave aside the nesting structure and analyse the data by unilevel methods such as ordinary least squares (OLS) regression analysis. This approach to test for group differences can be taken whenever the number of groups is not too large for the computer program being used. If there are too many groups, however, the program will refuse to do the job. In such a case it will still be possible to carry out the tests for group differences that are treated in the following chapters following the logic of the hierarchical linear model. This will require the use of statistical multilevel software.
3.4 Design effects in two-stage samples

In the design of empirical investigations, the determination of sample sizes is an important decision. For two-stage samples, this is more complicated than for simple ('one-stage') random samples. An elaborate treatment of this question is given in Cochran (1977). This section gives a simple approach to the precision of estimating a population mean, indicating the basic role played by the intraclass correlation. We return to this question in Chapter 10.

Large samples are preferable to increase the precision of parameter estimates, i.e., to obtain tight confidence intervals around the parameter estimates. In a simple random sample the standard error of the mean is related to the sample size by the formula

    standard error of the mean = standard deviation / √(sample size) .    (3.16)

This formula can be used to indicate the required sample size (in a simple random sample) if a given standard error is desired.

When using two-stage samples, however, the clustering of the data should be taken into account when determining the sample size. Let us suppose that all group sizes are equal, n_j = n for all j. The (total) sample size then is Nn. The design effect is a number that indicates how much the sample size in the denominator of (3.16) is to be adjusted because of the sampling design used. It is the ratio of the variance obtained with the given sampling design to the variance obtained for a simple random sample from the same population, supposing that the total sample size is the same. A large design effect implies a relatively large variance, which is a disadvantage that may be offset by the cost reductions implied by the design.
The design effect of a two-stage sample with equal group sizes is given by

    design effect = 1 + (n − 1) ρ_I .    (3.17)

This formula expresses that, from a purely statistical point of view, a two-stage sample becomes less attractive as ρ_I increases (clusters become more homogeneous) and as the group size n increases (the two-stage nature of the sampling design becomes stronger).

Suppose, e.g., we were studying the satisfaction of patients with the treatments by their doctors. Furthermore, let us assume that some doctors have more satisfied patients than others, leading to a ρ_I of 0.30. The researchers used a two-stage sample, since that is far cheaper than selecting patients simply at random. They first randomly selected 100 doctors, from each chosen doctor selected five patients at random, and then interviewed each of these. In this case the design effect is 1 + (5 − 1) × 0.30 = 2.20. When estimating the standard error of the mean, we no longer can treat the observations as independent from each other. The effective sample size, i.e., the equivalent total sample size that we should use in estimating the standard error, is equal to

    N_effective = Nn / design effect ,    (3.18)

in which N is the number of selected macro-units. For our example we find N_effective = (100 × 5)/2.20 = 227. So the two-stage sample with a total of 500 patients here is equivalent to a simple random sample of 227 patients.

One can also derive the total sample size using a two-stage sampling design on the basis of a desired level of precision, assuming that ρ_I is known, and fixing n because of budgetary or time-related considerations. The general rule is: this required sample size increases as ρ_I increases, and it increases with the number of micro-units one wishes to select per macro-unit.
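The doctors-and-patients calculation can be sketched as follows (illustrative code, not from the book; the function names are ours):

```python
def design_effect(n, rho):
    """Design effect of a two-stage sample with equal group sizes, (3.17)."""
    return 1 + (n - 1) * rho

def effective_sample_size(N, n, rho):
    """Equivalent simple-random-sample size for N groups of size n, (3.18)."""
    return N * n / design_effect(n, rho)

deff = design_effect(5, 0.30)                  # 1 + 4 * 0.30 = 2.20
n_eff = effective_sample_size(100, 5, 0.30)    # 500 / 2.20, about 227
print(round(deff, 2), round(n_eff))
```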
Using (3.17) and (3.18), this can be derived numerically from the formula

    N_total = N_srs {1 + (n − 1) ρ_I} .

The quantity N_total in this formula refers to the total desired sample size when using a two-stage sample, whereas N_srs refers to the desired sample size if one would have used a simple random sample. In practice, ρ_I is unknown. However, it often is possible to make an educated guess about it on the basis of earlier research. In Figure 3.2, N_total is graphed as a function of n and ρ_I (0.1, 0.2, 0.4, and 0.8, respectively), taking N_srs = 100 as the desired sample size for an equally informative simple random sample.

Figure 3.2 The total desired sample size in two-stage sampling.

3.5 Reliability of aggregated variables

Reliability, as conceived in psychological test theory (e.g., Lord and Novick, 1968) and in generalizability theory (e.g., Shavelson and Webb, 1991), is closely related to clustered data - although this may not be obvious at first sight. Classical psychological test theory considers a subject (an individual or other observational unit) with a given true score, of which imprecise, or unreliable, observations may be made. The observations can be considered to be nested within the subjects. If there is more than one observation per subject, the data are clustered. Whether there is only one observation or several, equation (3.1) is the model for this situation: the true score of subject j is μ + U_j, and the i'th observation on this subject is Y_ij, with associated measurement error R_ij. If several observations are taken, these can be aggregated to the mean value Ȳ_j, which then is the measurement for the true score of subject j.

The same idea can be used when it is not an individual subject who is to be measured, but some collective entity: a school, a firm, or in general any macro-level unit such as those mentioned in Table 2.2.
For example, when the school climate is measured on the basis of questions posed to pupils of the school, then Y_ij could refer to the answer by pupil i in school j to a given question, and the opinion of the pupils about this school would be measured by the mean value Ȳ_j. In terms of psychological test theory, the micro-level units i are regarded as parallel items for measuring the macro-level unit j. The reliability of a measurement is defined generally as

    reliability = (variance of true scores) / (variance of observed scores) .

It can be proved that this is equal to the correlation between independent replications of measuring the same subject. (This means in the mathematical model that the same value U_j is measured, but with independent realizations of the random error R_ij.) The reliability is indicated by the symbol λ.²

For measurement on the basis of a single observation according to model (3.1), the reliability is just the intraclass correlation coefficient:

    λ_ij = ρ_I .    (3.19)

When several measurements are made for each macro-level unit, these constitute a cluster or group of measurements which are aggregated to the group mean Ȳ_j. To apply to Ȳ_j the general definition of reliability, note that the observed variance is the variance between the observed means Ȳ_j, while the true variance is the variance between the true scores μ + U_j. Therefore the reliability of the aggregate is

    reliability of Ȳ_j = λ_j = (variance between μ + U_j) / (variance between Ȳ_j) .    (3.20)

Example 3.4 Reliability for random data.
If in our previous random digits example the digits represented, e.g., the perceptions by teachers in schools of their working conditions, then the aggregated variable, an indicator for organizational climate, has an estimated reliability of 26.7/105.7 = 0.25. (The population value of this reliability is 0, however, as the data are random, so the true variance is nil.)
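Example 3.4 can be reproduced from the variance estimates of the random-digit example. The helper below (our own naming, not from the book) implements the reliability of a group mean based on n micro-units, τ² / (τ² + σ²/n):

```python
def aggregate_reliability(tau2, sigma2, n):
    """Reliability of a group mean of n micro-units: tau2 / (tau2 + sigma2/n)."""
    return tau2 / (tau2 + sigma2 / n)

# Estimates from the random-digit example: tau2 = 26.7, sigma2 = 789.7, n = 10.
lam = aggregate_reliability(26.7, 789.7, 10)
print(round(lam, 2))   # about 0.25, as in Example 3.4
```

A quick check with larger n confirms that the reliability grows with the number of micro-units per macro-unit, the point made next in the text.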
It can readily be demonstrated that the reliability of aggregated variables increases as the number of micro-units per macro-unit increases, since the true variance of the group mean (with group size n_j) is τ² while the expected observed variance of the group mean is τ² + σ²/n_j. Hence the reliability can be expressed by

    λ_j = τ² / (τ² + σ²/n_j) = n_j ρ_I / (1 + (n_j − 1) ρ_I) .    (3.21)

It is quite clear that if n_j is very large, then λ_j is almost 1. If n_j = 1 we are not able to distinguish between within- and between-group variance. Figure 3.3 presents a graph where the reliability of an aggregate is depicted as a function of n_j (denoted n) and ρ_I (0.1 and 0.4, respectively).

[Figure 3.3 Reliability of aggregated variables.]

3.6 Within- and between-group relations

We saw in Section 3.1 that regressions at the macro-level between aggregated variables X̄ and Ȳ can be completely different from the regressions between the micro-level variables X and Y. This section considers in more detail the interplay between macro-level and micro-level relations between two variables. First the focus is on the regression of Y on X, then on the correlation between X and Y.

The main point of this section is that within-group relations can be, in principle, completely different from between-group relations. This is natural, because the processes at work within groups may be different from the processes at work between groups (see Section 3.1). Total relations, i.e., relations at the micro-level when the clustering into macro-units is disregarded, are mostly a kind of average of the within-group and between-group relations. Therefore it is necessary to consider within- and between-group relations jointly, whenever the clustering of micro-units in macro-units is meaningful for the phenomenon being studied.
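Before turning to within- and between-group regressions, the two formulas discussed in Sections 3.4 and 3.5 — the desired sample size for a two-stage sample and the reliability of an aggregated variable, formula (3.21) — can be sketched in a few lines of code. This is an illustrative sketch; the function names are our own.

```python
def design_effect(n, rho):
    """Design effect of a two-stage sample: deff = 1 + (n - 1) * rho,
    where n is the cluster size and rho the intraclass correlation."""
    return 1 + (n - 1) * rho

def two_stage_sample_size(n_srs, n, rho):
    """Total sample size needed in a two-stage sample to be as
    informative as a simple random sample of size n_srs."""
    return n_srs * design_effect(n, rho)

def aggregate_reliability(n, rho):
    """Reliability of a group mean based on n micro-units, formula (3.21):
    lambda = n * rho / (1 + (n - 1) * rho)."""
    return n * rho / (1 + (n - 1) * rho)

# The setting of Figure 3.2: N_srs = 100, cluster size n = 20
for rho in (0.1, 0.2, 0.4, 0.8):
    print(rho, two_stage_sample_size(100, 20, rho))

# Reliability equals rho_I for n = 1 and approaches 1 for large n
print(aggregate_reliability(1, 0.4), aggregate_reliability(100, 0.4))
```

For example, with ρ_I = 0.1 and clusters of 20, a two-stage sample must contain 100 × (1 + 19 × 0.1) = 290 micro-units to match a simple random sample of 100.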
3.6.1 Regressions

The linear regression of a 'dependent' variable Y on an 'explanatory' or 'independent' variable X is the linear function of X that yields the best³ prediction of Y. When the bivariate distribution of (X, Y) is known and the data structure has only a single level, the expression for this regression function is

    Y = β₀ + β₁ X + R ,

where the regression coefficients are given by

    β₀ = E(Y) − β₁ E(X) ,   β₁ = cov(X, Y) / var(X) .

The constant term β₀ is called the intercept, while β₁ is called the regression coefficient. The term R is the residual or error component, and expresses the part of the dependent variable Y that cannot be approximated by a linear function of X. Recall from Section 1.2.2 that E(X) and E(Y) denote the population means (expected values) of X and Y, respectively.

³ 'Best prediction' means here the prediction that has the smallest mean squared error: the so-called least squares criterion.

In a multilevel data structure, this principle can be applied in various ways, depending on which population of X and Y values is being considered. Let us consider the artificial dataset of Table 3.2. The first two columns in the table contain the identification numbers of the macro-unit (j) and the micro-unit (i). The other four columns contain the data. By X_ij is denoted the variable observed for micro-unit i in macro-unit j, and by X̄_.j the average of the X_ij values for group j. The analogous notation holds for the dependent variable Y.

Table 3.2 Artificial data on 5 macro-units, each with 2 micro-units

    j  i  X_ij  X̄_.j  Y_ij  Ȳ_.j
    1  1   1     2     5     6
    1  2   3     2     7     6
    2  1   2     3     4     5
    2  2   4     3     6     5
    3  1   3     4     3     4
    3  2   5     4     5     4
    4  1   4     5     2     3
    4  2   6     5     4     3
    5  1   5     6     1     2
    5  2   7     6     3     2

One might be interested in the relation between Y_ij and X_ij. The linear regression line of Y_ij on X_ij at the micro-level for the total group of 10 observations is

    Y_ij = 5.33 − 0.33 X_ij + R .  (Total regression)

This is the disaggregated relation, since the nesting of micro-units in macro-units is not taken into account. The regression coefficient is −0.33.
The aggregated relation is the linear regression relationship at the macro-level of the group means Ȳ_.j on the group means X̄_.j. This regression line is

    Ȳ_.j = 8.00 − 1.00 X̄_.j + R .  (Regression between group means)

The regression coefficient now is −1.00.

A third option is to describe the relation between Y_ij and X_ij within each single group. Assuming that the regression coefficient has the same value in each group, this is the same as the regression of the within-group Y-deviations (Y_ij − Ȳ_.j) on the X-deviations (X_ij − X̄_.j). This within-group regression line is given by

    Y_ij = Ȳ_.j + 1.00 (X_ij − X̄_.j) + R ,  (Regression within groups)

with a regression coefficient of +1.00.

Finally, and that is how the artificial dataset was constructed, Y_ij can be written as a function of the within-group and between-group relations between Y and X. This amounts to putting together the between-group and the within-group regression equations. The result is

    Y_ij = 8.00 − 1.00 X̄_.j + 1.00 (X_ij − X̄_.j) + R    (3.22)
         = 8.00 + 1.00 X_ij − 2.00 X̄_.j + R .  (Multilevel regression)

Figure 3.4 graphically depicts the total, within-group, and between-group relations between the variables.

[Figure 3.4 Within, between, and total relations.]

The five parallel ascending lines represent the within-group relation between Y and X. The steep descending line represents the relation at the aggregate level (i.e., between the group means), whereas the almost horizontal descending line represents the total relationship, i.e., the micro-level relation between X and Y ignoring the hierarchical structure.

The within-group regression coefficient is +1 whereas the between-group coefficient is −1. The total regression coefficient, −0.33, is in between these two. This illustrates that within-group and between-group relations can be completely different, even have opposite signs.
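The three regression coefficients can be computed directly from grouped data; a minimal sketch follows. The data values are an assumption, chosen to be consistent with the regression lines reported in the text (group means on a line of slope −1, within-group deviations on a slope of +1).

```python
# Artificial clustered data as {group: [(x, y), ...]}; the values are an
# assumption consistent with the regression results reported in the text.
groups = {1: [(1, 5), (3, 7)], 2: [(2, 4), (4, 6)], 3: [(3, 3), (5, 5)],
          4: [(4, 2), (6, 4)], 5: [(5, 1), (7, 3)]}

def slope(pairs):
    """Least-squares regression slope of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# Total (disaggregated) regression: pool all observations
all_pairs = [p for ps in groups.values() for p in ps]
b_total = slope(all_pairs)                                  # -0.33

# Between-group regression: regress group means on group means
means = [(sum(x for x, _ in ps) / len(ps), sum(y for _, y in ps) / len(ps))
         for ps in groups.values()]
b_between = slope(means)                                    # -1.00

# Within-group regression: regress the deviation scores
deviations = [(x - mx, y - my)
              for ps, (mx, my) in zip(groups.values(), means)
              for x, y in ps]
b_within = slope(deviations)                                # +1.00
```

The three slopes reproduce the coefficients −0.33, −1.00, and +1.00 discussed above.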
The true relation between Y and X is revealed only when the within- and between-group relations are considered jointly, i.e., by the multilevel regression. In the multilevel regression, both the between-group and the within-group regression coefficients play a role. Thus there are many different ways to describe the data, of which the multilevel regression is the best, because it describes how the data were generated. In this artificial data set, the residual R is 0; real data have, of course, non-zero residuals.

A population model

The interplay of within-group and between-group relations can be better understood on the basis of a population model such as (3.1). Since this section is about two variables, X and Y, a bivariate version of the model is needed. In this model, group (macro-unit) j has specific main effects U_xj and U_yj for variables X and Y, and associated with individual (micro-unit) i are the within-group deviations R_xij and R_yij. The population means are denoted μ_x and μ_y, and it is assumed that the U's and the R's have population means 0. The U's on one hand and the R's on the other are independent. The formula for X and Y then reads

    X_ij = μ_x + U_xj + R_xij ,
    Y_ij = μ_y + U_yj + R_yij .    (3.23)

For the formulae that refer to relations between group means X̄_.j and Ȳ_.j, it is assumed that each group has the same size, denoted by n.

The correlation between the group effects is defined as

    ρ_between = ρ(U_xj, U_yj) ,

while the correlation between the individual deviations is defined by

    ρ_within = ρ(R_xij, R_yij) .

One of the two variables X and Y might have a stronger group nature than the other, so that the intraclass correlation coefficients for X and Y may be different. These are denoted by ρ_Ix and ρ_Iy, respectively. The within-group regression coefficient is the regression coefficient within each group of Y on X, assumed to be the same for each group.
This coefficient is denoted by β_within and defined by the within-group regression equation

    Y_ij = μ_y + U_yj + β_within (X_ij − μ_x − U_xj) + R_ij .    (3.24)

This equation may be regarded as an analysis of covariance (ANCOVA) model for Y. Hence the within-group regression coefficient also is the effect of X in the ANCOVA approach to this multilevel data.

The within-group regression coefficient is also obtained when the Y-deviation values (Y_ij − Ȳ_.j) are regressed on the X-deviation values (X_ij − X̄_.j). In other words, it is also the regression coefficient obtained in the disaggregated analysis of the within-group deviation scores.

The population between-group regression coefficient is defined as the regression coefficient for the group effects U_yj on U_xj. This coefficient is denoted by β_between U and is defined by the regression equation

    U_yj = β_between U U_xj + R_j ,

where R_j now is the group-level residual.

The total regression coefficient of Y on X is the regression coefficient in the disaggregated analysis, i.e., when the data are treated as single-level data:

    Y_ij = μ_y + β_total (X_ij − μ_x) + R .

The total regression coefficient can be expressed as a weighted mean of the within- and the between-group coefficients, where the weight for the between-group coefficient is just the intraclass correlation for X. The formula is

    β_total = ρ_Ix β_between U + (1 − ρ_Ix) β_within .    (3.25)

This expression implies that if X is a pure macro-level variable (so that ρ_Ix = 1), the total regression coefficient is equal to the between-group coefficient. Conversely, if X is a pure micro-level variable we have ρ_Ix = 0, and the total regression coefficient is just the within-group coefficient.
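Formula (3.25) is a simple weighted mean, and its two extreme cases can be checked in a one-line sketch (the numerical values are illustrative only):

```python
def beta_total(rho_Ix, beta_between_u, beta_within):
    """Formula (3.25): the total regression coefficient as a weighted mean
    of the between-group and within-group coefficients, with the intraclass
    correlation of X as the weight."""
    return rho_Ix * beta_between_u + (1 - rho_Ix) * beta_within

# The two extremes discussed above:
print(beta_total(1.0, -1.0, 1.0))   # pure macro-level X -> between coefficient, -1.0
print(beta_total(0.0, -1.0, 1.0))   # pure micro-level X -> within coefficient, 1.0
```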
Usually X will have both a within-group and a between-group component, and the total regression coefficient will be somewhere between the two level-specific regression coefficients.

Regressions between observed group means⁴

At the macro-level, the regression of the observed group means Ȳ_.j on X̄_.j is not the same as the regression of the 'true' group effects U_yj on U_xj. This is because the observed group averages X̄_.j and Ȳ_.j can be regarded as the 'true' group means to which some error, R̄_xj and R̄_yj, has been added.⁵ Therefore the regression coefficient for the observed group means is not exactly equal to the (population) between-group regression coefficient, but is given by

    β_between group means = λ_xj β_between U + (1 − λ_xj) β_within ,    (3.26)

where λ_xj is the reliability of the group means X̄_.j for measuring μ_x + U_xj, given by equation (3.20) applied to the X variable. If n is large the reliability will be close to unity, and the regression coefficient for the group means will be close to the between-group regression coefficient at the population level.

Combining equations (3.25) and (3.26) leads to another expression for the total regression coefficient. This expression uses the correlation ratio η², which is defined as the ratio of the intraclass correlation coefficient to the reliability of the group mean,

    η² = ρ_I / λ_j = (τ² + σ²/n) / (τ² + σ²) .    (3.27)

For large group sizes the reliability approaches unity, so the correlation ratio approaches the intraclass correlation.

⁴ This section may be skipped by the cursory reader.
⁵ The same phenomenon is the basis of formulae (3.9) and (3.11).
In the data, the correlation ratio η²_x is the same as the proportion of variance in X_ij explained by the group means, and it can be computed as the ratio of the between-group sum of squares to the total sum of squares in an analysis of variance, i.e.,

    η²_x = Σ_j n_j (X̄_.j − X̄_..)² / Σ_ij (X_ij − X̄_..)² .

The combined expression indicates how the total regression coefficient depends on the within-group regression coefficient and the regression coefficient between the group means:

    β_total = η²_x β_between group means + (1 − η²_x) β_within .    (3.28)

Expression (3.28) was first given by Duncan et al. (1961) and can be found also, e.g., in Pedhazur (1982, p. 538). A multivariate version was given by Maddala (1971). To apply this equation to an unbalanced data set, the regression coefficient between group means must be calculated in a weighted regression, group j having weight n_j.

Example 3.5 Within- and between-group regressions for artificial data.
In the artificial example given above, the total sum of squares of X_ij as well as Y_ij is 30, and the between-group sums of squares for X and Y are 20. Hence the correlation ratios are η²_x = η²_y = 20/30 = 0.667. If we use this value and plug it into formula (3.28), we find

    β_total = 0.667 × (−1.00) + (1 − 0.667) × 1.00 = −0.33 ,

which is indeed what we found earlier.

3.6.2 Correlations

The quite extreme nature of the artificial data set of Table 3.2 becomes apparent when we consider the correlations.

The group means (X̄_.j, Ȳ_.j) lie on a decreasing straight line, so the observed between-group correlation, which is defined as the correlation between the group means, is R_between = −1. The within-group correlation is defined as the correlation within the groups, assuming that this correlation is the same within each group. This can be calculated as the correlation coefficient between the within-group deviation scores X̃_ij = (X_ij − X̄_.j) and Ỹ_ij = (Y_ij − Ȳ_.j). In this data set the deviation scores (X̃_ij, Ỹ_ij) are (−1, −1) for i = 1 and (+1, +1) for i = 2, so the within-group correlation here is R_within = +1.
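These correlations can be verified numerically; a minimal sketch follows, using data values that are an assumption chosen to be consistent with the summary statistics reported in the text.

```python
import math

# Artificial clustered data (an assumption consistent with the text)
groups = {1: [(1, 5), (3, 7)], 2: [(2, 4), (4, 6)], 3: [(3, 3), (5, 5)],
          4: [(4, 2), (6, 4)], 5: [(5, 1), (7, 3)]}

def correlation(pairs):
    """Ordinary product-moment correlation of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

means = [(sum(x for x, _ in ps) / len(ps), sum(y for _, y in ps) / len(ps))
         for ps in groups.values()]
r_between = correlation(means)                                    # -1
deviations = [(x - mx, y - my)
              for ps, (mx, my) in zip(groups.values(), means)
              for x, y in ps]
r_within = correlation(deviations)                                # +1
r_total = correlation([p for ps in groups.values() for p in ps])  # -0.33
```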
Thus we see that the within-group as well as the between-group correlations are perfect, but of opposite signs. The disaggregated correlation, i.e., the correlation computed without taking the nesting structure into account, is R_total = −0.33. (This is the same as the value for the regression coefficient in the total (disaggregated) regression equation, because X and Y have the same variance.)

The population model again

Recall that in the population model mentioned above, the correlation coefficient between the group effects U_xj and U_yj was defined as ρ_between, and the correlation between the individual deviations R_xij and R_yij was defined as ρ_within. The intraclass correlation coefficients for X and Y were denoted by ρ_Ix and ρ_Iy. How do these correlations between unobservable variables relate to correlations between observables? The population within-group correlation is also the correlation between the within-group deviation scores (X̃_ij, Ỹ_ij):

    ρ(X̃_ij, Ỹ_ij) = ρ_within .    (3.29)

For the between-group coefficient the relation is, as always, a bit more complicated. The correlation coefficient between the group means is equal to

    ρ(X̄_.j, Ȳ_.j) = √(λ_xj λ_yj) ρ_between + √((1 − λ_xj)(1 − λ_yj)) ρ_within ,    (3.30)

where λ_xj and λ_yj are the reliability coefficients of the group means (see equation (3.20)). For large group sizes the reliabilities will be close to 1 (provided the intraclass correlations are larger than 0), so that the correlation between the group means will then be close to ρ_between. This formula shows that the correlation between group means is higher than the total correlation, i.e., aggregation will increase correlation, only if the between-group correlation coefficient is larger than the within-group correlation coefficient.
Therefore, the reason that correlations between group means are often higher than correlations between individuals is not a mathematical consequence of aggregation, but the consequence of the processes at the group level (which determine the value of ρ_between) being different from the processes at the individual level (which determine the value of ρ_within).

The total correlation (i.e., the correlation in the disaggregated analysis) is a combination of the within-group and the between-group correlation coefficients. The combination depends on the intraclass correlations, as shown by the formula

    ρ(X_ij, Y_ij) = √(ρ_Ix ρ_Iy) ρ_between + √((1 − ρ_Ix)(1 − ρ_Iy)) ρ_within .    (3.31)

If the intraclass correlations are low, then X and Y have primarily the nature of level-one variables, and the total correlation will be close to the within-group correlation; on the other hand, if the intraclass correlations are close to 1, then X and Y have almost the nature of level-two variables, and the total correlation is close to the between-group correlation.

If the intraclass correlations of X and Y are equal and denoted by ρ_I, then (3.31) can be formulated more simply as

    ρ(X_ij, Y_ij) = ρ_I ρ_between + (1 − ρ_I) ρ_within .

In this case the weights ρ_I and (1 − ρ_I) add up to 1, and the total correlation coefficient is necessarily between the within-group and the between-group correlation coefficients. In general, however, this is not always true, because the sum of the weights in (3.31) is smaller than 1 if the intraclass correlations for X and Y are different. For example, if one of the intraclass correlations is close to 0 and the other is close to 1, then one variable is mainly a level-one variable and the other mainly a level-two variable. Formula (3.31) then implies that the total correlation coefficient is close to 0, no matter how large the within-group and the between-group correlations.
This is rather obvious, since a level-one variable with hardly any between-group variability cannot be substantially correlated with a variable with hardly any within-group variability.

Correlations between observed group means⁶

Analogous to the regression coefficients, also for the correlation coefficients we can combine the equations to see how the total correlation depends on the within-group correlation and the correlation between the group means. This yields

    ρ(X_ij, Y_ij) = η_x η_y ρ(X̄_.j, Ȳ_.j) + √((1 − η²_x)(1 − η²_y)) ρ_within .    (3.32)

This expression was given by Knapp (1977) and can also be found, e.g., in Pedhazur (1982, p. 536). When it is applied to an unbalanced data set, the correlation between the group means should be calculated with weights n_j.

It may be noted that many texts do not make the explicit distinction between population and data. If the population and the data are equated, then the reliabilities are unity, the correlation ratios are the same as the intraclass correlations, and the population between-group correlation is equal to the correlation between the group means. The equation for the total correlation then becomes

    R_total = η_x η_y R_between + √((1 − η²_x)(1 − η²_y)) R_within .    (3.33)

When parameter estimation is being considered, however, confusion may be caused by neglecting this distinction.

Example 3.6 Within- and between-group correlations for artificial data.
The correlation ratios in the artificial data example are η²_x = η²_y = 0.667, and we also saw above that R_within = +1 and R_between = −1. Filling in these numbers in formula (3.33) yields

    R_total = 0.667 × (−1.00) + (1 − 0.667) × 1.00 = −0.33 ,

which indeed is the value found earlier for the total correlation.

3.6.3 Estimation of within- and between-group correlations

There are several ways of obtaining estimates for the correlation parameters treated in this section.

A quick method is based on the intraclass correlations, estimated as in Section 3.3.1 or from the output of a multilevel computer program, and the observed within-group and total correlations.

⁶ The remainder of Section 3.6 may also be skipped by the cursory reader.
The observed within-group correlation is just the ordinary correlation coefficient between the within-group deviations (X_ij − X̄_.j) and (Y_ij − Ȳ_.j), and the total correlation is the ordinary correlation coefficient between X and Y in the whole data set.

The quick method then is based on (3.29) and (3.31). This leads to the estimates

    ρ̂_within = R_within ,    (3.34)

    ρ̂_between = ( R_total − √((1 − ρ̂_Ix)(1 − ρ̂_Iy)) R_within ) / √(ρ̂_Ix ρ̂_Iy) .    (3.35)

This is not the statistically most efficient method, but it is straightforward and leads to good results if sample sizes are not too small.

The ANOVA method (Searle, 1956) goes via the variances and covariances, based on the definition

    ρ(X, Y) = cov(X, Y) / √( var(X) var(Y) ) .

Estimating the within- and between-group variances was discussed in Section 3.3.1. The within- and between-group covariance between X and Y can be estimated by formulae analogous to (3.8), (3.9), (3.10) and (3.11), replacing the squares (Y_ij − Ȳ_.j)² and (Ȳ_.j − Ȳ_..)² by the cross-products (X_ij − X̄_.j)(Y_ij − Ȳ_.j) and (X̄_.j − X̄_..)(Ȳ_.j − Ȳ_..). It is shown in Searle, Casella, and McCulloch (1992) how these calculations can be replaced by a calculation involving only sums of squares.

Finally, the maximum likelihood (ML) and residual maximum likelihood (REML) methods can be used. These are the most often used estimation methods (cf. Section 4.6) and are implemented in multilevel software. Chapter 13 describes multivariate multilevel models; the correlation coefficient between two variables refers to the simplest multivariate situation, viz. bivariate data. Formula (13.5) represents the model which allows the estimation of within-group and between-group correlations.

Example 3.7 Within- and between-group correlations for a language and an arithmetic test.
The data set consists of the scores of 2287 grade-8 pupils in 131 schools on a language test (Y) and an arithmetic test (X). Within-school correlations reflect how strongly language and arithmetic proficiency go together at the level of the individual pupils.
The between-school correlation is determined rather by school policies and by the composition of the school population (e.g., the neighborhood from which the school draws its pupils), so the within-school correlations and the between-school correlation are determined by quite distinct processes. The observed within-school, between-school, and total correlations are R_within = 0.63, R_between = 0.88, and R_total = 0.69.

The ANOVA estimates are calculated along the lines of Section 3.3.1. The observed within-group variances and covariance, which are also the estimated within-group population variances and covariance (cf. (3.10)), are

    σ̂²_x = S²_within,x = 32.233,  σ̂²_y = S²_within,y = 64.319,  σ̂_xy = S_within,xy = 28.516.

The observed between-group variances and covariance are S²_between,x = 13.413, S²_between,y = 20.580, S_between,xy = 14.556. For this data set, ñ = 17.435. According to (3.11) the estimated population between-group variances and covariance are

    τ̂²_x = 11.564,  τ̂²_y = 16.890,  τ̂_xy = 12.92.

From these estimated variances, the intraclass correlations are computed as ρ̂_Ix = 11.564/(11.564 + 32.233) = 0.264 and ρ̂_Iy = 16.890/(16.890 + 64.319) = 0.208.

The 'quick' method uses the observed within and total correlations and the intraclass correlations. The resulting estimates are ρ̂_within = 0.626 from (3.34) and ρ̂_between = 0.922 from (3.35).

The ANOVA estimates for the within- and between-group correlations use the estimated within- and between-group population variances and covariances. The results are ρ̂_within = 28.516/√(32.233 × 64.319) = 0.626 and ρ̂_between = 12.92/√(11.564 × 16.890) = 0.924.

The ML estimates for the within-group and between-group variances and covariances are obtained from a multilevel program. These lead to the estimated correlations ρ̂_within = 0.621 and ρ̂_between = 0.938.

It can be concluded that, for this large data set, the three methods all yield practically the same results. The within-school (pupil level) correlation, 0.62, is substantial. Thus pupil language and arithmetic capacities are closely correlated. The between-school correlation, 0.92 or 0.93, is very high.
This demonstrates that the school policies, the teaching quality, and the processes that determine the composition of the school population have practically the same effect on the pupils' language as on their arithmetic performance. Note that the observed between-school correlation, 0.88, is not quite as high because of the attenuation caused by unreliability, which follows from (3.30).

3.7 Combination of within-group evidence

When research focuses on within-group relations and several groups (macro-units) were investigated, it is often desired to combine the evidence gathered in the various groups. For example, consider a study where a relation between work satisfaction (X) and sickness leave (Y) is studied in several organizations. If the organizations are sufficiently similar, or if they can be considered as a sample from some population of organizations, then the data can be analysed according to the hierarchical linear model of the next chapter. If, however, the organizations are too diverse and not representative of any population, then it still can be relevant to conduct one test for the relation between X and Y in which the evidence from all organizations is combined.

Another example is meta-analysis, the statistically based combination of several studies. There exist many texts about meta-analysis, e.g., Hedges and Olkin (1985), Rosenthal (1991), Hedges (1992). A number of publications may contain information about the same phenomenon, and it can be important to combine this information in a single test. If the studies leading to the publications may be regarded as a sample from some population of studies, then again methods based on the hierarchical linear model can be used. The hierarchical linear model is treated in the following chapters; applications to meta-analysis are given, e.g., by Bryk and Raudenbush (1992, Chapter 7).
If the studies collected cannot be regarded as a sample from a population, then still the methods mentioned below may be used.

There exist various methods for combining evidence from several studies, based only on the assumption that this evidence is statistically independent. They can be applied already if the number of independent studies is at least two. The least demanding method is Fisher's combination of p-values (Fisher, 1932; Hedges and Olkin, 1985). This method assumes that in each of N studies a null hypothesis is tested, which results in independent p-values p₁, ..., p_N. The combined null hypothesis is that in all of the studies the null hypothesis holds; the combined alternative hypothesis is that in at least one of the studies the alternative hypothesis holds. It is not required that the N independent studies used the same operationalizations or methods of analysis, only that it is meaningful to test this combined null hypothesis. This hypothesis can be tested by minus twice the sum of the natural logarithms of the p-values,

    −2 Σ_{j=1}^{N} ln(p_j) ,    (3.36)

which under the combined null hypothesis has a chi-squared distribution with 2N degrees of freedom. Because of the shape of the logarithmic function, this combined statistic will already have a large value if at least one of the p-values is very small.

A stronger combination procedure can be achieved if the several studies all lead to estimates of theoretically the same parameter, denoted here by θ. Suppose that the jth study yields a parameter estimate θ̂_j with standard error s_j, and that all the studies are statistically independent. Then the combined estimate with the smallest standard error is the weighted average with weights inversely proportional to s_j²,

    θ̂ = ( Σ_j θ̂_j / s_j² ) / ( Σ_j 1 / s_j² ) ,    (3.37)

with standard error

    SE(θ̂) = √( 1 / Σ_j 1/s_j² ) .    (3.38)

For example, if standard errors are inversely proportional to the square root of sample size, s_j = c/√n_j for some value c, then the weights are directly proportional to the sample sizes and the standard error of the combined estimate is c/√(Σ_j n_j). If the individual estimates are approximately normally distributed, and also (even when the estimates are not nearly normally distributed) when N is large, the ratio

    θ̂ / SE(θ̂)

can be tested against a standard normal distribution.

The choice between these two combination methods can be made as follows. The combination of estimates, expressed by (3.37), is more suitable if the true parameter value (the estimated θ) is approximately the same in each of the combined studies, while Fisher's method (3.36) for combining p-values is more suitable if it is possible that the effect sizes are very different between the N studies. More combination methods can be found in the literature about meta-analysis, e.g., Hedges and Olkin (1985).

Example 3.8 Gossip behavior in six organizations.
Wittek and Wielers (1998) investigated effects of informal social network structures on gossip behavior in six work organizations. One of the hypotheses tested was that individuals tend to gossip more if they are involved in more coalition triads. An individual A is involved in a coalition triad with two others, B and C, if A has a positive relation with B while A and B both have a negative relation with C. The six organizations studied were so different that an approach following the lines of the hierarchical linear model was not considered appropriate. For each organization separately, a multiple regression was carried out to estimate the effect of the number of coalition triads in which a person was involved on a measure for gossip behavior, controlling for some relevant other variables.

The p-values obtained were 0.015, 0.4, 0.19, 0.13, 0.25, and 0.42.
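Both combination procedures can be sketched in a few lines of code; the p-values below are those of Example 3.8, while the estimates and standard errors fed to the weighted combination are purely illustrative numbers.

```python
import math

def fisher_combination(p_values):
    """Fisher's combination statistic (3.36): -2 * sum(ln p_j), to be
    referred to a chi-squared distribution with 2N degrees of freedom."""
    return -2 * sum(math.log(p) for p in p_values)

def combine_estimates(estimates, std_errors):
    """Inverse-variance weighted combination, formulas (3.37) and (3.38)."""
    weights = [1 / s ** 2 for s in std_errors]
    theta = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return theta, se

# Example 3.8: six p-values, so 2 * 6 = 12 degrees of freedom
stat = fisher_combination([0.015, 0.4, 0.19, 0.13, 0.25, 0.42])
print(round(stat, 1))  # about 22.1; the 0.05 critical value of
                       # chi-squared with 12 df is 21.03

# Two hypothetical estimates of the same parameter, equal standard errors
print(combine_estimates([1.0, 3.0], [1.0, 1.0]))  # (2.0, 0.707...)
```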
Only one of these is significant (i.e., less than 0.05), and the question is whether this combination of six p-values would be unlikely under the combined null hypothesis, which states that in all six organizations the effect of coalition triads on gossip is absent. Equation (3.36) yields the test statistic χ² = 22.00 with d.f. = 2 × 6 = 12, p < 0.05. Thus the result is significant, which shows that in the combined data there is indeed evidence for an effect of coalition triads on gossip behavior, even though this effect is significant in only one of the organizations considered separately.

In this chapter we have presented some statistics to describe hierarchically structured data. We mostly gave examples with balanced data, i.e., an equal number of micro-units per macro-unit. In practice, most data sets are unbalanced (except mainly for some experimental designs with longitudinal data on a fixed set of occasions). Our main purpose was to demonstrate that clustering in the data, i.e., dependent observations, is not only a nuisance that should be taken care of statistically, but can also be a very interesting phenomenon, worth further study. Finally, before proceeding with the introduction of the hierarchical linear model of multilevel analysis, it should be borne in mind that usually there are explanatory variables, and statistically it is the independence not of the observations but of the residuals (the unexplained part of the dependent variable) which is the basic assumption of single-level linear models.

4 The Random Intercept Model

In the preceding chapters it was argued that the best way to analyse multilevel data is an approach that represents within-group as well as between-group relations within a single analysis, where 'group' refers to the units at the higher levels of the nesting hierarchy.
Very often it makes sense to use probability models to represent the variability within and between groups, in other words, to conceive of the unexplained variation within groups and the unexplained variation between groups as random variability. For a study of pupils within schools, e.g., this means that not only unexplained variation between pupils, but also unexplained variation between schools, is regarded as random variability. This can be expressed by statistical models with so-called random coefficients.

The hierarchical linear model is such a random coefficient model for multilevel, or hierarchically structured, data and is by now the main tool for multilevel analysis. Chapters 4 and 5 treat the definition of this model and the interpretation of the model parameters. The present chapter discusses the simpler case of the random intercept model; Chapter 5 treats the general hierarchical linear model, which also has random slopes. Testing the various components of the model is treated in Chapter 6. The later chapters treat various elaborations and other aspects of the hierarchical linear model. The focus of this treatment is on the two-level case, but Chapters 4 and 5 also contain sections on models with more than two levels of variability.

For the sake of concreteness, we refer to the level-one units as 'individuals', and to the level-two units as 'groups'. The reader may fill in other names for the units if she has a different application in mind; e.g., if the application is to repeated measurements, 'measurement occasions' for level-one units and 'subjects' for level-two units. The nesting situation of measurement occasions within individuals is given special attention in Chapter 12. The number of groups in the data is denoted N; the number of individuals in the groups may vary from group to group, and is denoted n_j for group j (j = 1, 2, ..., N). The total number of individuals is denoted M = Σ_j n_j.
The hierarchical linear model is a type of regression model that is particularly suitable for multilevel data. It differs from the usual multiple regression model in the fact that the equation defining the hierarchical linear model contains more than one error term: one (or more) for each level. As in all regression models, there is a distinction between dependent and explanatory variables: the aim is to construct a model that expresses how the dependent variable depends on, or is explained by, the explanatory variables. Instead of explanatory variable, the names predictor variable and independent variable are also in use. The dependent variable must be a variable at level one: the hierarchical linear model is a model for explaining something that happens at the lowest, most detailed level.

In this section, we assume that one explanatory variable is available at either level. In the notation, we distinguish the following types of indices and variables:

    j is the index for the groups (j = 1, ..., N);
    i is the index for the individuals within the groups (i = 1, ..., n_j).

The indices can be regarded as case numbers; note that the numbering of the individuals starts again in every group. For example, individual 1 in group 1 is different from individual 1 in group 2. For individual i in group j, we have the following variables:

    Y_ij is the dependent variable;
    x_ij is the explanatory variable at the individual level;

and for group j, we have that

    z_j is the explanatory variable at the group level.

To understand the notation, it is essential to realize that the indices i and j indicate precisely on what the variables depend. The notation Y_ij indicates that the value of variable Y depends on the group j and also on the individual i. (Since the individuals are nested within groups, the index i makes sense only if it is accompanied by the index j: to identify individual 1, we must know to which group we refer!)
The notation z_j, on the other hand, indicates that the value of Z depends only on the group j, and not on the individual i. The basic idea of multilevel modeling is that the outcome variable Y has an individual as well as a group aspect. This carries through also for other level-one variables. The X variable, although it is a variable at the individual level, may also contain a group aspect. The mean of X in one group may be different from the mean in another group. In other words, X may (and often will) have a positive between-group variance. Stated more generally, the compositions of the various groups with respect to X may differ from one another. It should be kept in mind that explanatory variables that are defined at the individual level often also contain some information about the groups.

4.1 A regression model: fixed effects only

The simplest model is one without the random effects that are characteristic for multilevel models; it is the classical model of multiple regression. This model states that the dependent variable, Y_ij, can be written as the sum of a systematic part (a linear combination of the explanatory variables) and a random residual,

    Y_ij = β_0 + β_1 x_ij + β_2 z_j + R_ij .    (4.1)

In this model equation, the β's are the regression parameters: β_0 is the intercept (i.e., the value obtained if x_ij as well as z_j are 0), β_1 is the coefficient for the individual variable X, while β_2 is the coefficient for the group variable Z. The variable R_ij is the residual (sometimes called error); an essential requirement in regression model (4.1) is that all residuals are mutually independent and have a zero mean; a convenient assumption is that in all groups they have the same variances (the homoscedasticity assumption) and are normally distributed. This model has a multilevel nature only to the extent that one of the explanatory variables refers to the lower and the other to the higher level.
Model (4.1) can be extended to a regression model where not only main effects of X and Z, but also the cross-level interaction effect is present. This type of interaction is discussed more elaborately in the following chapter. It means that the product variable ZX = Z × X is added to the list of explanatory variables. The resulting regression equation is

    Y_ij = β_0 + β_1 x_ij + β_2 z_j + β_3 z_j x_ij + R_ij .    (4.2)

These models pretend, as it were, that all the multilevel structure in the data is fully explained by the group variable Z and the individual variable X. If two individuals are being considered and their X- and Z-values are given, then for their Y-value it is immaterial whether they belong to the same, or to different groups.

Models of the type (4.1) and (4.2), and their extensions with more explanatory variables at either or both levels, have in the past been widely used in research on data with a multilevel structure. They are convenient to handle for anybody who knows multiple regression analysis. Is anything wrong with them? YES! For data with a meaningful multilevel structure, it is practically always unfounded to make the a priori assumption that all of the group structure is represented by the explanatory variables. Given that there are only N groups, it is unfounded to do as if one has n_1 + n_2 + ... + n_N independent replications. There is one exception: when all group sample sizes n_j are equal to 1, the researcher does not need to have any qualms about using these models because the nesting structure, although it may be present in the population, is not present in the data. Designs with n_j = 1 can be used when the explanatory variables have been chosen on the basis of substantive theory, and the focus of the research is on the regression coefficients rather than on how the variability of Y is partitioned into within-group and between-group variability.
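To make the fixed-effects benchmark concrete, the following sketch (hypothetical simulated data, numpy only, our own variable names) fits model (4.2) by ordinary least squares, with the product variable ZX simply added as an extra column of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: N = 20 groups, n = 15 individuals per group
N, n = 20, 15
j = np.repeat(np.arange(N), n)           # group index j for every individual
x = rng.normal(size=N * n)               # level-one variable x_ij
z = rng.normal(size=N)[j]                # level-two variable z_j, constant within group

beta = np.array([1.0, 2.0, 0.5, -0.3])   # true beta_0, beta_1, beta_2, beta_3
y = beta[0] + beta[1] * x + beta[2] * z + beta[3] * z * x \
    + rng.normal(scale=0.5, size=N * n)  # residual R_ij

# Design matrix of model (4.2): intercept, x_ij, z_j, and the product z_j * x_ij
design = np.column_stack([np.ones(N * n), x, z, z * x])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta_hat)                          # OLS estimates, close to beta
```

Note that this OLS fit pretends the 300 observations are independent replications; with n_j > 1 that is exactly the unfounded assumption criticized above.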
In designs with group sizes larger than 1, however, the nesting structure often cannot be represented completely in the regression model by the explanatory variables. Additional effects of the nesting structure can be represented by letting the regression coefficients vary from group to group. Thus, the coefficients β_0 and β_1 in equation (4.1) must depend on the group, denoted by j. This is expressed in the formula by an extra index j for these coefficients. This yields the model

    Y_ij = β_0j + β_1j x_ij + β_2 z_j + R_ij .    (4.3)

Groups j can have a higher (or lower) value of β_0j, indicating that, for any given value of X, they tend to have higher (or lower) values of the dependent variable Y. Groups can also have a higher or lower value of β_1j, which indicates that the effect of X on Y is higher or lower. Since Z is a group-level variable, it would not make much sense conceptually to let the coefficient of Z depend on the group. Therefore β_2 is left unaltered in this formula.

The multilevel models treated in the following sections and in Chapter 5 contain diverse specifications of the varying coefficients β_0j and β_1j. The simplest version of model (4.3) is the version where β_0j and β_1j are constant (do not depend on j), i.e., the nesting structure has no effect, and we are back at model (4.1). In this case, the OLS (ordinary least squares) regression models of type (4.1) and (4.2) offer a good approach to analysing the data. If, on the other hand, the coefficients β_0j and β_1j do depend on j, then these regression models may give misleading results. Then it is preferable to take into account how the nesting structure influences the effects of X and Z on Y. This can be done using the random coefficient model of this and the following chapters. In this chapter the case is treated where the intercept β_0j depends on the group; the next chapter treats the case where also the regression coefficient β_1j is group-dependent.
4.2 Variable intercepts: fixed or random parameters?

Let us first consider only the regression on the level-one variable X. A first step towards modeling between-group variability is to let the intercept vary between groups. This reflects that some groups tend to have, on average, higher responses Y and others tend to have lower responses. This model is halfway between (4.1) and (4.3) (but omitting the effect of Z), in the sense that the intercept β_0j does depend on the group but the regression coefficient of X, β_1, is constant:

    Y_ij = β_0j + β_1 x_ij + R_ij .    (4.4)

This is pictured in Figure 4.1.

[Figure 4.1 Different parallel regression lines for groups 1, 2, 3. The point Y_12 is indicated with its residual R_12.]

The group-dependent intercept can be split into an average intercept and the group-dependent deviation:

    β_0j = γ_00 + U_0j .

For reasons that will become clear in Chapter 5, the notation for the regression coefficients is changed here, and the average intercept is called γ_00 while the regression coefficient for X is called γ_10. Substitution now leads to the model

    Y_ij = γ_00 + γ_10 x_ij + U_0j + R_ij .    (4.5)

The values U_0j are the main effects of the groups: conditional on an individual having a given X-value and being in group j, the Y-value is expected to be U_0j higher than in the average group. Model (4.5) can be understood in two ways:

(1) As a model where the U_0j are fixed parameters, N in number, of the statistical model. This is relevant if the groups j refer to categories each with its own distinct interpretation, e.g., classification according to gender or religious denomination. In order to obtain identified parameters, the restriction that Σ_j U_0j = 0 can be imposed, so that effectively the groups lead to N − 1 regression parameters. This is the usual analysis of covariance model, in which the grouping variable is a factor.
(Since we prefer to use Greek letters for statistical parameters and capital Latin letters for random variables, we would prefer to use a Greek letter rather than U when we take this view of model (4.5).) In this specification it is impossible to use a group-level variable Z_j as an explanatory variable, because it would be redundant given the fixed group effects.

(2) As a model where the U_0j are independent identically distributed random variables. Note that the U_0j are the unexplained group effects, which also may be called group residuals, controlling for the effects of variable X. These residuals now are assumed to be randomly drawn from a population with zero mean and an a priori unknown variance. This is relevant if the effects of the groups j (which can be neighborhoods, schools, companies, etc.), controlling for the explanatory variables, can be considered to be exchangeable. There is one parameter associated with the U_0j in this statistical model: their variance. This is the simplest random coefficient regression model. It is called the random intercept model because the group-dependent intercept, γ_00 + U_0j, is a quantity which varies randomly from group to group. The groups are regarded as a sample from a population of groups. It is possible that there are group-level variables Z_j that express relevant attributes of the groups (such variables will be incorporated in the model in Section 4.4 and, more extensively, in Section 5.2).

The next section discusses how to determine which of these specifications of model (4.5) is appropriate in a given situation. Note that models (4.1) and (4.2) are OLS models or fixed effects models which do not take the nesting structure into account (except perhaps by the use of a group-level variable Z_j), whereas models of type (1) above are OLS models that do take the nesting structure into account.
The latter kind of OLS model has a much larger number of regression parameters, since in such models N groups lead to N − 1 regression coefficients. It is important to distinguish between these two kinds of OLS models in discussions about how to handle data with a nested structure.

4.2.1 When to use random coefficient models?

These two different interpretations of equation (4.5) imply that multilevel data can be approached in two different ways, using models with fixed or with random coefficients. Which of these two interpretations is the most appropriate in a given situation depends on the focus of the statistical inference, the nature of the set of N groups, the magnitudes of the group sample sizes n_j, and the population distributions involved.

1. If the groups are regarded as unique entities and the researcher wishes primarily to draw conclusions pertaining to each of these N specific groups, then it is appropriate to use the analysis of covariance model. Examples are groups defined by gender or ethnic background.

2. If the groups are regarded as a sample from a (real or hypothetical) population and the researcher wishes to draw conclusions pertaining to this population, then the random coefficient model is appropriate. Examples are the groupings mentioned in Table 2.2.

3. If the researcher wishes to test effects of group-level variables, the random coefficient model should be used. The reason is that the fixed effects model already 'explains' all differences between groups by the fixed effects, and there is no unexplained between-group variability left that could be explained by group-level variables. 'Random effects' and 'unexplained variability' are two ways of saying the same thing.

4. Especially for relatively small group sizes (in the range from 2 to 50 or 100), the random coefficient model has important advantages over the analysis of covariance model, provided that the assumptions about the random coefficients are reasonable. This can be understood as follows.
The random coefficient model includes the extra assumption of independent and identically distributed group effects U_0j. Stated less formally: the unexplained group effects are governed by 'mechanisms' that are roughly similar from one group to the next, and operate independently between the groups. The groups are said to be exchangeable. This assumption helps to counteract the paucity of the data that is implied by relatively small group sizes n_j. Since all group effects are assumed to come from the same population, the data from each group also have a bearing on inference with respect to the other groups, namely, through the information they provide about the population of groups.

In the analysis of covariance model, each of the U_0j is estimated as a separate parameter. If group sizes are small, then the data do not contain very much information about the values of the U_0j and there will be a considerable extent of overfitting in the analysis of covariance model: many parameters have large standard errors. This overfitting is avoided by using the random coefficient model, because the U_0j don't figure as parameters. If, on the other hand, the group sizes are large (say, 100 or more), then in the analysis of covariance the group-dependent parameters U_0j are estimated very precisely (with small standard errors), and the additional information that they come from the same population does not add much to this precision. In such a situation the difference between the results of the two approaches will be negligible.

5. The random coefficient model is mostly used with the additional assumption that the random coefficients, U_0j and R_ij in (4.5), are normally distributed. If this assumption is a very poor approximation, results obtained with this model may be unreliable. This can happen, e.g., when there are more outlying groups than can be accounted for by a normally distributed group effect U_0j with a common variance.
Other discussions about the choice between fixed and random coefficients can be found, e.g., in Searle et al. (1992, Section 1.4) and in Hsiao (1995, Section 8). An often mentioned condition for the use of random coefficient models is the restriction that the random coefficients should be independent of the explanatory variables. However, if there is a possible correlation between group-dependent coefficients and explanatory variables, this residual correlation can be removed, while continuing to use a random coefficient model, by also including effects of the group means of the explanatory variables. This is treated in Sections 4.5 and 9.2.1.

In order to choose between regarding the group-dependent intercepts U_0j as fixed statistical parameters and regarding them as random variables, a rule of thumb that often works in educational and social research is the following. This rule mainly depends on N, the number of groups in the data. If N is small, say, N < 10, then use the analysis of covariance approach: the problem with viewing the groups as a sample from a population is, in this case, that the data will contain only scanty information about this population. If N is not small, say N > 10, while n_j is small or intermediate, say n_j < 100, then use the random coefficient approach: 10 or more groups is usually too large a number to be regarded as unique entities. If the group sizes n_j are large, say n_j > 100, then it does not matter much which view we take. However, this rule of thumb should be taken with a large grain of salt and serves only to give a first hunch, not to determine the choice between fixed and random effects.

Populations and populations

When the researcher has indeed chosen to work with a random coefficient model, she must be aware that more than one population is involved in the multilevel analysis. Each level corresponds to a population!
For a study of pupils in schools, there is a population of schools and a population of pupils. For voters in municipalities, there is a population of municipalities and a population of voters; etc. Recall that in this book we take a model-based view. This implies that the populations are infinite hypothetical entities, which express 'what could be the case'. The random residuals and coefficients can be regarded as representing the effects of unmeasured variables and the approximate nature of the linear model. Randomness, in this sense, may be interpreted as unexplained variability.

Sometimes a random coefficient model can be used also when the population idea at the lower level is less natural. For example, in a study of longitudinal data where respondents are measured repeatedly, a multilevel model can be used with respondents at the second and measurements at the first level: measurements are nested within respondents. Then the population of respondents is an obvious concept. Measurements may be related to a population of time points. This will sometimes be natural, but not always. Another way of expressing the idea of random coefficient models in such a situation is to say that residual (non-explained) variability is present at level one as well as at level two, and this non-explained variability is represented by a probability model.

4.3 Definition of the random intercept model

In this text we treat the random coefficient view on model (4.5). This model, the random intercept model, is a simple case of the so-called hierarchical linear model. We shall not specifically treat the analysis of covariance model, and refer for this purpose to texts on analysis of variance and covariance (for example, Cook and Campbell, 1979, or Stevens, 1996). However, we shall encounter a number of considerations from the analysis of covariance that also play a role in multilevel modeling.
The empty model

Although this chapter follows an approach along the lines of regression analysis, the simplest case of the hierarchical linear model is the random effects analysis of variance model, in which the explanatory variables, X and Z, do not figure. This model only contains random groups and random variation within groups. It can be expressed as a model where the dependent variable is the sum of a general mean, γ_00, a random effect at the group level, U_0j, and a random effect at the individual level, R_ij:

    Y_ij = γ_00 + U_0j + R_ij .    (4.6)

This is the same model encountered before in formula (3.1). Groups with a high value of U_0j tend to have, on the average, high responses, whereas groups with a low value of U_0j tend to have, on the average, low responses. The random variables U_0j and R_ij are assumed to have a mean of 0 (the mean of Y_ij is already represented by γ_00), to be mutually independent, and to have variances var(R_ij) = σ² and var(U_0j) = τ_0². In the context of multilevel modeling (4.6) is called the empty model, because it contains not a single explanatory variable. It is important because it provides the basic partition of the variability in the data between the two levels. Given model (4.6), the total variance of Y can be decomposed as the sum of the level-two and the level-one variances,

    var(Y_ij) = var(U_0j) + var(R_ij) = τ_0² + σ² .

The covariance between two individuals (i and i', with i ≠ i') in the same group j is equal to the variance of the contribution U_0j that is shared by these individuals,

    cov(Y_ij, Y_i'j) = var(U_0j) = τ_0² ,

and their correlation is

    ρ(Y_ij, Y_i'j) = τ_0² / (τ_0² + σ²) .

This parameter is just the intraclass correlation coefficient ρ_I(Y) which we encountered already in Chapter 3. It can be interpreted in two ways: it is the correlation between two randomly drawn individuals in one randomly drawn group, and it is also the fraction of total variability that is due to the group level.
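The variance decomposition above is easy to turn into a helper; a minimal sketch in Python (the function name is ours), using for concreteness estimates of the size reported for the language-test data in Example 4.1:

```python
def intraclass_correlation(tau2, sigma2):
    """Intraclass correlation rho_I(Y) = tau_0^2 / (tau_0^2 + sigma^2):
    the fraction of total variability located at the group level."""
    return tau2 / (tau2 + sigma2)

# Estimates of the size reported for the language-test data (Example 4.1)
rho = intraclass_correlation(tau2=19.42, sigma2=64.57)
print(round(rho, 2))  # 0.23
```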
Example 4.1 Empty model for language scores in elementary schools.
In this example a data set is used that will be used in examples in many chapters of this book. The data set is concerned with grade 8 pupils (age about 11 years) in elementary schools in The Netherlands. After deleting pupils with missing values, the number of pupils is M = 2287, and the number of schools is N = 131. Class sizes in the original data set range from 10 to 42. In the data set reduced by deleting cases with missing data, the class sizes range from 4 to 35. The nesting structure is pupils within classes. The dependent variable is the score on a language test. Most of the analyses of this data set in this book are concerned with investigating how the language test score depends on the pupil's intelligence and his or her family's socio-economic status, and on a number of school or class variables.

Fitting the empty model yields the parameter estimates presented in Table 4.1. The deviance in this table is given for the sake of completeness and later reference; this concept is explained in Chapter 6. The estimates σ̂² = 64.57 and τ̂_0² = 19.42 yield an intraclass correlation coefficient of ρ_I = 19.42 / (19.42 + 64.57) = 0.23.

The reverse, where the between-group regression line is less steep than the within-group regression lines, can also be the case.

[Figure 4.3 Different between-group and within-group regression lines.]

If the within- and between-group regression coefficients are different, then it often is convenient to replace x_ij in (4.8) by the within-group deviation score, defined as x_ij − x̄_.j. To distinguish the corresponding parameters from those in (4.8), they are denoted by γ̃.
The resulting model is

    Y_ij = γ̃_00 + γ̃_10 (x_ij − x̄_.j) + γ̃_01 x̄_.j + U_0j + R_ij .    (4.9)

This model is statistically equivalent to model (4.8), but it has a more convenient parametrization, because the between-group regression coefficient is

    γ̃_01 = γ_10 + γ_01 ,    (4.10)

while the within-group regression coefficient is

    γ̃_10 = γ_10 .    (4.11)

The use of the within-group deviation score is called within-group centering; some computer programs for multilevel analysis have special facilities for this.

Example 4.3 Within- and between-group regressions for IQ.
We continue Example 4.2 by allowing differences between the within-group and between-group regressions of the language score on IQ. The results are displayed in Table 4.4. IQ here is the raw variable, i.e., without group-centering. In other words, the results refer to model (4.8).

Table 4.4 Estimates for random intercept model with different within- and between-group regressions.

    Fixed Effect                               Coefficient   S.E.
    γ_00 = Intercept                              40.74
    γ_10 = Coefficient of IQ                       2.415
    γ_01 = Coefficient of IQ (group mean)          1.589     0.313

    Random Effect                        Variance Component   S.E.
    Level-two variance τ_0² = var(U_0j)            7.73      1.30
    Level-one variance σ² = var(R_ij)             42.15      1.28

    Deviance

The within-group regression coefficient is 2.415 and the between-group regression coefficient is 2.415 + 1.589 = 4.004. A pupil with a given IQ obtains, on average, a higher language test score if he or she is in a class with a higher average IQ. In other words, the context effect of mean IQ gives an additional contribution over and above the effect of individual IQ. A figure for these results is qualitatively similar to Figure 4.3 in the sense that the within-group regression lines are less steep than the between-group regression line.

The table represents, with each class denoted j, a linear regression equation

    Y = 40.74 + U_0j + 2.415 IQ + 1.589 ĪQ_j ,

where ĪQ_j denotes the class mean of IQ, and U_0j is a class-dependent deviation with mean 0 and variance 7.73 (standard deviation 2.78).
The within-class deviations about this regression equation, R_ij, have a variance of 42.15 (standard deviation 6.49). Within each class, the effect (regression coefficient) of IQ is 2.415, so the regression lines are parallel. Classes differ in two ways: they may have different mean IQ values, which affects the expected results Y through the term 1.589 ĪQ_j; this is an explained difference between the classes; and they have randomly differing values U_0j, which is an unexplained difference. These two ingredients contribute to the class-dependent intercept, given by 40.74 + U_0j + 1.589 ĪQ_j.

The within-group and between-group regression coefficients would be equal if, in formula (4.8), the coefficient of average IQ would be 0, i.e., γ_01 = 0. This null hypothesis can be tested (see Section 6.1) by the t-ratio of estimate to standard error, given here by 1.589/0.313 = 5.08, a highly significant result. In other words, we may conclude that the within- and between-group regression coefficients are different indeed.

If the individual IQ variable had been replaced by within-group deviation scores IQ_ij − ĪQ_j, i.e., model (4.9) had been used, then the estimates obtained would have been γ̃_10 = 2.415 and γ̃_01 = 4.004, cf. formulae (4.10) and (4.11). Indeed, the regression equation given above can be described equivalently by

    Y = 40.74 + U_0j + 2.415 (IQ − ĪQ_j) + 4.004 ĪQ_j ,

which indicates explicitly that the within-group regression coefficient is 2.415 while the between-group regression coefficient, i.e., the coefficient of the regression of the group means Ȳ_.j on the group means ĪQ_j, is 4.004.

When interpreting the results of a multilevel analysis, it is important to keep in mind that the conceptual interpretation of within-group and between-group regression coefficients usually is completely different. These two coefficients may express quite contradictory mechanisms. This is related to the shift of meaning and the ecological fallacy discussed in Section 3.1.
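The equivalence of parametrizations (4.8) and (4.9), and the identities (4.10) and (4.11), can be checked numerically. The sketch below uses simulated data and OLS via numpy rather than ML estimation, which does not affect the reparametrization identity: it fits both parametrizations and verifies that the within-group coefficient is unchanged while the coefficient of the group mean becomes the sum of the raw coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 20
j = np.repeat(np.arange(N), n)
x = rng.normal(size=N * n) + rng.normal(size=N)[j]  # X with between-group variance
xbar = (np.bincount(j, weights=x) / n)[j]           # group mean of X for each case
y = 40.0 + 2.4 * x + 1.6 * xbar + rng.normal(size=N)[j] + rng.normal(size=N * n)

ones = np.ones(N * n)
# Raw parametrization (4.8): columns 1, x_ij, xbar_j
g, *_ = np.linalg.lstsq(np.column_stack([ones, x, xbar]), y, rcond=None)
# Within-group centered parametrization (4.9): columns 1, x_ij - xbar_j, xbar_j
gt, *_ = np.linalg.lstsq(np.column_stack([ones, x - xbar, xbar]), y, rcond=None)

print(np.allclose(gt[1], g[1]))         # within-group coefficient: (4.11) -> True
print(np.allclose(gt[2], g[1] + g[2]))  # between-group coefficient: (4.10) -> True
```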
It is the rule rather than the exception that within-group regression coefficients differ from between-group regression coefficients (although the statistical significance of this difference may be another matter, depending as it is on sample sizes, etc.).

4.6 Parameter estimation

The random intercept model (4.7) is defined by its statistical parameters: the regression parameters γ and the variance components σ² and τ_0². Note that the random effects, U_0j, are not parameters in a statistical sense, but latent (i.e., not directly observable) variables. The literature (e.g., Longford, 1993) contains two major estimation methods for estimating the statistical parameters, under the assumption that the U_0j as well as the R_ij are normally distributed: maximum likelihood (ML) and residual (or restricted) maximum likelihood (REML).

The two methods differ little with respect to estimating the regression coefficients, but they do differ with respect to estimating the variance components. A very brief indication of the difference between the two estimation methods is that the REML method estimates the variance components while taking into account the loss of degrees of freedom resulting from the estimation of the regression parameters, whereas the ML method does not take this into account. The result is that the ML estimators for the variance components have a downward bias, and the REML estimators don't. For example, the usual variance estimator for a single-level sample, in which the sum of squared deviations is divided by sample size minus 1, is a REML estimator; the corresponding ML estimator divides instead by the total sample size. The difference can be important especially when the number of groups is small. For a large number of groups (as a rule of thumb, 'large' here means larger than 30), the difference between the ML and the REML estimates is immaterial.
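The single-level analogue mentioned above, ML dividing by n versus REML dividing by n − 1, is a one-liner in numpy (the `ddof` argument sets the divisor to n − ddof):

```python
import numpy as np

y = np.array([4.0, 7.0, 6.0, 9.0, 4.0])   # a small single-level sample
ml_var = np.var(y)             # divides by n: the ML analogue, biased downward
reml_var = np.var(y, ddof=1)   # divides by n - 1: the REML analogue, unbiased
print(ml_var, reml_var)        # 3.6 4.5
```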
The literature suggests that the REML method is preferable with respect to the estimation of the variance parameters (and the covariance parameters for the more general models treated in Chapter 5). When one wishes to carry out deviance tests (see Section 6.2), it sometimes is required to use ML rather than REML estimates.²

Various algorithms are available to determine these estimates. They have names such as EM (Expectation-Maximization), Fisher scoring, IGLS (Iterative Generalized Least Squares), and RIGLS (Residual or Restricted IGLS). They are iterative, which means that a number of steps are taken in which a provisional estimate comes closer and closer to the final estimate. When all goes well, the steps converge to the ML or REML estimate. Technical details can be found, e.g., in Bryk and Raudenbush (1992), Goldstein (1995), or Longford (1988a, 1995). In principle, the algorithms all yield the same estimates for a given estimation method (ML or REML). The differences are that for some complicated models the algorithms may vary in the amount of computational problems (sometimes one algorithm may converge and the other not), and that computing time may be different. For the practical user, the differences between the algorithms are hardly worth thinking about.

An aspect of the estimation of hierarchical linear model parameters that surprises some users of this model is the fact that it is possible that the variance parameters, in model (4.7) notably parameter τ_0², can be estimated to be exactly 0. The value 0 is then also reported for the standard error. This does not mean that the data imply absolute certainty that the population value of τ_0² is equal to 0. Such an estimate can be understood as follows. For simplicity, consider the empty model, (4.6). The level-one residual variance σ² is estimated by the pooled within-group variance. The parameter τ_0² is estimated by comparing this within-group variability to the between-group variability.
The latter is determined not only by τ_0² but also by σ², since

    expected observed between-group variance = τ_0² + σ²/ñ ,    (4.12)

where ñ denotes the average group size. Note that, being a variance, τ_0² cannot be negative. This implies that, even if τ_0² = 0, a positive between-group variability is expected. If the observed between-group variability is equal to or smaller than what is expected from (4.12) in case τ_0² = 0, then the estimate τ̂_0² = 0 is reported (cf. the discussion following (3.11)).

If the group sizes n_j are variable, the larger groups will, naturally, have a larger influence on the estimates than the smaller groups. The influence of group size on the estimates is, however, mediated by the intraclass correlation coefficient. Consider, e.g., the estimation of the mean intercept, γ_00. If the residual intraclass correlation is 0, the groups have an influence on the estimated value of γ_00 that is proportional to their size. In the extreme case that the residual intraclass correlation is 1, each group has an equally large influence, independent of its size. In practice, where the residual intraclass correlation is between 0 and 1, larger groups will have a larger influence, but less than proportionately.

² When models are compared with different fixed parts, deviance tests should be based on ML estimation. Deviance tests with REML estimates may be used for comparing models with different random parts and the same fixed part. Different random parts will be treated in the next chapter.

4.7 'Estimating' random group effects: posterior means

The random group effects U_0j are latent variables rather than statistical parameters, and therefore are not estimated as an integral part of the statistical parameter estimation. However, there can be many reasons why it can nevertheless be desirable to 'estimate' them. This can be done by a method known as empirical Bayes estimation, which produces so-called posterior means; see, e.g., Efron and Morris (1975).
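The truncation at zero can be illustrated with a moment-based sketch (this is not the ML estimator, just the comparison of within- and between-group variability described above). The data are constructed so that the group means vary less than chance alone would produce, so the estimate is truncated to exactly 0:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 30, 5
# Construct data whose group means are all exactly equal: the observed
# between-group variability then falls below what (4.12) predicts for tau_0^2 = 0
y = rng.normal(size=(N, n))
y = y - y.mean(axis=1, keepdims=True) + 5.0   # force every group mean to 5.0

s2_within = y.var(axis=1, ddof=1).mean()      # pooled within-group estimate of sigma^2
s2_between = y.mean(axis=1).var(ddof=1)       # observed between-group variance

tau2_hat = max(0.0, s2_between - s2_within / n)  # negative moment estimate -> report 0
print(tau2_hat)   # 0.0
```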
The basic idea of this method is that U_0j is 'estimated' by combining two kinds of information:

(1) the data from group j;
(2) the fact (or, rather, the model assumption) that the unobserved U_0j is a random variable just like all other random group effects, and therefore has a normal distribution with mean 0 and variance τ_0².

In other words, data information is combined with population information. The formula is given here only for the empty model, i.e., the model without explanatory variables. The idea for more complicated models is analogous; formulae can be found in the literature, e.g., Longford (1993, Section 2.10).

The empty model was formulated in (4.6) as

    Y_ij = β_0j + R_ij = γ_00 + U_0j + R_ij .

Since γ_00 is already an estimated parameter, an estimate for β_0j will be the same as an estimate for U_0j plus γ̂_00. Therefore, estimating β_0j and estimating U_0j are equivalent problems, given that an estimate for γ_00 is available. If we used only the data from group j, β_0j would be estimated by the group mean, which is also the OLS estimate,

    β̂_0j = Ȳ_.j .    (4.13)

If we looked only at the population, we would estimate β_0j by its population mean, γ_00. This parameter is estimated by the overall mean,

    γ̂_00 = Ȳ_.. = (1/M) Σ_j Σ_i Y_ij ,

where M = Σ_j n_j denotes the total sample size.
Another possibility is to combine the information from group j with the population information. The optimal combined 'estimate'¹ for β_0j is a weighted average of the two previous estimates:

    β̂^EB_0j = λ_j Ȳ_.j + (1 − λ_j) γ̂_00 ,    (4.14)

where EB stands for 'empirical Bayes' and the weight λ_j is defined as the reliability of the mean of group j (see equation (3.21)),

    λ_j = τ_0² / (τ_0² + σ²/n_j) .

The ratio of the two weights, λ_j / (1 − λ_j), is the ratio of true variance τ_0² to error variance σ²/n_j. In practice we do not know the true values of the parameters σ² and τ_0², and we substitute estimated values to calculate (4.14).

Formula (4.14) is called the posterior mean, or the empirical Bayes estimate, for β_0j. This term comes from Bayesian statistics. It refers to the distinction between the prior knowledge about the group effects, which is based only on the population from which they are drawn, and the posterior knowledge which is based also on the observations made about this group. There is an important parallel between random coefficient models and Bayesian statistical models, because the random coefficients used in the hierarchical linear model are analogous to the random parameters that are essential in the Bayesian statistical paradigm. This empirical Bayes estimate is treated from a Bayesian standpoint, e.g., in Press (1989, p. 43).

Formula (4.14) can be regarded as follows: the OLS estimate (4.13) for group j is pushed a bit toward the general mean γ̂_00. This is an example of shrinkage to the mean, just as is being used, e.g., in psychometrics, for the estimation of true scores.

¹ The word 'estimate' is put between quotation marks because the proper statistical term for finding likely values of the U_0j, these being random variables, is prediction. The term estimation is reserved for finding likely values of statistical parameters. Since prediction is associated in everyday speech, however, with determining something about the future, we prefer to speak here about 'estimation' between quotation marks.
The corresponding estimator sometimes is called the Kelley estimator; see, e.g., Kelley (1927), Lord and Novick (1968), or other textbooks on classical psychological test theory. From the definition of the weight λ_j it is apparent that the influence of the data of group j itself becomes larger as group size n_j becomes larger. For large groups, the posterior mean is practically equal to β̂_0j, the intercept that would be estimated from the data on group j alone.

In principle, the OLS estimate (4.13) and the empirical Bayes estimate (4.14) both are sensible procedures for estimating the mean of group j. The former does not need the assumption that group j is a random element from the population of groups, and is an unbiased estimate.³ The latter is biased toward the population mean, but for a randomly drawn group it has a smaller mean squared error. The squared error, averaged over all groups, will be smaller for the empirical Bayes estimate, but the price is a conservative (drawn to the average) appraisal of the groups with truly very high or very low values of β_0j. The estimation variance of the empirical Bayes estimate is

    var(β̂_0j^EB − β_0j) = (1 − λ_j) τ_0²        (4.15)

if the uncertainty due to the estimation of γ_00 (which is of secondary importance anyway) is neglected. This formula also is well known from classical psychological test theory (e.g., Lord and Novick, 1968).

³ This means that the average of many (hypothetical) independent replications of this estimate for this particular group j would be very close to the true value β_0j.

The same principle can be applied (but with more complicated formulae) to the 'estimation' of the group-dependent intercept β_0j = γ_00 + U_0j in random intercept models that do include explanatory variables, such as (4.7). This intercept can be 'estimated' again by γ̂_00 plus the posterior mean of U_0j, and is then also referred to as the posterior intercept.
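The claim that the empirical Bayes estimate has a smaller mean squared error than the OLS group mean, averaged over randomly drawn groups, can be checked with a small simulation. The following sketch (Python; the parameter values are illustrative choices, not taken from the text, and γ_00 is treated as known, in line with neglecting its estimation error):

```python
import random
import statistics

def eb_estimate(ybar_j, n_j, grand_mean, tau2, sigma2):
    # Empirical Bayes estimate (4.14) with reliability weight lambda_j
    lam = tau2 / (tau2 + sigma2 / n_j)
    return lam * ybar_j + (1 - lam) * grand_mean

random.seed(1)
gamma00, tau2, sigma2, n_j = 50.0, 4.0, 25.0, 10  # illustrative values
sq_err_ols, sq_err_eb = 0.0, 0.0
n_groups = 2000
for _ in range(n_groups):
    beta0j = gamma00 + random.gauss(0.0, tau2 ** 0.5)        # true group mean
    sample = [random.gauss(beta0j, sigma2 ** 0.5) for _ in range(n_j)]
    ybar = statistics.fmean(sample)                          # OLS estimate (4.13)
    eb = eb_estimate(ybar, n_j, gamma00, tau2, sigma2)
    sq_err_ols += (ybar - beta0j) ** 2
    sq_err_eb += (eb - beta0j) ** 2

mse_ols = sq_err_ols / n_groups   # approx. sigma^2 / n_j = 2.5
mse_eb = sq_err_eb / n_groups     # approx. (1 - lambda_j) * tau0^2, about 1.54
```

Over the population of groups the shrinkage estimator wins, even though for any single extreme group it is biased toward γ_00.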
Instead of being primarily interested in the intercept as defined by γ_00 + U_0j, which is the value of the regression equation when all explanatory variables have the value 0, one may also be interested in the value of the regression line for group j for the case where only the level-one variables x_1ij to x_pij are 0, while the level-two variables have the values proper to this group. To 'estimate' this version of the intercept of group j, we use

    γ̂_00 + γ̂_01 z_1j + ... + γ̂_0q z_qj + Û_0j^EB ,        (4.16)

where the values γ̂ indicate the (ML or REML) estimates of the regression coefficients. These values also are sometimes called posterior intercepts.

The posterior means (4.14) can be used, e.g., to see which groups have unexpectedly high or low values on the outcome variable, given their values on the explanatory variables. They can also be used in a residual analysis, for checking the assumption of normality for the random group effects, and for detecting outliers, cf. Chapter 9. The posterior intercepts (4.16) indicate the total main effect of group j, controlling for the level-one variables X_1 to X_p, but including the effects of the level-two variables Z_1 to Z_q. For example, in a study of pupils in schools where the dependent variable is a relevant indicator of scholastic performance, these posterior intercepts could be valuable information for the parents, indicating the contribution of the various schools to the performance of their beloved children.

Example 4.4 Posterior means for random data.
We can illustrate the 'estimation' procedure by returning to the random digits table (Table 3.1 in Chapter 3). Macro-unit 04 in that table has an average of Ȳ_.j = 31.5 over its 10 random digits. The grand mean of the total 100 random digits is Ȳ_.. = 47.2. The average of macro-unit 04 thus seems to be far below the grand mean.
But the reliability of this mean is only

    λ̂_j = 26.7 / {26.7 + (769.7/10)} = 0.26 .

Applying (4.14), the posterior mean is calculated as

    0.26 × 31.5 + (1 − 0.26) × 47.2 ≈ 43.2 .

In words: the posterior mean for macro-unit 04 is determined for 74 percent (i.e., 1 − λ̂_j) by the grand mean of 47.2 and for only 26 percent (i.e., λ̂_j) by its OLS mean of 31.5. The shrinkage to the grand mean is evident. Because of the low estimated intraclass correlation of ρ̂_I = 0.03 and the low number of observations per macro-unit, n_j = 10, the empirical Bayes estimate of the average of macro-unit 04 is closer to the grand mean than to the group mean.

4.7.1 Posterior confidence intervals
Now suppose that parents have to choose a school for their children, and that they wish to do so on the basis of the value a school adds to the abilities that pupils already have when entering the school (as indicated by an IQ test). Let us focus on the language scores. 'Good' schools then are schools where pupils on average are 'over-achievers', that is to say, they achieve more than expected on the basis of their IQ. 'Poor' schools are schools where pupils on average have language scores that are lower than one would expect given their IQ scores. In this case the level-two residuals U_0j from a two-level model with language as the dependent and IQ as the predictor variable convey the relevant information. But remember that each U_0j has to be estimated from the data, and that there is sampling error associated with each residual, since we work with a sample of students from each school. Of course we might argue that within each school the entire population of students is studied, but in general we should handle each parameter estimate with its associated uncertainty, since we are now considering the performance of the school for a hypothetical new pupil at this school.
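The arithmetic of Example 4.4 is easily reproduced (a Python sketch; the numbers are the estimates quoted in the example):

```python
tau2_hat = 26.7     # estimated between-group variance
sigma2_hat = 769.7  # estimated within-group variance
n_j = 10            # digits per macro-unit
ybar_04 = 31.5      # OLS mean of macro-unit 04
grand_mean = 47.2   # overall mean of the 100 digits

# reliability of the group mean, cf. (3.21)
lam = tau2_hat / (tau2_hat + sigma2_hat / n_j)
# posterior mean (4.14): shrink the group mean toward the grand mean
posterior = lam * ybar_04 + (1 - lam) * grand_mean
print(round(lam, 2), round(posterior, 1))  # → 0.26 43.2
```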
Therefore, instead of simply comparing schools on the basis of the level-two residuals, it is better to compare these residuals taking account of the associated confidence intervals. The standard error of the empirical Bayes estimate is smaller than the standard error of the OLS estimate based only on the data for the given macro-unit (the given school, in our example). This is just the point of using the empirical Bayes estimate. For the empty model, the standard error is the square root of (4.15), which can also be expressed as

    S.E.(β̂_0j^EB) = sqrt( 1 / (n_j/σ² + 1/τ_0²) ) .        (4.17)

This formula also was given by Longford (1993a, Section 1.7). Thus, the standard error depends on the within-group as well as the between-group variance and on the number of sampled pupils for the school. For models with explanatory variables, the standard error can be obtained from the computer output of multilevel software (see Chapter 15). Denoting the standard error for school j shortly by S.E._j, the corresponding ninety percent confidence intervals can be calculated as the intervals

    ( β̂_0j^EB − 1.64 × S.E._j , β̂_0j^EB + 1.64 × S.E._j ) .

Two cautionary remarks are in order, however. In the first place, the shrinkage construction of the empirical Bayes estimates implies a bias: 'good' schools (with a high U_0j) will tend to be represented too negatively, and 'poor' schools (with a low U_0j) will tend to be represented too positively (especially if the sample sizes are small). The smaller standard error is bought at the expense of this bias! These confidence intervals have the property that, on average, the random group effects U_0j will be included in the confidence interval for ninety percent of the groups. But for close to average groups the coverage probability is higher, while for the groups with very low or very high group effects the coverage probability will be lower than ninety percent.
In the second place, users of such information generally wish to compare a series of groups. This problem was addressed by Goldstein and Healy (1995). The parent in our example will make her or his own selection of schools, and (if the parent is a trained statistician) will compare the schools on the basis of whether the confidence intervals overlap. In that case, the parent is implicitly performing a series of statistical tests on the differences between the group effects U_0j. Goldstein and Healy (1995, p. 175) write: 'It is a common statistical misconception to suppose that two quantities whose 95% confidence intervals just fail to overlap are significantly different at the 5% significance level.' The reader is referred to their article on how to adjust the width of the confidence intervals in order to perform such significance testing. For example, testing the equality of a series of level-two residuals at the five percent significance level requires confidence intervals that are constructed by multiplying the standard error given above by 1.39, rather than the well-known five percent value of 1.96. For a ten percent significance level, the factor is 1.24 rather than 1.64. So the 'comparative confidence intervals' are allowed to be narrower than the confidence intervals used for assessing single groups.

Example 4.5 Comparing added value of schools.
Table 4.2 presents the multilevel model where language scores are controlled for IQ. The posterior means β̂_0j^EB, which can be interpreted as the estimated value added, are graphically presented in Figure 4.4. The figure also presents the confidence intervals for testing the equality of any pair of residuals at a significance level of five percent.
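The mechanics of the single-group and comparative intervals can be sketched as follows (Python; the variance estimates are those of the random digits example, used here only to show the computation):

```python
import math

def se_eb_empty(tau2, sigma2, n_j):
    # standard error (4.17) of the empirical Bayes estimate, empty model;
    # algebraically equal to sqrt((1 - lambda_j) * tau2), cf. (4.15)
    return 1.0 / math.sqrt(n_j / sigma2 + 1.0 / tau2)

def interval(estimate, se, factor):
    # factor 1.64: 90% interval for assessing a single group;
    # factor 1.39: 'comparative' interval for pairwise comparisons at the 5% level
    return (estimate - factor * se, estimate + factor * se)

def overlap(a, b):
    # two intervals overlap iff neither lies wholly left of the other
    return a[0] <= b[1] and b[0] <= a[1]

se = se_eb_empty(26.7, 769.7, 10)
single = interval(43.2, se, 1.64)
comparative = interval(43.2, se, 1.39)
```

The comparative intervals are narrower, so two groups can differ significantly even when their ordinary 90 percent intervals overlap.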
For convenience the schools are ordered by the size of their posterior mean.

[Figure 4.4 The added value scores for 131 schools, with comparative posterior confidence intervals.]

Note that approximately 30 schools have confidence intervals that overlap the confidence interval of the best school in this sample, implying that their value-added scores do not differ significantly from it. At the lower extreme, the situation is similar.

Example 4.6 Posterior confidence intervals for random data.
Let us look once more at the random digits example, with its 10 macro-units of 10 random digits each. Figure 4.5 presents, for each macro-unit, the OLS estimate (the group mean) and the posterior mean with its comparative confidence interval.

[Figure 4.5 OLS means (×) and posterior means (•), with comparative posterior confidence intervals.]

Once again we clearly observe the shrinkage, since the OLS means (×) are further apart than the posterior means (•). Further, as we would expect from a random digits example, none of the pairwise comparisons results in any significant differences between the macro-units, since all 10 confidence intervals overlap.

4.8 Three-level random intercept models
The three-level random intercept model is a straightforward extension of the two-level model. In the preceding examples, data were used on students nested within schools. The actual hierarchical structure of educational data is, however: students nested within classes nested within schools. Other examples are: siblings within families within neighborhoods, or people within regions within states. Less obvious examples are: students within cohorts within schools, or longitudinal measurements within persons within groups. These latter cases will be illustrated in Chapter 12 on longitudinal data. For the time being we concentrate on 'simple' three-level hierarchical data structures. The dependent variable now is denoted by Y_ijk, referring to pupil i in class j in school k. More generally, one can speak about level-one unit i in level-two unit j in level-three unit k.
The three-level model for such data with one explanatory variable may be formulated as a regression model

    Y_ijk = β_0jk + β_1 x_ijk + R_ijk ,        (4.18)

where β_0jk is the intercept in level-two unit j within level-three unit k. For the intercept we have the level-two model,

    β_0jk = δ_00k + U_0jk ,        (4.19)

where δ_00k is the average intercept in level-three unit k. For this average intercept we have the level-three model,

    δ_00k = γ_000 + V_00k .        (4.20)

This shows that there are now three residuals, as there is variability on three levels. Their variances are denoted by

    var(R_ijk) = σ² ,   var(U_0jk) = τ² ,   var(V_00k) = φ² .        (4.21)

The total variance between all level-one units now equals σ² + τ² + φ², and the population variance between the level-two units is τ² + φ². Substituting (4.20) and (4.19) into the level-one model (4.18), and using (in view of the next chapter) the triple indexing notation γ_100 for the regression coefficient β_1, yields

    Y_ijk = γ_000 + γ_100 x_ijk + V_00k + U_0jk + R_ijk .        (4.22)

Example 4.7 A three-level model: students in classes in schools.
For this example we use a dataset on 3,792 students in 280 classes in 57 secondary schools with complete data (see Opdenakker and Van Damme, 1997). At school entrance, students were administered tests on IQ, mathematics ability, and achievement motivation, and furthermore data were collected on the educational level of the father and the students' gender. The response variable is the score on a mathematics test administered at the end of the second grade of the secondary school (when the students were approximately 14 years old). Table 4.5 contains the results of the analysis of the empty three-level model (Model 1) and a model with a fixed effect of the students' intelligence.

Table 4.5 Estimates for three-level models.

                                    Model 1              Model 2
Fixed Effect                    Coefficient  S.E.    Coefficient  S.E.
Intercept                            …        …           …        …
Coefficient of IQ                                       0.121    0.005

Random Effects                  Var. Comp.   S.E.    Var. Comp.   S.E.
Level-three variance: var(V_00k)    2.124     …           …        …
Level-two variance:   var(U_0jk)    1.746     …           …        …
Level-one variance:   var(R_ijk)    7.816     …           …        …
Deviance                              …                   …

The total variance is 11.686, the sum of the three variance components. Since this is a three-level model, there are several kinds of intraclass correlation coefficient. Of the total variance, 2.124/11.686 = 0.18, i.e., 18 percent, is situated at the school level, while (2.124 + 1.746)/11.686 = 0.33, i.e., 33 percent, is situated at the class and school level. The level-three intraclass correlation, expressing the likeness of students in the same schools, thus is estimated to be 0.18, while the intraclass correlation expressing the likeness of students in the same classes and the same schools is estimated to be 0.33. In addition, one can also estimate the intraclass correlation that expresses the likeness of classes in the same schools. This level-two intraclass correlation is estimated to be 2.124/(2.124 + 1.746) = 0.55. This is more than 0.5: the school level contributes slightly more to variability than the class level. The interpretation is that if one randomly takes two classes within one school and calculates the average mathematics achievement level in one of the two, one can predict reasonably accurately the average achievement level in the other class. Of course we could have estimated a two-level model as well, ignoring the class level, but that would have led to a redistribution of the class-level variance to the two other levels, and it would affect the validity of hypothesis tests for added fixed effects.

Model 2 shows that the fixed effect of IQ is very strong, with a t-ratio (cf. Section 6.1) of 0.121/0.005 = 24.2. (The intercept changes drastically because the IQ score does not have a zero mean; the conventional IQ scale, with a population mean of 100, was used.) Adding the effect of IQ leads to a stronger decrease in the class- and school-level variances than in the student-level variance.
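The various intraclass correlations follow directly from the three variance components; a quick numerical check (Python; 7.816 is the student-level component implied by the stated total of 11.686):

```python
phi2 = 2.124    # school-level variance component, var(V_00k)
tau2 = 1.746    # class-level variance component, var(U_0jk)
sigma2 = 7.816  # student-level variance component, var(R_ijk)

total = phi2 + tau2 + sigma2                    # total variance, 11.686
icc_school = phi2 / total                       # pupils in the same school
icc_class = (phi2 + tau2) / total               # pupils in the same class (and school)
icc_class_within_school = phi2 / (phi2 + tau2)  # classes in the same school
```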
This suggests that schools and classes are rather homogeneous with respect to IQ, and/or that intelligence may play its role partly at the school and class levels. As in the two-level model, predictor variables at any of the three levels can be added. All features of the two-level model can be generalized to the three-level model quite straightforwardly: significance testing, model building, testing the model fit, centering of variables, etc., although the researcher should be more careful now because of the more complicated formulation. For example, for a level-one explanatory variable there can be three kinds of regressions. In the school example, these are the within-class regression, the within-school/between-class regression, and the between-school regression. Coefficients for these distinct regressions can be obtained by using the class means as well as the school means as explanatory variables with fixed effects.

Example 4.8 Within-class, between-class, and between-school regressions.
Continuing the previous example, we now investigate whether indeed the effect of IQ is in part a class-level or school-level effect; in other words, whether the within-class, between-class/within-school, and between-school regressions are different. Table 4.6 presents the results.

In Model 3, the effects of the class mean, IQ̄_.jk, as well as the school mean, IQ̄_..k, have been added. The class mean has a clearly significant effect (t = 0.106/0.013 = 8.15), which indicates that between-class regressions are different from within-class regressions. The school mean does not have a significant effect (t = 0.039/0.028 = 1.39), so there is no evidence that the between-school regressions are different from the between-class regressions. It can be concluded that the composition with respect to intelligence plays a role at the class level, but not at the school level.

Table 4.6 Estimates for three-level models with distinct within-class, within-school, and between-school regressions.
                                       Model 3              Model 4
Fixed Effect                       Coefficient  S.E.    Coefficient  S.E.
Intercept                               …        …           …        …
Coefficient of IQ                      0.107    0.005
Coefficient of IQ − IQ̄_.jk                                 0.107      …
Coefficient of IQ̄_.jk                  0.106    0.013
Coefficient of IQ̄_.jk − IQ̄_..k                             0.212      …
Coefficient of IQ̄_..k                  0.039    0.028      0.252      …

Random Effects                     Var. Comp.   S.E.    Var. Comp.   S.E.
Level-three variance: var(V_00k)        …        …           …        …
Level-two variance:   var(U_0jk)       0.453    0.089        …        …
Level-one variance:   var(R_ijk)        …        …           …        …
Deviance                                …                    …

As in Section 4.5, replacing the variables by the deviation scores leads to an equivalent model formulation in which, however, the within-class, between-class, and between-school regression coefficients are given directly by the fixed parameters. In the three-level case, this means that we must use the following three variables:
IQ_ijk − IQ̄_.jk, the within-class deviation score of the student from the class mean;
IQ̄_.jk − IQ̄_..k, the within-school deviation score of the class mean from the school mean;
IQ̄_..k, the school mean itself.
The results are shown as Model 4. We see here that the within-class regression coefficient is 0.107, equal to the coefficient of student-level IQ in Model 3; the between-class within-school regression coefficient is 0.212, equal (up to rounding errors) to the sum of the student-level and the class-level coefficients in Model 3; while the between-school regression coefficient is 0.252, equal to the sum of all three coefficients in Model 3. From Model 3 we know that the difference between the last two coefficients is not significant.

5 The Hierarchical Linear Model

In the previous chapter the simpler case of the hierarchical linear model was treated, where only intercepts are assumed to be random. In the more general case, slopes may also be random. For a study of pupils within schools, e.g., the effect of the pupil's intelligence or socio-economic status on scholastic performance could differ between schools. This chapter presents the general hierarchical linear model, which allows intercepts as well as slopes to vary randomly.
The chapter follows the approach of the previous one: most attention is paid to the case of a two-level nesting structure, and the level-one units are called – for convenience only – 'individuals', while the level-two units are called 'groups'. The notation is also the same.

5.1 Random slopes
In the random intercept model of Chapter 4, the groups differ with respect to the average value of the dependent variable: the only random group effect is the random intercept. But the relation between explanatory and dependent variables can differ between groups in more ways. For example, in the educational field (nesting structure: pupils within classrooms), it is possible that the effect of socio-economic status of pupils on their scholastic achievement is stronger in some classrooms than in others. As an example in developmental psychology (repeated measurements within individual subjects), it is possible that some subjects progress faster than others. In the analysis of covariance, this phenomenon is known as heterogeneity of regressions across groups, or as group-by-covariate interaction. In the hierarchical linear model, it is modeled by random slopes.

Let us go back to a model with group-specific regressions of Y on one level-one variable X only, like model (4.3) but without the effect of Z,

    Y_ij = β_0j + β_1j x_ij + R_ij .        (5.1)

The intercepts β_0j as well as the regression coefficients, or slopes, β_1j are group-dependent. These group-dependent coefficients can be split into an average coefficient and the group-dependent deviation:

    β_0j = γ_00 + U_0j ,
    β_1j = γ_10 + U_1j .        (5.2)

Substitution leads to the model

    Y_ij = γ_00 + γ_10 x_ij + U_0j + U_1j x_ij + R_ij .        (5.3)

It is assumed here that the level-two residuals U_0j and U_1j, as well as the level-one residuals R_ij, have means 0, given the values of the explanatory variable X. Thus, γ_10 is the average regression coefficient, just like γ_00 is the average intercept. The first part of (5.3), γ_00 + γ_10 x_ij, is called the fixed part of the model.
The second part, U_0j + U_1j x_ij + R_ij, is called the random part. The term U_1j x_ij can be regarded as a random interaction between group and X. This model implies that the groups are characterized by two random effects: their intercept and their slope. We say that X has a random slope, or a random effect, or a random coefficient. These two group effects will usually not be independent, but correlated. It is assumed that, for different groups, the pairs of random effects (U_0j, U_1j) are independent and identically distributed, that they are independent of the level-one residuals R_ij, and that all level-one residuals are independent and identically distributed. The variance of the level-one residuals R_ij is again denoted σ²; the variances and covariance of the level-two residuals (U_0j, U_1j) are denoted as follows:

    var(U_0j) = τ_00 = τ_0² ,
    var(U_1j) = τ_11 = τ_1² ,        (5.4)
    cov(U_0j, U_1j) = τ_01 .

Just like in the preceding chapter, one can say that the unexplained group effects are assumed to be exchangeable.

5.1.1 Heteroscedasticity
Model (5.3) implies not only that individuals within the same group have correlated Y-values (recall the residual intraclass correlation coefficient of Chapter 4), but also that this correlation, as well as the variance of Y, depends on the value of X. In an example, this can be understood as follows. Suppose that, in a study of the effect of socio-economic status (SES) on scholastic performance (Y), we have schools which do not differ in their effect on high-SES children, but which do differ in the effect of SES on Y (e.g., because of teacher expectancy effects). Then for children from a high-SES background it does not matter which school they go to, but for children from a low-SES background it does.
The school then adds a component of variance for the low-SES children, but not for the high-SES children: as a consequence, the variance of Y (for a random child at a random school) will be larger for the former than for the latter children. Further, the within-school correlation between high-SES children will be nil, whereas between low-SES children it will be positive.

This example shows that model (5.3) implies that the variance of Y, given the value x on X, depends on x. This is called heteroscedasticity in the statistical literature. An expression for the variance of (5.3) is obtained as the sum of the variances of the random variables involved plus a term depending on the covariance between U_0j and U_1j (the other random variables are uncorrelated). Using (5.3) and (5.4), the result is

    var(Y_ij | x_ij) = τ_0² + 2 τ_01 x_ij + τ_1² x_ij² + σ²        (5.5)

and, for two different individuals (i and i′, with i ≠ i′) in the same group,

    cov(Y_ij, Y_i′j | x_ij, x_i′j) = τ_0² + τ_01 (x_ij + x_i′j) + τ_1² x_ij x_i′j .        (5.6)

Formula (5.5) implies that the residual variance of Y is minimal for x_ij = −τ_01/τ_1². (This is found by differentiation with respect to x_ij.) When this value is within the range of possible X-values, the residual variance first decreases and then increases again; if this value is smaller than all X-values, then the residual variance is an increasing function of x; if it is larger than all X-values, then the residual variance is decreasing.

5.1.2 Don't force τ_01 to be 0!
The preceding discussion implies that the group effects depend on x: according to (5.3), this effect is given by U_0j + U_1j x. This is illustrated by Figure 5.1, which gives a hypothetical graph of the regression of school achievement (Y) on intelligence (X) in three schools.

[Figure 5.1 Different vertical axes: y(1), y(2), y(3).]

It is clear that there are slope differences between the three schools. Looking at the y(1)-axis, there are almost no intercept differences between the schools. But if we add the value 10 to each intelligence score x, then the y-axis is shifted to the left by 10 units: the y(2)-axis. Now school 1 is the best school: there are strong intercept differences.
If we would have subtracted 10 from the x-scores, we would have obtained the y(3)-axis, with again intercept differences but now in reverse order. This implies that the intercept variance τ_0², and also the intercept-by-slope covariance τ_01, depend on the origin (0-value) of the X-variable. From this we can learn two things:
(1) Since the origin of most variables in the social sciences is arbitrary, in a random slope model the intercept-slope covariance τ_01 should be a free parameter estimated from the data, and not a priori constrained to the value 0 (i.e., left out of the model).
(2) In random slope models we should be careful with the interpretation of the intercept variance and the intercept-by-slope covariance, since the intercept refers to an individual with x = 0. For the interpretation of these parameters it is helpful when the scale for X is defined so that x = 0 has an interpretable meaning, preferably as a reference situation. For example, in repeated measurements where X refers to time, or measurement number, x = 0 could correspond to the start, or the end, of the measurements. In nesting structures of individuals within groups, it is often convenient to let x = 0 correspond to the overall mean of the population or the sample – e.g., if X is IQ on the conventional scale with mean 100, it is advisable to subtract 100 to obtain a population mean of 0.

5.1.3 Interpretation of random slope variances
For the interpretation of the variance of the random slopes, τ_1², it is illuminating to take also the average slope, γ_10, into consideration. Model (5.3) implies that the regression coefficient, or slope, for group j is γ_10 + U_1j. This is a normally distributed random variable with mean γ_10 and standard deviation τ_1 = √τ_1². Since about 95 percent of the probability of a normal distribution is within two standard deviations from the mean, it follows that approximately 95 percent of the groups have slopes between γ_10 − 2τ_1 and γ_10 + 2τ_1.
Conversely, about one in forty groups has a slope less than γ_10 − 2τ_1, and one in forty has a slope steeper than γ_10 + 2τ_1.

Example 5.1 A random slope for IQ.
We continue the examples of Chapter 4, where the effect of IQ on a language test score was studied. Recall that IQ is here on a scale with mean 0 and standard deviation about 2. A random slope for IQ is added to the model:

    Y_ij = γ_00 + γ_10 IQ_ij + γ_01 IQ̄_.j + U_0j + U_1j IQ_ij + R_ij .

The results can be read from Table 5.1. Note that the heading 'Level-two random effects' refers to the random intercept and random slope, which are random effects associated with the level-two units (the classes), but that the variable that has the random slope, IQ, is itself a level-one variable.

Figure 5.2 presents a sample of fifteen regression lines, randomly chosen according to the model of Table 5.1. (The values of the group mean of IQ were chosen randomly from a normal distribution with mean 0.127 and standard deviation 1.005, which are the mean and standard deviation of the group mean of IQ in this dataset.) This figure thus demonstrates the population of regression lines that characterizes, according to this model, the population of schools.

Table 5.1 Estimates for random slope model.

Fixed Effect                               Coefficient  S.E.
Intercept                                     40.75      …
Coefficient of IQ                              2.45      …
Coefficient of IQ̄_.j (group mean)              1.405     …

Random Effects                          Variance Component  S.E.
Level-two random effects:
  τ_0² = var(U_0j)                             7.92       …
  τ_1² = var(U_1j)                             0.20       …
  τ_01 = cov(U_0j, U_1j)                      −0.82       …
Level-one variance:
  σ² = var(R_ij)                              41.35       …
Deviance                                        …

[Figure 5.2 Fifteen random regression lines according to the model of Table 5.1 (with randomly chosen intercepts and slopes).]

The estimated random slope variance is τ̂_1² = 0.20, corresponding to a standard deviation of 0.45. The estimated average slope is γ̂_10 = 2.45. Hence approximately 95 percent of the classes have a slope between 2.45 − 2 × 0.45 = 1.55 and 2.45 + 2 × 0.45 = 3.35, so the IQ slope may be assumed to be positive in all classes. The estimated covariance between random slope and random intercept is τ̂_01 = −0.82, corresponding to a correlation of −0.82/√(7.92 × 0.20) = −0.65.
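The interpretation of the estimates in Table 5.1 can be verified numerically (a Python sketch using only the point estimates; standard errors are ignored):

```python
import math

gamma10 = 2.45   # average IQ slope
tau1sq = 0.20    # slope variance, var(U_1j)
tau0sq = 7.92    # intercept variance, var(U_0j)
tau01 = -0.82    # intercept-slope covariance

tau1 = math.sqrt(tau1sq)                  # s.d. of the slopes, about 0.45
slope_low = gamma10 - 2 * tau1            # about 95% of the classes have
slope_high = gamma10 + 2 * tau1           # a slope between these two values
rho = tau01 / math.sqrt(tau0sq * tau1sq)  # intercept-slope correlation
```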
Recall that all variables are centered (i.e., they have zero means), so that the intercept corresponds to the language test score for a pupil with average intelligence in a class with an average mean intelligence. This negative correlation between slope and intercept means that classes with a higher performance for pupils of average intelligence have a lower within-class effect of intelligence. Thus, the higher average performance tends to be achieved more by higher language scores of the less intelligent than by higher scores of the more intelligent students.

In a random slope model, the within-group coherence can no longer be summarized by a single intraclass correlation coefficient. The reason is that, in terms of the present example, the correlation between pupils in the same class depends on their intelligence. Thus, the extent to which a given classroom deserves to be called 'good' varies across pupils. To investigate how the contribution of classrooms to pupils' performance depends on IQ, consider the equation implied by the parameter estimates,

    Ŷ_ij = 40.75 + 2.450 IQ_ij + 1.405 IQ̄_.j + U_0j + U_1j IQ_ij + R_ij .

Recall from Example 4.2 that the standard deviation of the IQ score is about 2, and the mean is 0. Hence pupils with an intelligence among the bottom few percent or the top few percent have IQ scores of about ∓4. Substituting these values in the contribution of the random effects gives U_0j ∓ 4 U_1j. It follows from equations (5.5) and (5.6) that for pupils with IQ = ∓4 we have

    var(Y_ij | IQ_ij = −4) = 7.92 + 2 × (−0.820) × (−4) + (−4)² × 0.200 + 41.35 = 59.03 ,
    cov(Y_ij, Y_i′j | IQ_ij = −4, IQ_i′j = +4) = 7.92 − 16 × 0.200 = 4.72 ,
    var(Y_ij | IQ_ij = +4) = 7.92 − 8 × 0.820 + 16 × 0.200 + 41.35 = 45.91 ,

and therefore

    ρ(Y_ij, Y_i′j | IQ_ij = −4, IQ_i′j = +4) = 4.72 / √(59.03 × 45.91) = 0.09 .

Hence, the language test scores of the most intelligent and the least intelligent pupils in the same class are positively correlated over the population of classes: classes that have relatively good results for the less able tend also to have relatively good results for the more able students. This positive correlation corresponds to the result that the value of x for which the variance given by (5.5) is minimal lies outside the range from −4 to +4. For the estimates in Table 5.1, this variance is

    var(Y_ij | IQ_ij = x) = 7.92 − 1.64 x + 0.2 x² + 41.35 ,

which is minimal for x = 1.64/0.4 = 4.1, just outside the IQ range from −4 to +4. This again implies that classes tend mostly to perform consistently higher, or consistently lower, over the entire range of IQ. This is illustrated also by Figure 5.2 (which, however, also contains some regression lines that cross each other within the range of IQ).

5.2 Explanation of random intercepts and slopes
Regression analysis aims at explaining variability in the outcome (i.e., dependent) variable. Explanation is understood here in a quite limited way, viz., as being able to predict the value of the dependent variable from knowledge of the values of the explanatory variables. The unexplained variability in single-level multiple regression analysis is just the variance of the residual term. Variability in multilevel data, however, has a more complicated structure. This is related to the fact, mentioned in the preceding chapter, that several populations are involved in multilevel modeling: one population for each level.
Explaining variability in a multilevel structure can be achieved by explaining variability between individuals, but also by explaining variability between groups; if there are random slopes as well as random intercepts, at the group level one could try to explain the variability of slopes as well as of intercepts.

In the model defined by (5.1) – (5.3), some variability in Y is explained by the regression on X, i.e., by the term γ_10 x_ij; the random coefficients U_0j, U_1j, and R_ij each express different parts of the unexplained variability. In order to try to explain more of the unexplained variability, all three of these can be points of attack. In the first place, one can try to find explanations in the population of individuals (at level one). The part of the residual variance that is expressed by σ² = var(R_ij) can be diminished by including other level-one variables. Since group compositions with respect to level-one variables can differ from group to group, the inclusion of such variables may also diminish residual variance at the group level. A second possibility is to try to find explanations in the population of groups (at level two). If we wish to reduce the unexplained variability associated with U_0j and U_1j, we can also say that we wish to expand equations (5.2) by predicting the group-dependent regression coefficients β_0j and β_1j from level-two variables Z. Supposing for the moment that we have one such variable, this leads to regression formulae for β_0j and β_1j on the variable Z:

    β_0j = γ_00 + γ_01 z_j + U_0j ,        (5.7)
    β_1j = γ_10 + γ_11 z_j + U_1j .        (5.8)

In words, the βs are treated as dependent variables in regression models for the population of groups; however, these are 'latent regressions', because the βs cannot be observed without error.
Equation (5.7) is called an intercepts as outcomes model, and (5.8) a slopes as outcomes model.¹

¹ In the older literature, these equations were applied to the estimated groupwise regression coefficients rather than the latent coefficients. The statistical estimation then was carried out in two stages: first ordinary least squares ('OLS') estimation within each group, next OLS estimation with the estimated coefficients as outcomes. This two-stage approach is statistically inefficient and does not differentiate the 'true score' variability of the latent coefficients from the sampling variability of the estimated groupwise regression coefficients. We do not treat this two-stage method.

5.2.1 Cross-level interaction effects

This chapter started with the basic model (5.1), reading

Y_ij = β0j + β1j x_ij + R_ij .

Substituting (5.7) and (5.8) in this equation leads to the model

Y_ij = (γ00 + γ01 z_j + U0j) + (γ10 + γ11 z_j + U1j) x_ij + R_ij
     = γ00 + γ10 x_ij + γ01 z_j + γ11 z_j x_ij
       + U0j + U1j x_ij + R_ij .   (5.9)

The last expression was rearranged so that first comes the fixed part and then the random part. Comparing this with model (5.3) shows that the random part, U0j + U1j x_ij + R_ij, remains the same. However, it is to be expected that the residual random intercept and slope variances, τ0² and τ1², will be less than their counterparts in model (5.3), because part of the variability of intercepts and slopes now is explained by Z. In Chapter 7 we will see, however, that this is not necessarily so for the estimated values of these parameters.

In equation (5.9) we see that explaining the intercept β0j by a level-two variable Z leads to a main effect of Z, while explaining the coefficient β1j of X by the level-two variable Z leads to a product interaction effect of X and Z. Such an interaction between a level-one and a level-two variable is called a cross-level interaction. For the model including the interaction term z_j x_ij, the main effect coefficient γ10 of X is to be interpreted as the effect of X for cases with Z = 0, while the main effect coefficient γ01 of Z is to be interpreted as the effect of Z for cases with X = 0.

Example 5.2.
Cross-level interaction between IQ and group size.

The group size of the school classes yields a partial explanation of the class-dependent slopes. Group size ranges from 5 to 31, with a mean of 20.1. The school variable Z2 is defined as group size minus 20.1. (The name Z1 is implicitly used already for the group mean of IQ.) When this variable is added to the model of Example 5.1, the parameter estimates presented in Table 5.2 are obtained.

Table 5.2 Estimates for model with random slope and cross-level interaction.
[Most entries of this table are illegible in the scan; the recoverable fixed estimates include the coefficient of IQ, 2.443, and the coefficient of Z2 × IQ, −0.022.]

The value of Z2 ranges from about −18 to about +14. The class-dependent regression coefficient of IQ, cf. (5.8), is γ10 + γ12 z_2j + U1j, estimated as 2.443 − 0.022 z_2j + U1j. For z_2j ranging between −18 and +14, the fixed part of this coefficient ranges between about 2.1 and 2.8.

Cross-level interactions can be considered on the basis of two different kinds of argument. The above presentation is in line with an inductive argument: if a researcher finds a significant random slope variance, she may be led to think of level-two variables that could explain the random slope. An alternative approach is to base the cross-level interaction on substantive (theoretical) arguments formulated before looking at the data. The researcher then is led to estimate and test the cross-level interaction effect irrespective of whether a random slope variance was found. If a cross-level interaction effect exists, the power of the statistical test for this effect is considerably higher than the power of the test for the corresponding random slope (assuming that the same model serves as the null hypothesis). Therefore it is not a contradiction to look for a specific cross-level interaction even if no significant random slope was found. This is further elaborated in the last part of subsection 6.4.1.
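The fixed part of the class-dependent coefficient in Example 5.2 can be computed directly. The sketch below uses the two point estimates quoted in the text; the function name `iq_slope` is ours, and the residual slope U_1j is set to zero so that only the fixed part is shown.

```python
# Class-dependent IQ slope implied by the cross-level interaction model (5.9):
#   beta_1j = gamma_10 + gamma_12 * z2_j + U_1j.
# Point estimates taken from Example 5.2; U_1j is set to 0 here.
gamma_10 = 2.443   # main effect of IQ
gamma_12 = -0.022  # cross-level interaction of IQ with centered group size

def iq_slope(z2):
    """Expected within-class regression coefficient of IQ for centered group size z2."""
    return gamma_10 + gamma_12 * z2

for z2 in (-18, 0, 14):
    print(f"Z2 = {z2:+3d}: expected IQ slope = {iq_slope(z2):.3f}")
```

So classes that are smaller than average are expected to show a somewhat steeper IQ effect (2.84 at Z2 = −18) than classes that are larger than average (2.13 at Z2 = +14).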
More variables

The preceding models can be extended by including more variables that have random effects, and more variables explaining these random effects. Suppose that there are p level-one explanatory variables X1, ..., Xp, and q level-two explanatory variables Z1, ..., Zq. If the researcher is not afraid of a model with too many parameters, he can consider the model where all X-variables have varying slopes, and where the random intercept as well as all these slopes are explained by all Z-variables. At the within-group level, i.e., for the individuals, the model then is a regression model with p variables,

Y_ij = β0j + β1j x_1ij + ... + βpj x_pij + R_ij .   (5.10)

The explanation of the regression coefficients β0j to βpj is based on the between-group model, which is a q-variable regression model for the group-dependent coefficients βhj,

βhj = γh0 + γh1 z_1j + ... + γhq z_qj + Uhj .   (5.11)

Substitution of (5.11) in (5.10) and rearrangement of terms then yields the model

Y_ij = γ00 + Σ_{h=1}^{p} γh0 x_hij + Σ_{k=1}^{q} γ0k z_kj + Σ_{h=1}^{p} Σ_{k=1}^{q} γhk z_kj x_hij
       + U0j + Σ_{h=1}^{p} Uhj x_hij + R_ij .   (5.12)

This shows that we obtain main effects of each X and Z variable as well as all cross-level product interactions. Further, we see the reason why, in formula (4.7), the fixed coefficients were called γh0 for the level-one variable Xh and γ0k for the level-two variable Zk. The groups are now characterized by p + 1 random coefficients U0j to Upj. These random coefficients are independent between groups, but may be correlated within groups. It is assumed that the vector (U0j, ..., Upj) is independent of the level-one residuals R_ij and that all residuals have population means 0, given the values of all explanatory variables. It is also assumed that the level-one residual R_ij has a normal distribution with constant variance σ² and that (U0j, ..., Upj) has a multivariate normal distribution with a constant covariance matrix.
Analogous to (5.4), the variances and covariances of the level-two random effects are denoted

var(Uhj) = τhh = τh²   (h = 0, ..., p);
cov(Uhj, Ukj) = τhk   (h, k = 0, ..., p).   (5.13)

Example 5.3 A model with many fixed effects.

For this example, it must be noted that there is a distinction between class and group. The class is the set of pupils who are being taught by the same teacher in the same classroom. The group is the subset of those pupils in the class who are in grade 8. Some classes are combinations of grade 7 and grade 8 pupils; only the grade 8 pupils are part of this data set. Accordingly, a variable COMB is defined that indicates whether a class is such a multi-grade class (COMB = 1; 59 classes) or entirely composed of grade 8 pupils (COMB = 0; 78 classes).

The model for the language test score in this example includes main effects and cross-level interactions between the following variables.

Pupil level:
- IQ (as used in the preceding examples);
- SES = social-economic status of the pupil's family (a numerical variable with mean 0 and standard deviation 10.9).

Class level:
- the average IQ in the group;
- GS = group size;
- COMB = indicator of multi-grade classes.

(The class average of SES is not included in this example, because other analyses showed that this variable has no significant effect; in other words, there is no significant difference between the within-group and the between-group regressions on SES.)

Estimating model (5.12) (with p = 2, q = 3) leads to the results presented in Table 5.3.

Table 5.3 Estimates for model with random slopes and many effects.
[The entries of this table are illegible in the scan, apart from row labels such as 'Coefficient of GS', 'Coefficient of IQ × average IQ', 'Coefficient of IQ × GS', and 'Coefficient of IQ × COMB'.]

Even though the numbers of variables p and q are quite small, model (5.12) entails a number of statistical parameters that usually is too large for comfort.
Therefore, two simplifications are often used:

(a) Not all X-variables are considered to have random slopes. Note that the cross-level interaction of a level-one variable Xh with a level-two variable can be included in the model even if Xh does not have a random slope: one may fix this slope by setting the corresponding parameters (testing them is discussed in Chapter 6) to 0, or the slope variance may even be estimated to be 0.

(b) Even if the coefficients βhj of certain variables Xh are variable across groups, it is not necessary to use all variables Zk for explaining this variability. The resulting simplification is that each βhj is explained by only a well-chosen subset of the Zk, the remaining γhk being set to 0.

The choice of these simplifications, and of the cross-level interactions to retain, depends on subject-matter as well as empirical considerations; the usual principles of testing and model building then apply.

Example 5.4 A parsimonious model in the case of many variables.

In Table 5.3, the random slope variance of SES was estimated to be 0 (this happens occasionally; see Section 5.4). Therefore, this random slope is removed from the model. As we shall see in Chapter 6, the significance of fixed effects can be tested by applying a t-test to the ratio of estimate to standard error. The non-significant fixed effects are removed from the model, retaining only the cross-level interaction between IQ and COMB. The resulting estimates are displayed in Table 5.4.

Table 5.4 Estimates for a more parsimonious model with a random slope and many effects.
[The entries of this table are largely illegible in the scan; among the legible rows are the coefficient of COMB and the coefficient of IQ × COMB, discussed below.]

The estimates of the remaining effects do not change much from Table 5.3, but the standard errors of many fixed coefficients decrease. This can be explained by the omission of the many non-significant effects. Note that COMB is not a centered variable, being 0 or 1 with a slightly positive mean, while all other explanatory variables have zero means. Therefore, the intercept is the mean language score of pupils with average values of the explanatory variables in grade-8-only classes (COMB = 0), and the regression coefficient of IQ is the average effect of IQ in such classes. The mean language score of pupils in multi-grade classes differs by the coefficient of COMB.
The interaction effect indicates how the effect of IQ differs for multi-grade classes: the coefficient of IQ for a multi-grade class is the sum of the main effect of IQ and the coefficient of IQ × COMB.

Compared to Example 5.2, it turns out that it is not group size but COMB that seems to have an effect: both a main effect and an interaction effect with IQ. Multi-grade classes (COMB = 1) lead to lower language scores and to a higher effect of intelligence. The unexplained variability of the class-dependent slopes of language on IQ is considerably less than in Table 5.2, where the interaction of IQ with GS (group size), instead of COMB, was included in the model.

The model found can be expressed as a model with variable intercepts and slopes,

Y_ij = β0j + β1j x_1ij + β2j x_2ij + R_ij ,

where X1 is IQ and X2 is SES. The intercept is

β0j = γ00 + γ01 z_1j + γ02 z_2j + U0j ,

where Z1 is average IQ and Z2 is COMB. The coefficient of X1 is

β1j = γ10 + γ12 z_2j + U1j ,

while the coefficient of X2 is not variable,

β2j = γ20 .

5.2.2 A general formulation of fixed and random parts

Formally, and in many computer programs, these simplifications lead to a representation of the hierarchical linear model that is slightly different from (5.12). (For example, the HLM program uses the formulations (5.10) − (5.12), whereas MLn uses formulation (5.14).) Whether a level-one variable was obtained as a cross-level interaction or not is immaterial to the computer program. Even the difference between level-one variables and level-two variables, although possibly relevant for the way the data are stored, is not of any importance for the parameter estimation. Therefore, all variables, level-one and level-two variables including product interactions, can be represented mathematically simply as x_hij. When there are r explanatory variables, ordered so that the first p have fixed and random coefficients, while the last r − p have only fixed coefficients,² the hierarchical linear model can be represented as

Y_ij = γ0 + Σ_{h=1}^{r} γh x_hij + U0j + Σ_{h=1}^{p} Uhj x_hij + R_ij .   (5.14)

The two terms,

γ0 + Σ_{h=1}^{r} γh x_hij   and   U0j + Σ_{h=1}^{p} Uhj x_hij + R_ij ,

are the fixed and the random parts of the model, respectively.
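One attraction of formulation (5.14) is that its random part directly implies the covariance matrix of the observations within one group: with Z_j denoting the matrix containing a constant column and the p random-slope variables, cov(Y_j) = Z_j τ Z_j' + σ² I. The sketch below computes this matrix for one small group, using the random-part estimates of Table 5.1 as quoted in Example 5.1 (τ00 = 7.92, τ01 = −0.82, τ11 = 0.20, σ² = 41.35), so that its corner entries reproduce the variances and covariance computed there; the group composition is our own illustrative choice.

```python
import numpy as np

# Within-group covariance matrix implied by the random part of (5.14):
#   cov(Y_j) = Z_j @ T @ Z_j.T + sigma2 * I,
# where Z_j has a column of ones (random intercept) and one column per
# random-slope variable. Parameter values as quoted from Table 5.1.
sigma2 = 41.35
T = np.array([[7.92, -0.82],    # tau_00, tau_01
              [-0.82, 0.20]])   # tau_01, tau_11

x = np.array([-4.0, 0.0, 4.0])               # one random-slope variable, 3 pupils
Z = np.column_stack([np.ones_like(x), x])    # design matrix of the random part

cov_Y = Z @ T @ Z.T + sigma2 * np.eye(len(x))
print(np.round(cov_Y, 2))
```

The diagonal entries for the pupils with IQ = −4 and IQ = +4 are 59.03 and 45.91, and their covariance is 4.72, exactly the values derived in Example 5.1.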
In cases where the explanation of the random effects works extremely well, one may end up with models without any random effects at level two. In other words, the random intercept U0j and all random slopes Uhj in (5.14) have zero variance, and may just as well be omitted from the formula. In this case, the resulting model may be analysed just as well with OLS regression analysis, because the residuals are independent and have constant variance. Of course, this is known only after the multilevel analysis has been carried out. In such a case, the within-group dependence between measurements has been fully explained by the available explanatory variables (and their interactions). This underlines that whether the hierarchical linear model is a more adequate model for analysis than OLS regression depends not on the dependence of the measurements, but on the dependence of the residuals.

² It is mathematically possible that some variables have a random but not a fixed effect. This makes sense only in special cases.

5.3 Specification of random slope models

Given that random slope models are available, the researcher has many options to model his data. Each predictor may be assigned a random slope, and each random slope may covary with any other random slope. Parsimonious models, however, should be preferred, if only for the simple reason that a strong scientific theory is general rather than specific. A good guide for choosing between a fixed or a random slope for a given predictor variable should preferably be found in the theory that is being investigated. If the theory (whether this is a general scientific theory or a practical policy theory) does not give any clue with respect to a random slope for a certain predictor variable, then one may be tempted to refrain from using random slopes.
However, this implies a risk of invalid statistical tests, because if some variable does have a random slope, then omitting this feature from the model could affect the estimated standard errors of the other variables. The specification of the hierarchical linear model, including the random part, is discussed more fully in Section 6.4 and Chapter 9.

In data exploration, one can try various specifications. Often it appears that the chance of detecting slope variation is high for variables with strong fixed effects. This, however, is an empirical rather than a theoretical assertion. Actually, it may well be that when a fixed effect is (almost) zero, there does exist slope variation. Consider, for instance, the case where male teachers treat boys advantageously over girls, whereas for female teachers the situation is reversed. If half of the sample consists of male and the other half of female teachers, then, all other things being equal, the main gender effect on achievement will be absent, since in half of the classes the gender effect will be positive and in the other half negative. The fixed effect of students' gender then is zero but varies across classes (depending on the teachers' gender). In this example, of course, the random effect would disappear if one should specify the cross-level interaction effect of teachers' gender with students' gender.

5.3.1 Centering variables with random slopes?

Recall from Figure 5.1 that the intercept variance and the meaning of the intercept in random slope models depend on the location of the X variable. Also the covariance between the intercepts and the slopes is dependent on this location. In the examples presented so far we have used an IQ score for which the grand mean was zero (the original score was transformed by subtracting the grand mean IQ). This facilitated interpretation, since the intercept could be interpreted as the expected score for a student with average IQ.
Making the IQ slope random did not have consequences for these meanings.

In Section 4.5 a model was introduced by which we could distinguish within- from between-group regression. Two models were discussed:

Y_ij = γ00 + γ10 x_ij + γ01 x̄.j + U0j + R_ij   (4.8)

and

Y_ij = γ̃00 + γ̃10 (x_ij − x̄.j) + γ̃01 x̄.j + U0j + R_ij .   (4.9)

It was shown that γ̃01 = γ10 + γ01 and γ̃10 = γ10, so that the two models are equivalent.

Are the models also equivalent when the effect of X_ij or (X_ij − X̄.j) is random across groups? This was discussed by Kreft, de Leeuw, and Aiken (1995). Let us first consider the extension of (4.8). Define the level-one and level-two models

Y_ij = β0j + β1j x_ij + γ01 x̄.j + R_ij ,
β0j = γ00 + U0j ,
β1j = γ10 + U1j ;

substituting the level-two model into the level-one model leads to

Y_ij = γ00 + γ10 x_ij + γ01 x̄.j + U0j + U1j x_ij + R_ij .

Next we consider the extension of (4.9):

Y_ij = β0j + β1j (x_ij − x̄.j) + γ̃01 x̄.j + R_ij ,
β0j = γ̃00 + U0j ,
β1j = γ̃10 + U1j ;

substitution and rearrangement of terms now yields

Y_ij = γ̃00 + γ̃10 x_ij + (γ̃01 − γ̃10) x̄.j + U0j + U1j x_ij − U1j x̄.j + R_ij .

This shows that the two models differ in the term U1j x̄.j, which is included in the group-mean centered random slope model but not in the other model. Therefore in general there is no one-to-one relation between the γ and the γ̃ parameters, so the models are not statistically equivalent, except for the extraordinary case where variable X has no between-group variability.

This implies that in constant slope models one can use either X_ij and X̄.j, or (X_ij − X̄.j) and X̄.j, as predictors, since this results in statistically equivalent models; but in random slope models one should carefully choose one or the other specification.

On which consideration should this choice be based? Generally one should be reluctant to use group-mean centered random slope models unless there is a clear theory (or an empirical clue) that not in the first place the absolute score X_ij but rather the relative score (X_ij − X̄.j) is related to Y_ij.
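The non-equivalence can also be seen in the marginal variances the two specifications imply. The sketch below gives both specifications identical (illustrative) variance components, so that only the centering differs; the function names are ours.

```python
# Marginal variance of Y_ij under the two random-slope specifications of
# Section 5.3.1, with identical (illustrative) variance components.
#   Raw-score model:      random part U_0j + U_1j * x_ij
#   Group-centered model: random part U_0j + U_1j * (x_ij - xbar_j)
tau00, tau01, tau11, sigma2 = 7.92, -0.82, 0.20, 41.35

def var_raw(x):
    return tau00 + 2 * tau01 * x + tau11 * x**2 + sigma2

def var_centered(x, xbar):
    d = x - xbar
    return tau00 + 2 * tau01 * d + tau11 * d**2 + sigma2

x = 2.0
print(var_raw(x) == var_centered(x, xbar=0.0))   # group mean 0: identical
print(var_raw(x) == var_centered(x, xbar=1.5))   # nonzero group mean: they differ
```

With the same parameter values, the two specifications imply the same variance only where the group mean is zero; as soon as the group means vary around nonzero values, the extra term U_1j x̄.j makes the implied distributions different.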
Now (X_ij − X̄.j) indicates the relative position of an individual in his or her group, and examples of instances where one may be particularly interested in this variable are:

- research on normative or comparative social reference processes (e.g., Guldemond, 1994);
- research on relative deprivation;
- research on teachers' rating of student performance.

5.4 Estimation

What was mentioned in Section 4.6 can be applied, with the necessary extensions, also to the estimation of parameters in the more complicated model (5.14). A number of iterative estimation algorithms have been proposed, e.g., by Laird and Ware (1982), Goldstein (1986), and Longford (1987), and are now implemented in multilevel software.

The following may give some intuitive understanding of estimation methods. If the parameters of the random part, i.e., the parameters in (5.13) together with σ², were known, then the regression coefficients could be estimated straightforwardly with the so-called generalized least squares ('GLS') method. Conversely, if all regression coefficients γh were known, the total 'residuals' (which seems an apt name for the second line of equation (5.12)) could be computed, and their covariance matrix could be used to estimate the parameters of the random part. These two partial estimation processes can be alternated: use provisional values for the random part parameters to estimate regression coefficients, use the latter estimates to estimate the random part parameters again (and now better), then go on to estimate the regression coefficients again, and so on ad libitum, or, rather, until convergence of this iterative process.
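For the random intercept model, this alternating scheme can be caricatured in a few lines. The sketch below is only an illustration of the alternation (fixed part given random part, then random part given fixed part), using crude moment estimators in the second step; it is not the actual IGLS, Fisher scoring, or EM algorithm, and all data and parameter values are simulated.

```python
import numpy as np

# Alternating estimation for the random intercept model
#   Y_ij = gamma0 + gamma1 * x_ij + U_j + R_ij
# with balanced groups: a GLS step for the fixed coefficients, then a
# moment step for sigma^2 and tau^2; iterate until (near) convergence.
rng = np.random.default_rng(0)
N, n = 200, 10                                    # number of groups, group size
x = rng.normal(size=(N, n))
U = rng.normal(scale=np.sqrt(2.0), size=(N, 1))   # true tau^2 = 2
y = 1.0 + 0.5 * x + U + rng.normal(size=(N, n))   # true sigma^2 = 1

sigma2, tau2 = 1.0, 1.0                           # provisional random-part values
for _ in range(20):
    # GLS step: V_j is the same for every group because the design is balanced.
    V_inv = np.linalg.inv(sigma2 * np.eye(n) + tau2 * np.ones((n, n)))
    XtVX = np.zeros((2, 2))
    XtVy = np.zeros(2)
    for j in range(N):
        Xj = np.column_stack([np.ones(n), x[j]])
        XtVX += Xj.T @ V_inv @ Xj
        XtVy += Xj.T @ V_inv @ y[j]
    gamma = np.linalg.solve(XtVX, XtVy)
    # Moment step: total residuals U_j + R_ij give new sigma^2 and tau^2.
    e = y - (gamma[0] + gamma[1] * x)
    e_bar = e.mean(axis=1, keepdims=True)          # group means of the residuals
    sigma2 = np.sum((e - e_bar) ** 2) / (N * (n - 1))
    tau2 = max(e_bar.var() - sigma2 / n, 0.0)      # var of group means = tau^2 + sigma^2/n

print(np.round(gamma, 2), round(sigma2, 2), round(tau2, 2))
```

With 200 groups of 10, the estimates land close to the true values (gamma near (1, 0.5), sigma^2 near 1, tau^2 near 2). The actual ML algorithms differ in detail but share this alternating structure.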
This loose description is close to the iterated generalized least squares ('IGLS') method that is one of the algorithms to calculate the ML estimates. There exist other methods (one called Fisher scoring, treated in Longford (1993); the other called EM, for Expectation-Maximization, treated in Bryk and Raudenbush (1992)) which calculate the same estimates, each with its own advantages.

Parameters can again be estimated with the ML or with the REML method; the REML method is preferable in the sense that it produces less biased estimates for the random part parameters in the case of small sample sizes, but the ML method is more convenient if one wishes to use deviance tests (see the next chapter). The IGLS algorithm produces ML estimates, whereas the so-called RIGLS ('restricted IGLS') algorithm yields the REML estimates.

For the random slopes model also it is possible that estimates for the variance parameters τh² are exactly 0. The explanation is analogous to the explanation given for the intercept variances in Section 4.6.

The random group effects Uhj can again be 'estimated' by the empirical Bayes method, and the resulting 'estimates' are called posterior slopes (sometimes posterior means). This is analogous to what is treated in Section 4.7 about posterior means.

Usually the estimation algorithms do not allow one to include an unlimited number of random slopes. Depending on the data set and the model specification, it is not uncommon that the algorithm refuses to converge for more than two or three variables with random slopes. Sometimes the convergence can be improved by linearly transforming the variables with random slopes so that they have (approximately) zero means, or by transforming them to have (approximately) zero correlations.

For some data sets the estimation method can produce estimated variance and covariance parameters that correspond to impossible covariance matrices for the random effects at level two; e.g., τ01² sometimes is estimated larger than τ00 × τ11.
This would imply an intercept-slope correlation larger than 1. This is not an error of the estimation procedure, and it can be understood as follows. The estimation procedure is directed at the mean vector and covariance matrix of the vector of all observations. Some combinations of parameter values correspond to permissible structures of the latter covariance matrix that, nevertheless, cannot be formulated as a random effects model such as (5.3). Even if the estimated values for the τhk parameters do not combine into a positive definite matrix τ, the σ² parameter will still make the covariance matrix of the original observations (cf. equations (5.5) and (5.6)) positive definite. Therefore, such a strange result is in contradiction to the random effects formulation (5.3), but not to the more general formulation of a patterned covariance matrix for the observations.

Most computer programs give the standard errors of the variances of the random intercept and slopes; some give the standard errors of the estimated standard deviations instead. The two standard errors can be transformed into each other by the approximation formula

S.E.(σ̂²) ≈ 2 σ̂ S.E.(σ̂) .   (5.15)

However, some caution is necessary in the use of these standard errors. The interpretation of a confidence interval as the estimated value plus or minus twice the standard error is a valid approximation only if the relative standard error of the estimate (i.e., standard error divided by parameter estimate) is small, say, less than 1/4.

5.5 Three and more levels

When the data have a three-level hierarchy, slopes of level-one variables can be made random at level two and also at level three. In this case there will be at least two level-two and two level-three equations: one for the random intercept and one for the random slope. So, in the case of one explanatory variable, the model might be formulated as follows:

Y_ijk = β0jk + β1jk x_ijk + R_ijk   (level-one model)
β0jk = δ00k + U0jk   (level-two model for the intercept)
β1jk = δ10k + U1jk   (level-two model for the slope)
δ00k = γ000 + V00k   (level-three model for the intercept)
δ10k = γ100 + V10k   (level-three model for the slope)

In the specification of such a model, for each level-one variable with a random slope it has to be decided whether its slope must be random at level two, random at level three, or both. Generally one should have either strong a priori knowledge or a good theory to formulate models as complex as this one, or even more complex models (i.e., with more random slopes). Further, for each level-two variable it must be decided whether its slope is random at level three.

Example 5.5 A three-level model with a random slope.

We continue with the example of Section 4.8, where we illustrated the three-level model using a data set about a math test administered to students in classes in schools. Now we include the available covariates (which are all centered around their grand means), and moreover the regression coefficient for the mathematics pretest is allowed to be random at level two and level three. The results are in Table 5.5.

Table 5.5 Estimates for three-level model with random slopes.
[Most entries of this table are illegible in the scan. Legible fixed estimates include the coefficient of IQ, 0.080 (S.E. 0.008), and the coefficient of the mathematics pretest, 0.146; legible random-part estimates include the level-three and level-two slope variances for the pretest, 0.0024 and 0.0019, discussed below.]

The interpretation of the fixed part is straightforward, as in conventional single-level regression models. The random part is more complicated. Since all predictor variables were grand-mean centered, the intercept variances at level three and level two have a clear meaning: they represent the amount of variation in mathematics achievement across schools and across classes within schools, respectively, for the average student, whilst controlling for differences in IQ, mathematics ability, achievement motivation, educational level of the father, and gender.
Comparing this table with Table 4.5 shows that much of the initial level-three and (especially) level-two variation has now been accounted for. Once there is control for initial differences, schools and classes within schools differ considerably less in the average mathematics achievement of their students at the end of grade two.

Now we turn to the slope variance. The fixed slope coefficient for the mathematics pretest is estimated to be 0.146. The variance at level three for this slope is 0.0024, and at level two 0.0019. So the variability between schools of the effect of the pretest is somewhat larger than the variability of this effect between classes. At one end of the distribution there are a few percent of the schools that have an effect of the pretest that is only 0.146 − 2 × √0.0024 = 0.048, whereas in the most effective schools this effect is as large as 0.146 + 2 × √0.0024 = 0.244. Since slopes also vary across classes within schools, the gap between initially low and initially high achievers (4 standard deviations apart; this standard deviation for the pretest is 8.21) within a school can become as big as (0.146 + 2√(0.0024 + 0.0019)) × (4 × 8.21) ≈ 9 points in the most effective schools, whereas on the other hand it can be as small as (0.146 − 2√(0.0024 + 0.0019)) × (4 × 8.21) ≈ 0.5 point in the least effective schools. Given the standard deviation of 8.4 for the dependent variable, a difference of 0.5 is very low, whereas 9 is quite a large difference.

6 Testing and Model Specification

(It is assumed for this chapter that the reader has a basic knowledge of statistical testing: null hypothesis, alternative hypothesis, errors of the first and the second kind, significance level, statistical power.)

6.1 Tests for fixed parameters

Suppose we are working with a model represented as (5.14). The null hypothesis that a certain regression parameter is 0, i.e.,

H0: γh = 0 ,   (6.1)

can be tested by a t-test. The statistical estimation leads to an estimate γ̂h with associated standard error S.E.(γ̂h).
Their ratio is a t-value:

T(γh) = γ̂h / S.E.(γ̂h) .   (6.2)

One-sided as well as two-sided tests can be carried out on the basis of this test statistic.¹ Under the null hypothesis, T(γh) has approximately a t distribution, but the number of degrees of freedom (d.f.) is somewhat more complicated than in multiple linear regression, because of the presence of the two levels. The approximation by the t distribution is not exact even if the normality assumption for the random coefficients holds. Suppose first that we are testing the coefficient of a level-one variable (in accordance with the specification of model (5.14), a cross-level interaction variable also is considered a level-one variable). If the total number of level-one units is M and the total number of explanatory variables is r, then we can take d.f. = M − r − 1. For testing the coefficient of a level-two variable, when there are N level-two units and q explanatory variables at level two, we can take d.f. = N − q − 1. If the number of units minus the number of variables is large enough, say, larger than 40, the t distribution can be replaced by a standard normal distribution.

¹ This is one of the common principles for construction of a test. This type of test is called the Wald test, after the statistician Abraham Wald (1902-1950).

Example 6.1 Testing within- and between-group regressions.

We wish to test whether between-group and within-group regressions of the language test score on IQ are different from one another, when controlling for social-economic status (SES). A model with a random slope for IQ is used. Two models are estimated and presented in Table 6.1. The first contains the raw IQ variable along with the group mean; the second contains the within-group deviation variable, defined as

IQ_ij − ĪQ.j ,

also together with the group mean. To test whether within- and between-group regression coefficients are different, the significance of the group mean can be tested while controlling for the raw IQ variable. The difference between within-group and between-group regressions was discussed in Section 4.5.

Table 6.1 Estimates for two models with different between- and within-group regressions.
[The entries of Table 6.1 are largely illegible in the scan; among the legible values, in Model 2 the within-group (deviation) IQ coefficient is 2.209 (S.E. 0.082) and the group-mean IQ coefficient is 3.345 (S.E. 0.320).]

The table shows that only the estimate for IQ differs between the two models. This is in accordance with Section 4.5: if IQ is variable 1 and the group mean of IQ is variable 2, so that the regression coefficients are γ10 and γ01, respectively, then the within-group regression coefficient is γ10 in Model 1 as well as in Model 2, while the between-group regression coefficient is γ10 + γ01 in Model 1 and γ01 in Model 2. The two models are equivalent representations of the data and differ only in the parametrization.

The within-group and between-group regression coefficients are the same if and only if γ01 = 0 in Model 1, i.e., if the group mean has a zero coefficient when controlling for the raw variable (the pupil-level variable without group centering). The test statistic for testing H0: γ01 = 0 in Model 1, the ratio of this estimate to its standard error, is strongly significant (p < 0.0001). It may be concluded that within-group and between-group regressions are significantly different.

The results of Model 2 can be used to test whether the within-group or between-group regressions are 0. The test statistic for testing the within-group regression is 2.209/0.082 ≈ 26.9, and the statistic for testing the between-group regression is 3.345/0.320 ≈ 10.5. Both are extremely significant. Concluding, there are positive within-group as well as between-group regressions, and these are different from one another.

6.1.1 Multi-parameter tests for fixed effects

Sometimes we wish to test several regression parameters simultaneously. For example, consider testing the effect of a categorical variable with more than two categories. The effect of such a variable can be represented in the fixed part of the hierarchical linear model by c − 1 dummy variables, where c is the number of categories, and this effect is nil if and only if all the corresponding c − 1 regression coefficients are 0. Two types of test can be used for this purpose: the multivariate Wald test and the likelihood ratio test, also known as the deviance test.
The latter test is explained in the next section.

For the multivariate Wald test, we need not only the standard errors of the estimates but also the covariances among them. Suppose that we consider a certain vector γ of q regression parameters, for which we wish to test the null hypothesis

H0: γ = 0 .   (6.3)

The statistical estimation leads to an estimate γ̂ and an associated estimated covariance matrix Σ̂. From these, we can let a computer program calculate the test statistic represented in matrix form by

γ̂' Σ̂⁻¹ γ̂ .   (6.4)

Under the null hypothesis, the distribution of this statistic can be approximated by the chi-squared distribution with q degrees of freedom.²

The way of obtaining tests presented in this section is not applicable to tests of whether parameters (variances or covariances) in the random part of the model are 0. The reason is the fact that, if a population variance parameter is 0, its estimate divided by the estimated standard error does not approximately have a t-distribution. Tests for such hypotheses are discussed in the next section.

6.2 Deviance tests

The deviance test, or likelihood ratio test, is a quite general principle for statistical testing. In applications of the hierarchical linear model this test is mainly used for multi-parameter tests and for tests about the random part of the model. The general principle is as follows.

When parameters of a statistical model are estimated by the maximum likelihood (ML) method, the estimation also provides the likelihood, which can be transformed into the deviance, defined as minus twice the natural logarithm of the likelihood. This deviance can be regarded as a measure of lack of fit between model and data, but (in most statistical models) one cannot interpret the values of the deviance directly, only differences in deviance values for several models fitted to the same data set.
² This approximation neglects the fact that Σ̂ is estimated, and not known exactly. It corresponds to using the standard normal distribution to test the value of (6.2). In principle, this could be taken into account by using the F rather than the chi-squared distribution. If the number of groups is large, the difference is not appreciable.

Suppose that two models are fitted to one data set, model M0 with m0 parameters and a larger model M1 with m1 parameters, so that M1 can be regarded as an extension of M0 with m1 − m0 parameters added. Suppose that M0 is tested as the null hypothesis and M1 is the alternative hypothesis. Indicating the deviances by D0 and D1, respectively, their difference D0 − D1 can be used as a test statistic having a chi-squared distribution with m1 − m0 degrees of freedom. This type of test can be applied to parameters of the fixed as well as of the random part. The deviance produced by the residual maximum likelihood (REML) method can be used in deviance tests only if the two models compared (M0 and M1) have the same fixed parts and differ only in their random parts.

Example 6.2 Test of random intercept.
In Example 4.2, the random intercept model yields a deviance of D1 = 15251.8, while the OLS regression model has a deviance of D0 = 15477.7. There is m1 − m0 = 1 parameter added, the random intercept. The deviance difference is 225.9, immensely significant in a chi-squared distribution with d.f. = 1. This implies that, even when controlling for the effect of IQ, the differences between groups are strongly significant.

For example, suppose one is testing the fixed effect of a categorical explanatory variable with c categories. This categorical variable can be represented by c − 1 dummy variables. Model M0 is the hierarchical linear model with the effects of the other variables in the fixed part and with the given random part. Model M1 also includes all these effects; in addition, the c − 1 regression coefficients of the dummy variables have been added.
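The deviance-difference calculation of Example 6.2 is easy to reproduce; here is a minimal sketch (the deviance values are those quoted in the example).

```python
from scipy.stats import chi2

def deviance_test(dev_null, dev_alt, df):
    """Likelihood-ratio (deviance) test of the smaller model M0
    against the larger model M1, with df = m1 - m0 parameters added."""
    diff = dev_null - dev_alt
    return diff, float(chi2.sf(diff, df))

# Example 6.2: OLS regression (D0 = 15477.7) against the
# random intercept model (D1 = 15251.8), one parameter added.
diff, p = deviance_test(15477.7, 15251.8, df=1)
# diff is 225.9 (up to floating point): "immensely significant" on 1 d.f.
```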
Hence the difference in the number of parameters is m1 − m0 = c − 1. This implies that the deviance difference D0 − D1 can be tested in a chi-squared distribution with d.f. = c − 1. This test is an alternative for the multi-parameter Wald test treated in the preceding section. These tests will be very close to each other for intermediate and large sample sizes.

Example 6.3 Effect of a categorical variable.
In the data set used in Chapters 4 and 5, schools differ according to their denomination: public, catholic, protestant, or non-denominational private. To represent these four categories, three dummy variables are used, contrasting the last three against the first category. This means that all dummy variables are 0 for public schools, the first is 1 for catholic schools, the second is 1 for protestant schools, and the third is 1 for non-denominational private schools. When the fixed effects of these c − 1 = 3 variables are added to the model presented in Table 5.2, which in this example has the role of M0 with deviance D0 = 15208.4, the deviance decreases to D1 = 15193.6. The chi-squared value is D0 − D1 = 14.8 with d.f. = 3, p < 0.005. It can be concluded that, when controlling for IQ and group size (see the specification of model M0 in Table 5.2), there are differences between the four types of school. The estimated fixed effects (with standard errors) of the dummy variables are 1.70 (0.64) for the catholic, −0.0 (0.67) for the protestant, and 1.09 (1.30) for the non-denominational private schools. These effects are relative to the public schools. This implies that the catholic schools achieve higher, controlling for IQ and group size, than the public schools, while the other two categories do not differ significantly from the public schools.

6.2.1 Halved p-values for variance parameters

Variances are by definition non-negative.
When testing the null hypothesis that a variance of the random intercept or of a random slope is zero, the alternative hypothesis is therefore one-sided. This observation can indeed be used to arrive at a sharpened version of the deviance test for variance parameters. This principle was derived by Miller (1977) and Self and Liang (1987).

First consider the case that a random intercept is tested. The null model M0 then is the model without a random part at level 2, i.e., all observations Yij are independent, conditional on the values of the explanatory variables. This is an ordinary linear regression model. The alternative model M1 is the random intercept model with the same explanatory variables. There is m1 − m0 = 1 additional parameter, the random intercept variance τ0². For the observed deviances D0 of model M0 (this model can be estimated by ordinary least squares) and D1 of the random intercept model, the difference D0 − D1 is calculated. If D0 − D1 = 0, the random intercept variance is definitely not significant (it is estimated as being 0). If D0 − D1 > 0, the tail probability of the difference D0 − D1 is looked up in a table of the chi-squared distribution with d.f. = 1. The p-value for testing the significance of the random intercept variance is half this tail value.

Second, consider the case of testing a random slope. Specifically, suppose that model (5.14) holds, and the null hypothesis is that the last random slope variance is zero: τp² = 0. Under this null hypothesis, the p covariances τhp for h = 0, ..., p − 1 also are 0. The alternative hypothesis is the model defined by (5.14), which has m1 − m0 = p + 1 parameters more than the null model (one variance and p covariances). For example, if there are no other random slopes in the model (p = 1), m1 − m0 = 2. The same procedure is followed as for testing the random intercept: both models are estimated, yielding the deviance difference D0 − D1. If D0 − D1 = 0, the random slope variance is not significant.
If D0 − D1 > 0, the tail probability of the difference D0 − D1 is looked up in a table of the chi-squared distribution with d.f. = p + 1. The p-value for testing the significance of the random slope variance is half this tail value.³

³ The argumentation can be given loosely as follows. If the variance parameter is zero, the probability is about 1/2 that the estimated value is 0, and also 1/2 that the estimated value is positive. For example, for estimating the level-two intercept variance τ0² in an empty model, formula (3.11) can be used. If indeed τ0² = 0, the probability is about 1/2 that this expression will be negative, and therefore truncated to 0. If the variance parameter is estimated as 0, the associated covariance parameters also are estimated as 0, with the result that all estimated parameters under model M1 are the same as under M0. This implies D1 = D0. The chi-squared distribution holds under the condition that the variance parameter is estimated at a positive value. It can be concluded that the null distribution of the deviance difference is a so-called mixed distribution, with probability 1/2 for the value 0, and probability 1/2 for the chi-squared distribution.

Example 6.4 Test of random slope.
When comparing Tables 4.4 and 5.1, it can be concluded that m1 − m0 = 2 parameters are added and the deviance diminishes by D0 − D1 = 15227.5 − 15213.5 = 14.0. Testing the value of 14.0 in a chi-squared distribution with d.f. = 2 yields p < 0.001. Halving the p-value leads to p < 0.0005. Thus, the significance probability of the random slope for IQ in the model of Table 5.1 is p < 0.0005.

As another example, suppose that one wishes to test the significance of the random slope for IQ in the model of Table 5.4. Then the model must be fitted in which this effect is omitted and all other effects remain. Thus, the omitted parameters are τ1² and τ01, so that d.f. = 2. Fitting this reduced model leads to a deviance of D0 = 15101.5, which is 8.5 more than the deviance in Table 5.4. The chi-squared distribution with d.f.
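The halved-p-value recipe is a one-liner; the sketch below (not from the book) applies it to the deviance difference of 14.0 on d.f. = 2 from Example 6.4.

```python
from scipy.stats import chi2

def variance_test_pvalue(dev_diff, df):
    """Halved p-value for testing a variance parameter (Section 6.2.1):
    half the chi-squared tail probability of the deviance difference,
    where df = 1 + the number of associated covariance parameters."""
    if dev_diff <= 0:
        return 1.0  # variance estimated as 0: certainly not significant
    return 0.5 * float(chi2.sf(dev_diff, df))

# Example 6.4: difference 14.0, d.f. = 2 (one variance, one covariance);
# the text reports p < 0.0005.
p = variance_test_pvalue(14.0, df=2)
```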
= 2 gives a tail probability of p < 0.02, so that halving the p-value yields p < 0.01. Thus, testing the random slope for IQ in Table 5.4 yields a significant outcome with p < 0.01.

6.3 Other tests for parameters in the random part

The deviance tests are very convenient for testing parameters in the random part, but other tests for random intercepts and slopes also exist. In Section 3.3.2, the ANOVA F-test for the intraclass correlation was mentioned. This is effectively a test for randomness of the intercept. If it is desired to test the random intercept while controlling for explanatory variables, one may use the F-test from the ANCOVA model, using the explanatory variables as covariates.

Bryk and Raudenbush (1992) present chi-squared tests for random intercepts and slopes, i.e., for variance parameters in the random part. They are based on calculating OLS estimates for the values of the random effects within each group and testing these values for equality. For the random intercept, these are the large-sample chi-squared approximations to the F-tests in the ANOVA or ANCOVA model mentioned in Section 3.3.2. Another test for variance parameters in the random part was proposed by Berkhof and Snijders (1998), and is explained in Section 9.2.2.

6.4 Model specification

Model specification is the choice of a satisfactory model. For the hierarchical linear model this amounts to the selection of relevant explanatory variables (and interactions) in the fixed part, and relevant random slopes (with their covariance pattern) in the random part. Model specification is one of the most difficult parts of statistical inference, because there are two steering wheels: substantive (subject-matter related) and statistical considerations. These steering wheels must be handled jointly. The purpose of model specification is to arrive at a model that describes the observed data to a satisfactory extent but without unnecessary complications.
A parallel purpose is to obtain a model that is substantively interesting, without wringing from the data drops that are really based on chance but interpreted as substance.

In linear regression analysis, model specification is already complicated (as is elaborated in many textbooks on regression, e.g., Ryan, 1997), and in multilevel analysis the number of complications is multiplied because of the complicated nature of the random part. The complicated nature of the hierarchical linear model, combined with the two steering wheels for model specification, implies that there are no clear and fixed rules to follow. Model specification is a process guided by the following principles.

1. Considerations relating to the subject matter. These follow from field knowledge, existing theory, detailed problem formulation, and common sense.

2. The distinction between effects that are indicated a priori as effects to be tested, i.e., effects on which the research is focused, and effects that are necessary to obtain a good model fit. Often the tested effects are a subset of the fixed effects, and the random part is to be fitted adequately but of secondary interest. When there is no strong prior knowledge about which variables to include in the random part, one may follow a data-driven approach to select the variables for the random part.

3. A preference for 'hierarchical' models in the general sense (not the 'hierarchical linear model' sense): if a model contains an interaction effect, then usually also the corresponding main effects should be included (even if these are not significant); and if a variable has a random slope, it normally should also have a fixed effect. The reason is that omitting such effects may lead to erroneous interpretations.

4. Doing justice to the multilevel nature of the problem. This is done as follows.
(a) When a given level-one variable is present, one should be aware that the within-group regression coefficient may differ from the between-group regression coefficient, as described in Section 4.5. This can be investigated by calculating a new variable, defined as the group mean of the level-one variable, and testing the effect of this new variable.

(b) When there is an important random intercept variance, there are important unexplained differences between group means. One may look for level-two variables (original level-two variables as well as aggregates of level-one variables) that explain part of these between-group differences.

(c) When there is an important random slope variance of some level-one variable, say, X1, there are important unexplained differences between the within-group effects of X1 on Y. Here also one may look for level-two variables that explain part of these differences. This leads to cross-level interactions as explained in Section 5.2. Recall from this section, however, that cross-level interactions can also be expected from theoretical considerations, even if no significant random slope variance is found.

5. Awareness of the necessity of including certain covariances of random effects. Including such covariances means that they are free parameters in the model, not constrained to 0 but estimated from the data. In Section 5.1.2, attention was given to the necessity of including in the model all covariances τ0h between the random slopes and the random intercept. Another case of this point arises when a categorical variable with c ≥ 3 categories has a random effect. This is implemented by giving random slopes to the c − 1 dummy variables that are used to represent the categorical variable. The covariances between these random slopes should then also be included in the model.
Formulated generally, suppose that two variables Xh and Xh′ have random effects, and that the meaning of these variables is such that they could be replaced by two linear combinations aXh + a′Xh′ and bXh + b′Xh′. (For the random intercept and random slope discussed in Section 5.1.2, the relevant type of linear combination would correspond to a change of origin of the variable with the random slope.) Then the covariance τhh′ between the two random slopes should be included in the model.

6. Reluctance to include non-significant effects in the model; one could also say, a reluctance to overfitting. Each of points 1-4 above, however, could override this reluctance. An obvious example of this overriding is the case where one wishes to test the effect of X1 while controlling for the effect of X2. The purpose of the analysis is a subject-matter consideration, and even if the effect of X2 is non-significant, one still should include this effect in the model.

7. The desire to obtain a good fit, and to include all effects in the model that contribute to a good fit; in practice, this leads to the inclusion of all significant effects, unless the data set is so large that certain effects, although significant, are deemed unimportant nevertheless.

8. Awareness of the following two basic statistical facts:

(a) Every test of a given statistical parameter controls for all other effects in the model used as a null hypothesis (M0 in Section 6.2). Since the latter set of effects has an influence on the interpretation as well as on the statistical power, test results may depend on the set of other effects included in the model.

(b) We are constantly making errors of the first and the second kind, especially the latter, since statistical power often is rather low. This implies that an effect being non-significant does not mean that the effect is absent in the population.
It also implies that a significant effect may be significant by chance (but the probability of this is no larger than the level of significance, most often set at 0.05). Multilevel research often is based on data with a limited number of groups. Since the power for detecting effects of level-two variables depends strongly on the number of groups in the data, warnings about low power are especially important for level-two variables.

These considerations are nice; but how to proceed in practice? To get an insight into the data, it is usually advisable to start with a descriptive analysis of the variables: an investigation of their means, standard deviations, correlations, and distributional forms. It is also helpful to make a preliminary ('quick and dirty') analysis with a simpler method such as OLS regression. When starting with the multilevel analysis as such, in most situations (longitudinal data may provide an exception) it is advisable to start with fitting the empty model (4.6). This gives the raw within-group and between-group variances, from which the estimated intraclass correlation can be calculated. These parameters are useful as a general description and a starting point for further model fitting. The process of further model specification will include forward steps: select additional effects (fixed or random), test their significance, and decide whether or not to include them in the model; and backward steps: exclude effects from the model because they are not important from a statistical or substantive point of view. We mention two possible approaches in the following subsections.

6.4.1 Working upward from level one

In the spirit of Section 5.2, one may start with constructing a model for level one, i.e., first explain within-group variability, and explain between-group variability subsequently. This two-phase approach is advocated by Bryk and Raudenbush (1992) and is followed in the program HLM (Bryk, Raudenbush, and Congdon, 1996) (see Chapter 15).
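The advice to begin with the empty model and the intraclass correlation can be illustrated with a small sketch (not from the book): for balanced data, the classical ANOVA estimators of the within- and between-group variances give an estimated intraclass correlation directly. The data below are invented.

```python
import numpy as np

def anova_icc(y):
    """ANOVA variance estimates and intraclass correlation for
    balanced two-level data y of shape (m groups, n units per group).
    (The ML estimates of the empty model differ slightly.)"""
    m, n = y.shape
    group_means = y.mean(axis=1)
    # within-group mean square: estimates the level-one variance
    ms_within = ((y - group_means[:, None]) ** 2).sum() / (m * (n - 1))
    # between-group mean square, on n observations per group mean
    ms_between = n * ((group_means - y.mean()) ** 2).sum() / (m - 1)
    tau2_hat = max((ms_between - ms_within) / n, 0.0)  # truncated at 0
    return ms_within, tau2_hat, tau2_hat / (tau2_hat + ms_within)

# Invented scores for 3 groups of 4 pupils:
scores = np.array([[5.0, 6.0, 7.0, 6.0],
                   [8.0, 9.0, 8.0, 9.0],
                   [4.0, 5.0, 4.0, 5.0]])
sigma2_hat, tau2_hat, icc = anova_icc(scores)
```

With these invented groups most of the variance lies between groups, so the estimated intraclass correlation is close to 1.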
Modeling within-group variability

Subject-matter knowledge and the availability of data lead to a number of level-one variables X1 to Xp which are deemed important, or hypothetically important, to predict or explain the value of Y. These variables lead to equation (5.10) as a starting point:

Yij = β0j + β1j x1ij + ... + βpj xpij + Rij.

This equation represents the within-group effects of the Xh on Y. The between-group variability is first modeled as random variability. This is represented by splitting the group-dependent regression coefficients βhj into a mean coefficient γh0 and a group-dependent deviation Uhj:

βhj = γh0 + Uhj.

Substitution yields

Yij = γ00 + Σh γh0 xhij + U0j + Σh Uhj xhij + Rij,    (6.5)

where the sums run over h = 1, ..., p. This model has a number of level-one variables with fixed and random effects, but it will usually not be necessary to include all random effects. For the precise specification of the level-one model, the following steps are useful.

1. Select in any case the variables on which the research is focused. In addition, select relevant available level-one variables on the basis of subject-matter knowledge. Also include plausible interactions between level-one variables.

2. Select, among these variables, those for which, on the basis of subject-matter knowledge, a group-dependent effect (random slope!) is plausible. If one does not have a clue, one could select the variables that are expected to have the strongest fixed effects.

3. Estimate the model with the fixed effects of step 1 and the random effects of step 2.

4. Test the significance of the random slopes, and exclude the non-significant slopes from the model.

5. Test the significance of the regression coefficients, and exclude the non-significant coefficients from the model. This can also be a moment to consider the inclusion of interaction effects between level-one variables.

6.
For a check, one could test whether the variables for which a group-dependent effect (i.e., a random slope) was not thought plausible in step 2 indeed have a non-significant random slope. (Keep in mind that, normally, including a random slope implies inclusion of the fixed effect!) Be reluctant to include random slopes for interactions; these are often hard to interpret.

With respect to the random slopes, one may be restricted by the fact that data usually contain less information about random effects than about fixed effects. Including many random slopes can therefore lead to long iteration processes of the estimation algorithm. The algorithm may even fail to converge. For this reason it may be necessary to specify only a small number of random slopes.

After this process, one has arrived at a model with a number of level-one variables, some of which have a random effect in addition to their fixed effect. It is possible that the random intercept is the only remaining random effect. This model is an interesting intermediate product, as it indicates the within-group regressions and their variability.

Modeling between-group variability

The next step is to try and explain these random effects by level-two variables. The random intercept variance can be explained by level-two variables, the random slopes by interactions of level-one with level-two variables, as was discussed in Section 5.2. It should be kept in mind that aggregates of level-one variables can be important level-two variables. For deciding which main effects of level-two variables and which cross-level interactions to include, it is again advisable first to select those effects that are plausible on the basis of substantive knowledge, then to test these and include or omit them depending on their importance (statistical and substantive), and finally to check whether other (less plausible) effects also are significant.
This procedure has a built-in filter for cross-level interactions: an interaction between level-one variable X and level-two variable Z is considered only if X has a significant random slope. However, this 'filter' should not be employed as a strict rule. If there are theoretical reasons to consider the X × Z interaction, this interaction can be tested even if X does not have a significant random slope. The background to this is the fact that if there is an X × Z interaction, the test for this interaction has a higher power to detect it than the test for a random slope. It is possible that, if one carries out both tests, the test of the random slope is non-significant whereas the test of the X × Z interaction is indeed significant.

This implies that either an error of the first kind is made by the test on the X × Z interaction (this is the case if there is no interaction), or an error of the second kind is made by the test of the random slope (this is the case if there is interaction). Assuming that the significance level is 0.05, and one focuses on the test of this interaction effect, the probability of the first event is less than 0.05, whereas the probability of an error of the second kind can be quite high, especially since the test of the random slope does not always have high power for testing the specific alternative hypothesis of an X × Z interaction effect. Therefore, provided that the X × Z interaction effect was hypothesized before looking at the data, the significant result of the test of this effect is what counts, and not the lack of significance for the random slope.

6.4.2 Joint consideration of level-one and level-two variables

The procedure of first building a level-one model and subsequently extending it with level-two variables is neat, but not always the most efficient or the most relevant. If there are level-two variables or cross-level interactions that are known to be important, why not include them in the model right from the start?
For example, it could be expected that a certain level-one variable has a within-group regression differing from its between-group regression. In such a case, one may wish to include the group mean of this variable right from the start. In this approach, the same steps are followed as in the preceding section, but without the distinction between level-one and level-two variables. This leads to the following steps.

1. Select relevant available level-one and level-two variables on the basis of subject-matter knowledge. Also include plausible interactions. Don't forget group means of level-one variables, to account for the possibility of differences between within-group and between-group regressions. Also don't forget cross-level interactions.

2. Select among the level-one variables those for which, on the basis of subject-matter knowledge, a group-dependent effect (random slope!) is plausible. (A possibility would be, again, to select those variables that are expected to have the strongest fixed effects.)

3. Estimate the model with the fixed effects of step 1 and the random effects of step 2.

4. Test the significance of the random slopes, and exclude the non-significant slopes from the model.

5. Test the significance of the regression coefficients, and exclude the non-significant coefficients from the model. This can also be a moment to consider the inclusion of more interaction effects.

6. Check whether other effects, thought less plausible at the start of model building, indeed are not significant. If they are significant, include them in the model.

In an extreme instance of step 1, one may wish to include all available variables and a large number of interactions in the fixed part. Similarly, one might wish to give all level-one variables random effects in step 2.
Whether this is practically possible will depend, among other things, on the number of level-one variables. Such an implementation of these steps leads to a backward model-fitting process, where one starts with a large model and reduces it by stepwise excluding non-significant effects. The advantage is that masking effects (where a variable is excluded early in the model-building process because of non-significance, whereas it would have reached significance if one had controlled for another variable) do not occur. The disadvantage is that it may be a very time-consuming procedure.

6.4.3 Concluding remarks about model specification

This section has suggested a general approach to the specification of multilevel models rather than laying out a step-by-step procedure. This is in accordance with our view of model specification being a process with two steering wheels and without foolproof procedures. This implies that, given one data set, a researcher (let alone two researchers...) may come up with more than one model, each seeming in itself a satisfactory result of a model specification process. In our view, this reflects the basic indeterminacy that is inherent to model fitting on the basis of empirical data. It is well possible that several different models correspond to a given data set, and that there are no compelling arguments to choose between them. Then it is better to accept this indeterminacy, and leave it to be resolved by future research, than to make an unwarranted choice between the different models.

This treatment of model specification may seem rather inductive, or data-driven. If one is in the fortunate situation of having a priori hypotheses to be tested (usually about regression coefficients), it is useful to distinguish between those parameters on which hypothesis tests are focused and the other parts of the model, required to have a well-fitting model and (consequently) a valid test of these hypotheses.
An inductive approach is adequate for the latter part of the model, while the tested parameters evidently are to be included in the model anyway.

Another aspect of model specification is the checking of assumptions. Independence assumptions should be checked in the course of specifying the random part of the model. Distributional assumptions, specifically the assumption of a normal distribution for the various random effects, can be checked by residual analysis. Checks of assumptions are treated in Chapter 9.

7 How Much Does the Model Explain?

7.1 Explained variance

The concept of 'explained variance' is well known in multiple regression analysis: it gives an answer to the question of how much of the variability of the dependent variable is accounted for by the linear regression on the explanatory variables. The usual measure for the explained proportion of variance is the squared multiple correlation coefficient, R². For the hierarchical linear model, however, the concept of 'explained proportion of variance' is somewhat problematic. In this section, we follow the approach of Snijders and Bosker (1994) to explain the difficulties and to give a suitable multilevel version of R².

One way to approach this concept is to transfer its customary treatment, well known from multiple linear regression, straightforwardly to the hierarchical random effects model: treat proportional reductions in the estimated variance components, σ² and τ0² in the random intercept model for two levels, as analogues of R² values. Since there are several variance components in the hierarchical linear model, this approach leads to several R² values, one for each variance component. However, this definition of R² now and then leads to unpleasant surprises: it sometimes happens that adding explanatory variables increases rather than decreases some of the variance components. Even negative values of R² are possible.
Negative values of R² clearly are undesirable and are not in accordance with its intuitive interpretation.

In the discussion of R²-type measures, it should be kept in mind that these measures depend on the distribution of the explanatory variables. This implies that these variables, denoted in this section by X, are supposed to be drawn at random from the population at level one and the population at level two, and not determined by the experimental design or the researcher's whims. In order to stress the random nature of the X-variables, the values of X are denoted Xhij instead of xhij as in earlier chapters.

7.1.1 Negative values of R²

As an example, we consider data from a study by Vermeulen and Bosker (1992) on the effects of part-time teaching in primary schools. The dependent variable Y is an arithmetic test score; the sample consists of 718 grade 3 pupils in 42 schools. An intelligence test score X is used as the predictor variable. Group sizes range from 1 to 33, with an average of 20. In the following, it sometimes is desirable to present an example for balanced data (i.e., with equal group sizes). The balanced data presented below are the data restricted to 33 schools with 10 pupils in each school, obtained by deleting schools with fewer than 10 pupils from the sample and randomly sampling 10 pupils from each of the remaining schools. For demonstration purposes, three models were fitted: the empty Model A; Model B, with the group mean X̄j as predictor variable; and Model C, with the within-group deviation score (Xij − X̄j) as predictor variable. Table 7.1 presents the results of the analyses, both for the balanced and for the entire data set. The residual variance at level one is denoted σ²; the residual variance at level two is denoted τ0².
Table 7.1 Estimated residual variance parameters σ̂² and τ̂0² for models with within-group and between-group predictor variables.

                                                    σ̂²      τ̂0²
Balanced design (rows for Models A and B illegible in this scan)
  C:  Yij = β0 + β1 (Xij − X̄j) + U0j + Rij        6.973    2.443
Unbalanced design
  A:  Yij = β0 + U0j + Rij                         7.653    2.708
  B:  Yij = β0 + β1 X̄j + U0j + Rij                7.685    2.038
  C:  Yij = β0 + β1 (Xij − X̄j) + U0j + Rij        6.658    2.801

From Table 7.1 we see that, in the balanced as well as in the unbalanced case, τ̂0² increases when the within-group deviation variable is added as an explanatory variable to the model. Furthermore, for the balanced case, σ̂² is not affected by adding a group-level variable to the model; in the unbalanced case, σ̂² increases slightly when the group mean is added. Defined as a proportional reduction in σ̂² or τ̂0², R² would thus be negative for this data set at one level or the other. It is argued below that defining R² as the proportional reduction in the residual variance parameters σ̂² and τ̂0², respectively, is not the best way to generalize R² from the linear regression model, and that more suitable definitions lead to the measures denoted below by R1² and R2².

7.1.2 Definitions of proportions of explained variance in two-level models

In multiple linear regression, the customary R² parameter can be introduced in several ways: e.g., as the maximal squared correlation coefficient between the dependent variable and some linear combination of the predictor variables, or as the proportional reduction in the residual variance parameter due to the joint predictor variables. A very appealing principle for defining measures of modeled (or explained) variation is the principle of proportional reduction of prediction error. This is one of the definitions of R² in multiple linear regression, and can be described as follows. A population of values is given for the explanatory and the dependent variables, (X1i, ..., Xqi, Yi), with a known joint probability distribution; β is the value of the vector b for which the expected squared error

E(Yi − Σh bh Xhi)²

is minimal.
(This is the definition of the ordinary least squares ('OLS') estimation criterion.) (In this equation, c₀ is defined as the intercept and X_0i = 1 for all i.) If, for a certain case i, the values of X_1i, ..., X_qi are unknown, then the best predictor for Y_i is its expectation E(Y_i), with mean squared prediction error var(Y_i); if the values X_1i, ..., X_qi are given, then the linear predictor of Y_i with minimum squared error is the regression value Σ_h γ_h X_hi. The difference between the observed value Y_i and the predicted value Σ_h γ_h X_hi is the prediction error. Accordingly, the mean squared prediction error is defined as

    var(Y_i − Σ_h γ_h X_hi) .

The proportional reduction of the mean squared error of prediction is the same as the proportional reduction in the unexplained variance due to the use of the variables X₁ to X_q. In a formula, it can be expressed by

    R² = [var(Y_i) − var(Y_i − Σ_h γ_h X_hi)] / var(Y_i)
       = 1 − var(Y_i − Σ_h γ_h X_hi) / var(Y_i) .

This formula expresses one of the equivalent ways to define R². The same principle can be used to define the explained proportion of variance in the hierarchical linear model. For this model, however, there are several options with respect to what one wishes to predict. Let us consider a two-level model with dependent variable Y. In such a model, one can choose between predicting an individual value Y_ij at the lowest level, or a group mean Ȳ_j. On the basis of this distinction, two concepts of explained proportion of variance in a two-level model can be defined. The first, and most important, is the proportional reduction of error for predicting an individual outcome. The second is the proportional reduction of error for predicting a group mean. To elaborate these concepts more specifically, first consider a
two-level random effects model with a random intercept and some predictor variables with fixed effects, but no other random effects:

    Y_ij = Σ_h γ_h X_hij + U_0j + R_ij .    (7.1)

Since we wish to discuss the definition of 'explained proportion of variance' as a population parameter, we assume temporarily that the vector γ of regression coefficients is known.

Level one

For the level-one explained proportion of variance, we consider the prediction of Y_ij for a randomly drawn level-one unit i within a randomly drawn level-two unit j. If the values of the predictors X_hij are unknown, then the best predictor for Y_ij is its expectation; the associated mean squared prediction error is var(Y_ij). If the value of the predictor vector X_ij for the given unit is known, then the best linear predictor for Y_ij is the regression value Σ_h γ_h X_hij (where X_0ij is defined as 1 for all i, j). The associated mean squared prediction error is

    var(Y_ij − Σ_h γ_h X_hij) = σ² + τ₀² .

The level-one explained proportion of variance is defined as the proportional reduction in mean squared prediction error:

    R₁² = 1 − var(Y_ij − Σ_h γ_h X_hij) / var(Y_ij) .    (7.2)

Now let us proceed from the population to the data. The most straightforward way to estimate R₁² is to consider σ̂² + τ̂₀² for the empty model,

    Y_ij = γ₀ + U_0j + R_ij ,    (7.3)

as well as for the fitted model (7.1), and compute 1 minus the ratio of these values. In other words, R̂₁² is just the proportional reduction in the value of σ̂² + τ̂₀² due to including the X-variables in the model. For a sequence of nested models, the contributions to the estimated value of (7.2) due to adding new predictors can be considered to be the contribution of these predictors to the explained variance at level one. To illustrate this, we once again use the data from the first (balanced) example, and estimate the proportional reduction of prediction error for a model where the within-group and between-group regression coefficients may be different.

Table 7.2
Estimating the level-one explained variance (balanced data).

  Model                                                        σ̂²      τ̂₀²
  A. Y_ij = β₀ + U_0j + R_ij                                   8.694    2.271
  D. Y_ij = β₀ + β₁ (X_ij − X̄_j) + β₂ X̄_j + U_0j + R_ij        6.973    0.991

From Table 7.2 we see that σ̂² + τ̂₀² for Model A amounts to 10.965, and for Model D to 7.964. R₁² is thus estimated to be 1 − (7.964/10.965) = 0.274.

Level two

The level-two explained proportion of variance can be defined as the proportional reduction in mean squared prediction error for the prediction of Ȳ_j for a randomly drawn level-two unit j. If the values of the predictors X_hij for the level-one units i within level-two unit j are completely unknown, then the best predictor for Ȳ_j is its expectation; the associated mean squared prediction error is var(Ȳ_j). If the values of all predictors X_hij for all i in this particular group j are known, then the best linear predictor for Ȳ_j is the regression value Σ_h γ_h X̄_hj; the associated mean squared prediction error is

    var(Ȳ_j − Σ_h γ_h X̄_hj) ,

where n is the number of level-one units on which the average is based. The level-two explained proportion of variance is now defined as the proportional reduction in mean squared prediction error for Ȳ_j:

    R₂² = 1 − var(Ȳ_j − Σ_h γ_h X̄_hj) / var(Ȳ_j) .    (7.4)

To estimate the level-two explained proportion of variance, we follow a similar approach as above: R₂² is estimated as the proportional reduction in the value of σ̂²/n + τ̂₀², where n is a representative value for the group size. In the example given earlier, let us use for n a usual group size of n = 30. For Model A the value of σ̂²/n + τ̂₀² is 8.694/30 + 2.271 = 2.561, whereas for Model D this amounts to 6.973/30 + 0.991 = 1.223. R₂² is thus estimated at 1 − (1.223/2.561) = 0.52. It is natural that the mean squared error for predicting a group mean should depend on the group size. Often one can use for n a value which is deemed a priori to be 'representative'.
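The two estimates just computed follow directly from the tabulated variance components. A short Python sketch (the helper names are illustrative; the inputs are the Model A and Model D values from Table 7.2, with the harmonic mean as one rule for choosing a representative n when group sizes vary, as discussed below):

```python
# Proportional-reduction estimates R1^2, cf. (7.2), and R2^2, cf. (7.4),
# from fitted variance components of the empty and the full model.

def r2_level1(sigma2_0, tau2_0, sigma2_m, tau2_m):
    """Estimated R1^2: proportional reduction in sigma^2 + tau0^2."""
    return 1 - (sigma2_m + tau2_m) / (sigma2_0 + tau2_0)

def r2_level2(sigma2_0, tau2_0, sigma2_m, tau2_m, n):
    """Estimated R2^2: proportional reduction in sigma^2/n + tau0^2."""
    return 1 - (sigma2_m / n + tau2_m) / (sigma2_0 / n + tau2_0)

def harmonic_mean(group_sizes):
    """One choice of a 'representative' n when group sizes vary."""
    return len(group_sizes) / sum(1.0 / n_j for n_j in group_sizes)

r1 = r2_level1(8.694, 2.271, 6.973, 0.991)        # Model A vs Model D
r2 = r2_level2(8.694, 2.271, 6.973, 0.991, n=30)  # representative n = 30
print(round(r1, 3), round(r2, 2))
```

The same two functions apply to any pair of nested random intercept models, which is what makes the proportional-reduction definition convenient in practice.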
For example, if normal class size is considered to be 30, then even if, because of missing data, the values of n_j in the data set are on average less than 30, it is advisable to use the representative value n = 30. In the case of varying group sizes, if it is unclear how a representative group size should be chosen, one possibility is to use the harmonic mean, defined by N / {Σ_j (1/n_j)}. It is quite common that the data within the groups are based on a sample and that not the entire groups in the population are observed, so that the group sizes in the data set do not reflect the group sizes in the population. In such a case, since it is more relevant to predict population group averages than sample group averages, it is advisable to let n reflect the group sizes in the population rather than the sample group sizes. If group sizes are very large, e.g., when the level-two units are defined by municipalities or other regions and the level-one units by their inhabitants, this means that, practically speaking, R₂² is the proportional reduction in the intercept variance.

Population values of R₁² and R₂² are non-negative

What happens to R₁² and R₂² when predictor variables are added to the multilevel model? Is it possible that adding predictor variables leads to smaller values of R₁² or R₂²? Can we even be sure at all that these quantities are positive? It turns out that a distinction must be made between the population parameters R₁² and R₂² and their estimates from data. Population values of R₁² and R₂² in correctly specified models, with a constant group size n, become smaller when predictor variables are deleted, provided that the variables U_0j and R_ij on one hand are uncorrelated with all the X_hij variables on the other hand (the usual model assumption). For estimates of R₁² and R₂², however, the situation is different: these estimates sometimes do increase when predictor variables are deleted.
When it is observed that an estimated value for R₁² or R₂² becomes smaller by the addition of a predictor variable, or larger by the deletion of a predictor variable, there are two possibilities: either this is a chance fluctuation, or the larger model is misspecified. It will depend on how large the change in R₁² or R₂² is, and on the subject-matter insight of the researcher, whether the first or the second possibility is deemed more likely. In this sense, changes of R₁² or R₂² in the 'wrong' direction serve as a diagnostic for possible misspecification. This possibility of misspecification refers to the fixed part of the model, i.e., the specification of the explanatory variables having fixed regression coefficients, and not to the random part of the model. We return to this in Section 9.2.3.

7.1.3 Explained variance in three-level models

In three-level random intercept models (Section 4.8), the residual variance, or mean squared prediction error, is the sum of the variance components at the three levels. Accordingly, the level-one explained proportion of variance can be defined here as the proportional reduction in the sum of these three variance parameters.

Example 7.1 Variance in maths performance explained by IQ.
In Example 4.8, Table 4.6 exhibits the results of the empty model (Model 1) and a model in which IQ has a fixed effect (Model 2). The total variance in the empty model is 7.816 + 1.746 + 2.124 = 11.686, while the total unexplained variance in Model 2 is 6.910 + 0.701 + 1.109 = 8.720. Hence the level-one explained proportion of variance is 1 − (8.720/11.686) = 0.25.

7.1.4 Explained variance in models with random slopes

The idea of using the proportional reduction in the prediction error for Y_ij and Ȳ_j, respectively, as the definitions of explained variance at either level, can be extended to two-level models with one or more random regression coefficients. The formulae to calculate R₁² and R₂² can be found in Snijders and Bosker (1994).
However, the estimated values for R₁² and R₂² usually change only very little when random regression coefficients are included in the model. The formulae for estimating R₁² and R₂² in models with random intercepts only are very easy. Estimating R₁² and R₂² in models with random slopes is more tedious. The software package HLM (Bryk et al., 1996), however, provides the necessary estimates, since it produces not only estimates of the variance components but also of the observed residual between-group variance

    var(Ȳ_j − Σ_h γ_h X̄_hj) .    (7.5)

Using this estimate, denoted in the HLM output as D-BAR, one can calculate the estimate of R₂² (as the proportional reduction in D-BAR). In HLM versions 2.20 and higher, the output includes estimates of τ₀² and the reliability of the intercepts, but no longer D-BAR. However, this observed residual between-group variance can now be calculated as τ̂₀²/reliability. Since the level-two variance is not constant in random slope models, it is usually advisable here to center explanatory variables around their grand means, so that R₂² now refers to the explained variance at level two for the average level-two unit.

The simplest possibility for estimating R₁² and R₂² in models with random slopes is to re-estimate the models as random intercept models with the same fixed parts (omitting the random slopes), and use the resulting parameter estimates to calculate R₁² and R₂² in the usual (simple) way for random intercept models. This will usually yield values that are very close to the values for the random slopes model.

Example 7.2 Explained variance for language scores.
In Table 5.4, a model was presented for the data set on language scores in elementary schools used throughout Chapters 4 and 5. When a random intercept model is fitted with the same fixed part, the estimated variance parameters are τ̂₀² = 7.61 for level two and σ̂² = 39.82 for level one. For the empty model, Table 4.1 shows that the estimates are τ̂₀² = 19.42 and σ̂² = 64.57.
This implies that the explained variance at level one is 1 − (39.82 + 7.61)/(64.57 + 19.42) = 0.44 and, using an average class size of n = 25, the explained variance at level two is 1 − [(39.82/25) + 7.61]/[(64.57/25) + 19.42] = 0.58. These explained variances are quite high, and can be attributed mainly to the explanation by IQ.

7.2 Components of variance¹

¹ This is a more advanced section which may be skipped by the reader.

The preceding section focused on the total amount of variance that can be explained by the explanatory variables. In these measures of explained variance, only the fixed effects contribute. It can also be theoretically illuminating to decompose the observed variance of Y into parts that correspond to the various constituents of the model. This is discussed in this section for a two-level model. For the dependent variable Y, the level-one and level-two variances in the empty model (4.6) are denoted σ₀² and τ₀², respectively. The total
Deviating from the notation in other parts of this book, matrix notation is used, and X and Z denote vectors, ‘The explanatory variables X;,..» Xp at level one are collected in the vector X with value Xisfor unit in group j. Tt is esoumed more specifically that Xij can be decomposed into independent leve-one and level io pats, Xi = XY + XP (78) (i, a kind of multivariate hierarchical near model without a fixed part). ‘The expectation is denoted EXy = px, the evel-one covariance matric is cov XY) = ER, and the level-two covariance matrix is cor(XP) = BE ‘This implies thatthe overall covariance matrix of X isthe sum ofthese two, cov(Xy) = BY + ER =Ey . Further, the covaciance matrix of the group average for a group of size n is cov(X 4) = day 408. Jk may be noted that this notation deviates slightly from the common split of Xy into Xy = (Xy —Xy) + Ky (7.7) ‘Phe split (7.6) is & population-based split, whereas the more usual split (7.7) js sample-based. In the notation used here, the covariance matrix of the within-group deviation variable is = Satay, while the covariance matic of the group means is cov(Xy) = doy +BR. For the discusion inthis section, Ue prevent notation Is more convenient. ‘The spit (7.6) is not a completely imocuous assumption. ‘The indepen- dence between XV and X? implies that the covariance matrix of the group cov(Xi5 ~ Components of variance 107 means is anges? than 1/(n ~ 1) times the within-group covariance matrix of X. ‘The vector of explanatory variables Z = (Ziy.~yZe) at level two has value Z; for group j. ‘The vector of expectations of Z is denoted £2; = 2, and the covariance matrix is cov(Z) = Bz» In the random intercept model (4.7), denote the vector of regression cocficients of the X's by yx = (T1051 90)! 
and the vector of regression coefficients of the Z's by γ_Z = (γ_{p+1}, ..., γ_{p+q})′. Taking into account the stochastic nature of the explanatory variables then leads to the following expression for the variance of Y:

    var(Y_ij) = γ_X′ Σ_X γ_X + γ_Z′ Σ_Z γ_Z + τ₀² + σ²
              = γ_X′ Σ_X^W γ_X + γ_X′ Σ_X^B γ_X + γ_Z′ Σ_Z γ_Z + τ₀² + σ² .    (7.8)

It may be illuminating to remark that for the special case that all explanatory variables are uncorrelated, this expression is equal to

    var(Y_ij) = Σ_h γ_h² var(X_hij) + Σ_h γ_{p+h}² var(Z_hj) + τ₀² + σ²

(this holds, e.g., if there is only one level-one and only one level-two explanatory variable). This formula shows that, in this special case, the contribution of each explanatory variable to the variance of the dependent variable is given by the product of its squared regression coefficient and its variance. The decomposition of X into independent level-one and level-two parts allows us to indicate precisely which parts of (7.8) correspond to the unconditional level-one variance of Y and which parts to the unconditional level-two variance:

    level-one variance of Y = γ_X′ Σ_X^W γ_X + σ² ,
    level-two variance of Y = γ_X′ Σ_X^B γ_X + γ_Z′ Σ_Z γ_Z + τ₀² .

This shows how the within-group variation of the level-one variables contributes to the unconditional level-one variance, whereas parts of the level-two variance are created by the variation of the level-two variables, and also by the between-group (composition) variation of the level-one variables. Recall, however, the definition of Σ_X^B, which implies that the between-group variation of X is taken net of the 'random' variation of the group means that may be expected given the within-group variation of X_ij.

7.2.2 Random slope models

For the hierarchical linear model in its general specification given by (5.12), a decomposition of the variance is very complicated because of the presence of the cross-level interactions.
Therefore the decomposition of the variance is discussed for random slope models in the formulation (5.14), repeated here as

    Y_ij = γ₀ + Σ_h γ_h X_hij + U_0j + Σ_h U_hj X_hij + R_ij ,    (7.9)

without bothering about whether some of the X_h are level-one or level-two variables, or products of a level-one and a level-two variable. Recall that in this section the explanatory variables X are stochastic. The vector X = (X₁, ..., X_q) of all explanatory variables has mean μ_X and covariance matrix Σ_X. The sub-vector (X₁, ..., X_p) of the variables that have random slopes has mean μ_X(p) and covariance matrix Σ_X(p). These covariance matrices could be split into within-group and between-group parts, but this is left up to the reader. The covariance matrix of the random slopes (U_1j, ..., U_pj) is denoted T₁₁, and the p × 1 vector of the intercept-slope covariances is denoted T₁₀. With these specifications, the variance of the dependent variable can be shown to be given by

    var(Y_ij) = γ′ Σ_X γ + τ₀² + 2 μ_X(p)′ T₁₀ + μ_X(p)′ T₁₁ μ_X(p) + trace(T₁₁ Σ_X(p)) + σ² .    (7.10)

(A similar expression, but without taking the fixed effects into account, is given by Snijders and Bosker (1993) as formula (21).) A brief discussion of all terms in this expression is as follows.

1. The first term,

    γ′ Σ_X γ ,

gives the contribution of the fixed effects and may be regarded as the 'explained part' of the variance. This term could be split into a level-one and a level-two part as in the preceding subsection.

2. The part

    τ₀² + 2 μ_X(p)′ T₁₀ + μ_X(p)′ T₁₁ μ_X(p)    (7.11)

should be seen as one piece. One could rescale all variables with random slopes to have a zero mean (cf. the discussion in Section 5.1.2); this would lead to μ_X(p) = 0 and leave of this piece only the intercept variance τ₀². In other words, (7.11) is just the intercept variance after subtracting the mean from all variables with random slopes.

3. The part

    trace(T₁₁ Σ_X(p))

is the contribution of the random slopes to the variance of Y. In the extreme case that all variables X₁ to X_p would be uncorrelated and have unit variances, this expression reduces to the sum of the random slope variances. This term also could be split into a level-one and a level-two part.

4. Finally, σ² is the residual level-one variability that can be explained neither on the basis of the fixed effects, nor on the basis of the latent group characteristics that are represented by the random intercept and slopes.

8 Heteroscedasticity

The hierarchical linear model is a quite flexible model, and it has some other features in addition to the possibility of representing a nested data structure. One of these features is the possibility of representing multilevel as well as single-level regression models where the residual variance is not constant. In ordinary least squares regression analysis, one of the standard assumptions is homoscedasticity: the residual variance is constant, i.e., it does not depend on the explanatory variables. This assumption was made in the preceding chapters, e.g., for the residual variance at level one and for the intercept variance at level two. The techniques used in the hierarchical linear model allow us to relax this assumption and replace it by the weaker assumption that variances depend linearly or quadratically on explanatory variables. Thus, an important special case of heteroscedastic models (i.e., models with heterogeneous variances) is obtained, viz., heteroscedasticity where the variance depends on given explanatory variables. This feature is implemented at this moment in the MLn/MLwiN program and in HLM version 5, and can also be obtained in SAS (cf. Chapter 15). This chapter treats a two-level model, but the techniques treated (and the software mentioned) can be used also for heteroscedastic single-level regression models.

8.1 Heteroscedasticity at level one

8.1.1 Linear variance functions
In a hierarchical linear model it sometimes makes sense to consider the possibility that the residual variance at level one depends on one of the predictor variables. An example is a situation where two measurement instruments have been used, each with a different precision, resulting in two different values for the measurement error variance, which is a component of the level-one variance. If the level-one residual variance depends linearly on some variable X₁, it can be expressed by

    level-one variance = σ₀² + 2 σ₀₁ x₁ij ,    (8.1)

where the value of X₁ for a given unit is denoted x₁ij, while the random part at level one now has two parameters, σ₀² and σ₀₁. The reason for incorporating the factor 2 will become clear later, when also quadratic variance functions are considered. For example, when X₁ is a dummy variable with values 0 and 1, the residual variance is σ₀² for the units with X₁ = 0 and σ₀² + 2σ₀₁ for the units with X₁ = 1. When the level-one variance depends on more than one variable, their effects can be added to the variance function (8.1) by adding terms 2σ₀₂ x₂ij, etc.

Example 8.1 Residual variance depending on gender.
In the example used in Chapters 4 and 5, the residual variance might depend on the pupil's gender. To investigate this in a model that is not overly complicated, we take the model of Table 5.4, delete the effects of the multi-grade classes, and add the effect of gender (a dummy variable which is 0 for boys and 1 for girls). Table 8.1 presents estimates for two models: one with constant residual variance, and one with residual variances depending on gender. Thus, Model 1 is a homoscedastic model and Model 2 a gender-dependent heteroscedastic model.

Table 8.1 Homoscedastic model (Model 1) and gender-dependent heteroscedastic model (Model 2). Both models have fixed effects of IQ, SES, gender, and the school average of IQ, and a random slope for IQ. The level-one variance parameters and deviances are:

                              Model 1     Model 2
  σ₀² constant term           37.56       38.72
  σ₀₁ gender effect                       −1.21
  Deviance                    15005.5     15004.4
According to formula (8.1), the residual variance in Model 2 is 38.72 for boys and 38.72 − 2 × 1.21 = 36.30 for girls. The residual variance estimated in the homoscedastic Model 1 is very close to the average of these two figures. This is natural, since about half of the pupils are girls and half are boys. The difference between the two variances is, however, not significant: the deviance test yields χ² = 15005.5 − 15004.4 = 1.1, d.f. = 1, p > 0.2. The fixed effect of gender is quite significant (t = 2.64/0.26 = 10.2, p < 0.0001). Controlling for IQ and SES, girls score on average 2.6 higher than boys.

Analogous to the dependence due to the multilevel nesting structure as discussed in Chapter 2, heteroscedasticity has two faces: it can be a nuisance and it can be an interesting phenomenon. It can be a nuisance because the failure to take it into account may lead to a misspecified model and, hence, incorrect parameter estimates and standard errors. On the other hand, it can also be an interesting phenomenon in itself. When high values on some variable X₁ are associated with a higher residual variance, this means that for the units who score high on X₁ there is, within the context of the model being considered, more uncertainty about their value on the dependent variable Y. Thus, it may be interesting to look for explanatory variables that differentiate especially between units who score high on X₁. Sometimes a non-linear function of X₁, or an interaction involving X₁, could play such a role.

Example 8.2 Heteroscedasticity related to IQ.
Continuing the previous example, it is now investigated whether the residual variance depends on IQ. The corresponding parameter estimates are presented as Model 3 in Table 8.2.

Table 8.2 Heteroscedastic models depending on IQ: Model 3 with a linear level-one variance function, Model 4 also with a spline-based fixed effect of IQ; deviances 14960.0 and 14908.1, respectively.

Comparing the deviance to Model 1 shows that there is a quite significant heteroscedasticity associated with IQ: χ² = 15005.5 − 14960.0 = 45.5, d.f. = 1, p < 0.0001. The level-one variance function is (cf. (8.1))

    37.83 − 1.94 IQ .

This shows that language scores of the less intelligent pupils are more variable than language scores of the more intelligent. The standard deviation of IQ is 2.07 and the mean is 0. Thus, the range of the level-one variance, when IQ is in the range of the mean ± twice the standard deviation, is between 29.79 and 45.87. This is an appreciable variation around the average value of 37.56 estimated in the homoscedastic Model 1.

Prompted by the IQ-dependent heteroscedasticity, the data were explored for effects that might differentiate between the pupils with lower IQ scores. Non-linear effects of IQ and some interactions involving IQ were tried. It appeared that a non-linear effect of IQ is discernible, represented better by a so-called spline function¹ than by a polynomial function of IQ. Specifically, the coefficient of the square of IQ turned out to be different for negative than for positive IQ values (recall that IQ was standardized to have an average of 0). This is represented in Model 4 of Table 8.2 by the variables

    IQ⁻ = IQ if IQ < 0,  IQ⁻ = 0 if IQ ≥ 0 ;
    IQ⁺ = 0 if IQ < 0,  IQ⁺ = IQ if IQ ≥ 0 .

Adding the squares of these two variables to the fixed part gives a quite significant decrease of the deviance (51.9 for two degrees of freedom) and completely takes away the random slope of IQ. The IQ-related heteroscedasticity, however, becomes even stronger. The total fixed effect of IQ now is given by

    2.293 IQ + 0.266 IQ²  for IQ < 0 ,    (8.3)

with a negative coefficient for the square of IQ when IQ ≥ 0. The graph of this effect is shown in Figure 8.1. It is an increasing function which flattens out for low and for high values of IQ, in a way that cannot be well represented by a quadratic or cubic function.

Figure 8.1 Effect of IQ on the language test.

This is an interesting turn of this modeling exercise, and a nuisance only because it indicates once again that school performance is a quite complicated subject. Our interpretation of this data set toward the closing of Chapter 5 was that there is a random slope of IQ, i.e., schools differ in the effect of IQ. Now it turns out that the data are clearly better described by a model in which IQ has a non-linear effect, the effect of IQ being stronger in the middle range than toward its extreme values; in which the effect of IQ does not vary across schools; and in which the combined effects of IQ, SES, gender, and the school average of IQ predict the language scores for high-IQ pupils much better than for low-IQ pupils.

¹ Spline functions (introduced more extensively in Section 12.2.2 and treated more fully, e.g., in Seber and Wild, 1989, Section 9.8) are a more flexible class of functions than polynomials. They are polynomials of which the coefficients may be different on several intervals.

8.1.2 Quadratic variance functions

The formal representation of level-one heteroscedasticity is based on including random effects at level one, as spelled out in Goldstein (1995; see the remarks on a complex random part at level one). In Section 5.1.1 it was remarked already that random slopes, i.e., random effects at level two, lead to heteroscedasticity. This also holds for random effects at level one. Consider a two-level model and suppose that the level-one random part is

    random part at level one = R_0ij + R_1ij x₁ij .    (8.4)

Denote the variances of R_0ij and R_1ij by σ₀² and σ₁², respectively, and their covariance by σ₀₁.
The rules for calculating with variances and covariances imply that

    var(R_0ij + R_1ij x₁ij) = σ₀² + 2 σ₀₁ x₁ij + σ₁² x₁ij² .    (8.5)

Formula (8.4) is just a formal representation, used in MLn/MLwiN to specify this heteroscedastic model. For the interpretation one should rather look at (8.5). This formula can be used without the interpretation that σ₀² and σ₁² are variances and σ₀₁ a covariance; these parameters might be any numbers. The formula only implies that the residual variance is a quadratic function of x₁ij. In the previous section, the case was used where σ₁² = 0, producing the linear function (8.1). If a quadratic function is desired, all three parameters are estimated from the data.

Example 8.3 Educational level attained by pupils.
This example is about a cohort of pupils entering secondary school in 1989, studied by Dekkers, Bosker, and Driessen (1998). The question is how well the educational level attained in 1995 can be predicted from individual characteristics and school achievement at the end of primary school. The data are about 15,007 pupils in 369 secondary schools. The dependent variable is an educational attainment variable, defined as 12 minus the minimum number of additional years of schooling it would take theoretically for this pupil in 1995 to gain a certificate giving access to university. The range is 0 to 13 (e.g., the value of 13 means that the pupil already is a first-year university student). Explanatory variables are teacher's rating at the end of primary school
(an advice on the most suitable type of secondary school, range 1 to 4), achievement on three standardized tests, the so-called CITO tests, on language, arithmetic, and information processing (all with a mean value around 11 and a standard deviation between 4 and 5), socio-economic status (a discrete ordered scale with values 1 to 6), gender (0 for boys, 1 for girls), and minority status (based on the parents' country of birth, 0 for the Netherlands and other industrialized countries, 1 for other countries).

Table 8.3 presents the results of a random intercept model as Model 1. Note that the standard errors are quite small due to the large number of pupils in this data set. The explained proportion of variance at level one is R₁² = 0.139. Model 2 shows the results for a model where the residual variance depends quadratically on SES. It follows from (8.5) that the residual variance here is given by

    residual variance = 3.475 − 0.632 SES + 0.056 SES² .

The deviance difference between Models 1 and 2 (χ² = 97.3, d.f. = 2, p < 0.0001) indicates that the dependence of the residual variance on SES is quite significant. The variance function decreases curvilinearly from the value 2.91 for SES = 1 to a minimum value of 1.75 for SES = 6. This implies that when educational attainment is predicted by the variables in this model, the uncertainty in the prediction is highest for low-SES pupils. It is reassuring that the estimates and standard errors of the other effects are not appreciably different between Models 1 and 2.

The specification of this random part was checked in the following way. First the models with only a linear or only a quadratic variance term were estimated separately. This showed that both variance terms are significant. Further, it might be possible that the SES-dependence of the residual variance is a random slope in disguise. Therefore a model with a random slope (i.e., a random effect at level two) for SES also was fitted.
This showed that the random slope was barely significant and not very large, and did not take away the heteroscedasticity effect.

Table 8.3 Heteroscedastic model depending quadratically on SES. The fixed parts of Models 1 and 2 contain the effects of teacher's rating, the three CITO tests, SES, and gender. The level-one variance parameters of Model 2 are:

  σ₀² constant term           3.475
  σ₀₁ linear SES effect      −0.316   (S.E. 0.064)
  σ₁² quadratic SES effect    0.056

The SES-dependent heteroscedasticity led to the consideration of non-linear effects of SES and interaction effects involving SES. Since the SES variable assumes values 1 through 6, five dummy variables were used, contrasting the respective SES values to the reference category SES = 3. In order to limit the number of variables, the interactions with SES were defined as interactions with the numerical SES variable rather than with the categorical variable represented by these dummies. For the same reason, for the interaction of SES with the CITO tests only the average of the three CITO tests (range 1 to 20, mean 11.7) was considered. Product interactions were considered of SES with gender, with the average CITO score, and with minority status. As factors in the products, SES and the CITO tests were centered approximately by using (SES − 3) and (CITO average − 12).
Gender and minority status, being 0–1 variables, did not need to be centered. This implies that although SES is represented by dummies (i.e., as a categorical variable) in the main effect, it is used as a numerical variable in the interaction effects. (The main effect of SES as a numerical variable is implicitly also included, because it can be represented as an effect of the dummy variables; therefore the numerical SES variable does not need to be added to the model.) Minority status does not have a significant main effect when added to Model 2, but the main effect was nonetheless included to facilitate the interpretation of the interaction effects. The results are in Table 8.4.

Table 8.4 Heteroscedastic model with interaction effects.
[Table not legible in the scan. For Model 3 it reports the fixed effects (intercept 6.308, S.E. 0.060; teacher's rating; CITO arithmetic 0.0568; CITO information 0.0552; the SES dummies; gender; minority status; and the interactions of SES with gender, with the average CITO score, and with minority status, the last with coefficient 0.219) and the random part (intercept variance; level-one variance parameters: constant term 3.422, linear SES effect −0.302 (S.E. 0.064), quadratic SES effect 0.058; deviance).]

Model 3 as a whole is a strong improvement over Model 2: the deviance difference is χ² = 443.8 for d.f. = 7 (p < 0.0001). For the evaluation of the non-linear effect of SES, note that since SES = 3 is the reference category, the parameter value for SES = 3 should be taken as 0.0. This demonstrates that the effect of SES is non-linear, but it is indeed an increasing function of SES. The differences between the SES values 1, 2, and 3 are larger than those between the values 3, 4, 5, and 6. The interaction effects of SES with gender and with minority status are significant. The main effect of minority status corresponds with its effect for SES = 3, since the product variable was defined using (SES − 3), and is practically nil. Thus it turns out that, when the other included variables (including the main effect of SES!)
are controlled for, pupils with parents from a non-industrialized country attain a higher educational level than those with parents from the Netherlands or another industrialized country when they come from a low-SES family, but a lower level if they come from a high-SES family.

In Model 3, the residual variance is

residual variance = 3.422 − 0.604 SES + 0.058 SES².

This decreases from 2.87 for SES = 1 to 1.71 for SES = 6. Thus, with the inclusion of the interactions and a non-linear SES effect, the residual variance has hardly decreased.

Model 3 was obtained, prompted by the detected heteroscedasticity, in a data-driven rather than theory-driven way. Therefore one may question the validity of the tests for the newly included effects: are these not the result of chance capitalization? The interaction effect of SES with minority status has such a high t-value, 4.4, that its significance is beyond doubt, even when the data-driven selection is taken into account. For the interaction of SES with gender this is less clear. For convincing hypothesis tests, it would have been preferable to use cross-validation: split the data into two subsets and use one subset for the model selection and the other for the tests of the effects. Since the two subsets should be independent, it would be best to select half the schools at random and use all pupils in these schools for one subset, and the other schools and their pupils for the other.

More generally, the residual variance may depend on more than one variable; in terms of representation (8.4), several variables may have random effects at level one. These can be level-two as well as level-one variables.
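Such a multi-variable variance function is a quadratic form in the variables with random effects at level one, as formalized next; a minimal sketch with hypothetical covariance values:

```python
def level_one_variance(z, cov):
    """Residual variance when the variables z1, ..., zp have random effects
    at level one: the quadratic form (1, z)' cov (1, z), where cov holds the
    variances and covariances of R_0ij, ..., R_pij."""
    zz = [1.0] + list(z)
    n = len(zz)
    return sum(cov[h][k] * zz[h] * zz[k] for h in range(n) for k in range(n))

# Hypothetical parameter values for two variables, for illustration only:
# diagonal entries are sigma_0^2, sigma_1^2, sigma_2^2; off-diagonal
# entries are the covariances sigma_hk.
cov = [
    [3.4, -0.3, 0.1],
    [-0.3, 0.06, 0.0],
    [0.1, 0.0, 0.02],
]
```

Setting the variances of the non-constant terms to zero and keeping only the covariances with the constant term yields a variance function that is linear in the variables.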
If the random part at level one is given by

random part at level one = R0ij + R1ij z1ij + … + Rpij zpij,

while the variances and covariances of the Rhij are denoted σh² and σhk, then the variance function is

residual variance = Σh σh² zhij² + 2 Σh Σk<h σhk zhij zkij,  (8.6)

where z0ij = 1. This complex level-one variance function can be used for any values of the parameters σh² and σhk, provided that the residual variance is positive. The simplest case is to include only σ0² and the 'covariance' parameters σ0h, leading to the linear variance function

residual variance = σ0² + 2σ01 z1ij + 2σ02 z2ij + … + 2σ0p zpij.

Correlates of diversity
It can be important to investigate the factors that are associated with outcome variability. For example, Raudenbush and Bryk (1987) (see also Bryk and Raudenbush, 1992, pp. 169–172) investigated the effects of school policy and organization on mathematics achievement of pupils. They did this by considering within-school dispersion as a dependent variable. The preceding section offers an alternative approach which remains closer to the hierarchical linear model. In this approach, relevant level-two variables are considered as potentially being associated with level-one heteroscedasticity.

Example 8.4 School composition and outcome variability.
Continuing the preceding example on educational attainment predicted from data available at the end of primary school, it is now investigated whether the composition of the school with respect to socio-economic status is associated with diversity in later educational attainment. It turns out that socio-economic status has an intraclass correlation of 0.25, which is quite high. Therefore the average socio-economic status of schools could be an important factor in the within-school
processes associated not only with average outcomes but also with outcome diversity.

To investigate this, the school average of SES was added to Model 3 of Table 8.4, both as a fixed effect and as a linear effect on the level-one variance. The non-significant SES-by-CITO interaction was deleted from the model. The school average of SES ranges from 1.4 to 5.5, has a mean of 3.7, and a standard deviation of 0.59. This variable is denoted by SA-SES. Its fixed effect is 0.417 (S.E. 0.109, t = 3.8). We further present only the random part of the resulting model, in Table 8.5.

Table 8.5 Heteroscedastic model depending on average SES.
[Table not fully legible in the scan. For Model 4 it reports the level-two intercept variance and the level-one variance parameters: constant term, linear SES effect −0.381 (S.E. 0.068), quadratic SES effect, and linear SA-SES effect −0.265 (S.E. 0.062); deviance 45898.0.]

To test the effect of SA-SES on the level-one variance, the model was also estimated without this effect. This yielded a deviance of 46029.6, so the test statistic is χ² = 46029.6 − 45898.0 = 131.6 with d.f. = 1, which is very significant. The quadratic effect of SA-SES was estimated both as a main effect and for level-one heteroscedasticity, but neither was significant.

How important is the effect of SA-SES on the level-one variance? The standard deviation of SA-SES is 0.59, so four times the standard deviation (the difference between the few percent highest-SA-SES and the few percent lowest-SA-SES schools) leads to a difference in the residual variance of 4 × 0.59 × 0.265 = 0.63. For an average residual level-one variance of 2.0 (see Model 1 in Table 8.3), this is an appreciable difference.

This 'random effect at level 1' of SA-SES might be explained by interactions between SA-SES and pupil-level variables. The interactions of SA-SES with gender and with minority status were considered. Adding these to Model 4 yielded interaction effects of −0.219 (S.E. 0.096, t = −2.28) for SA-SES by minority status and −0.225 (S.E.
0.050, t = −4.50) for SA-SES by gender. This implies that, although a high school average for SES leads to higher educational attainment on average (the main effect of 0.417 reported above), this effect is weaker for minority pupils and for girls. These interactions did, however, not lead to a noticeably lower effect of SA-SES on the residual level-one variability.

8.2 Heteroscedasticity at level two

The intercept variance and the random slope variance in the hierarchical linear model were assumed in the preceding chapters to be constant across groups. This is a homoscedasticity assumption at level two. If there are theoretical or empirical reasons to drop this assumption, it could be replaced by the weaker assumption that these variances depend on some level-two variable Z. For example, if Z is a dummy variable with values 0 and 1, distinguishing two types of groups, the assumption would be that the intercept and slope variances depend on the type of group. In this section we only discuss the case of level-two variances depending on a single variable Z; the discussion can be extended to variances depending on more than one level-two variable.

Suppose that the intercept variance depends linearly or quadratically on a variable Z. The intercept variance then can be expressed by

intercept variance = τ0² + 2τ01 zj + τ1² zj²,  (8.7)

for parameters τ0², τ01, and τ1². For example, in organizational research, when the level-two units are organizations, it is possible that small-sized organizations are (because of greater specialization or other factors) more different from one another than large-sized organizations. Then Z could be some measure of the size of the organization, and (8.7) would indicate that it depends on Z whether differences between organizations tend to be small or large, where 'differences' refers to the intercepts in the multilevel model. The expression (8.7) is a quadratic function of Z, so that it can represent a curvilinear dependence of the intercept variance on Z. If a linear function is used (i.e., τ1²
= 0), the intercept variance is either increasing or decreasing over the whole range of Z.

Analogous to the construction at level one, this variance function can be obtained by using the 'random part'

random part at level two = U0j + U1j zj.  (8.8)

Strange as it may sound, the level-two variable Z formally gets a random slope at level two. (Note that Section 5.1.1 showed that random slopes for level-one variables also create some kind of heteroscedasticity, viz., heteroscedasticity of the observations Y.) The parameters τ0², τ1², and τ01 are, as in the preceding section, not themselves to be interpreted as variances and a corresponding covariance. The interpretation is by means of the variance function (8.7). Therefore it is not required that τ01² ≤ τ0² τ1².

[…]

Use in the fixed part the linear effect X, the quadratic effect X², and the effects of the 'half squares' f1(X), …, fK(X). Together these functions can represent a wide variety of smooth functions of X, as is evident from the figures on pages 113 and 217. If some of the fk(X) have non-significant effects they can be left out of the model, and by trial and error the choice of the so-called nodes xk may be improved. Such an explorative procedure was used to obtain the functions displayed on the mentioned pages.

9.4 Specification of the random part

The specification of the random part was discussed already above and in Section 6.4. If certain variables are mistakenly omitted from the random part, the tests of their fixed coefficients may also be unreliable. Therefore it is advisable to check the randomness of the slopes of all variables of main interest, and not only of those for which a random slope is theoretically expected.

The random part specification is directly linked to the structure of the covariance matrix of the observations. In Section 5.1.1 we already saw that a random slope implies a heteroscedastic specification of the variances of the observations and of the covariance between level-one units in the same group (the level-two unit).
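For one random slope this induced covariance structure can be written out explicitly; a small sketch (these are the standard formulas for a random intercept-and-slope model, with hypothetical parameter values in the usage example):

```python
def obs_variance(x, tau00, tau01, tau11, sigma2):
    """var(Y_ij) = tau00 + 2*tau01*x + tau11*x**2 + sigma2
    under a model with a random intercept and one random slope."""
    return tau00 + 2.0 * tau01 * x + tau11 * x * x + sigma2

def within_group_covariance(xi, xk, tau00, tau01, tau11):
    """cov(Y_ij, Y_kj) = tau00 + tau01*(xi + xk) + tau11*xi*xk
    for two different level-one units i and k in the same group."""
    return tau00 + tau01 * (xi + xk) + tau11 * xi * xk
```

For instance, with τ0² = 0.5, τ01 = 0.1, τ1² = 0.2, and σ² = 1.0, the variance is 1.5 at x = 0 and 1.9 at x = 1: exactly the kind of heteroscedasticity referred to above.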
When different specifications of the random part yield a similar structure for the covariance matrix, it will be empirically difficult or impossible to distinguish between them.

But also a misspecification of the fixed part of the model can lead to a misspecification of the random part: sometimes an incorrect fixed part shows up in an unnecessarily complicated random part. For example, if an explanatory variable X with a reasonably high intraclass correlation has in reality a curvilinear (e.g., quadratic) effect without a random component, whereas it is specified as having a linear fixed and random effect, the excluded curvilinear fixed effect may show up in the shape of a significant random effect. The latter effect will then disappear when the correctly specified curvilinear effect is added to the fixed part of the model. This was observed in Example 8.2: the random slope of IQ disappeared in this example when a curvilinear effect of IQ was considered.

9.4.1 Testing for heteroscedasticity

In the hierarchical linear model with random slopes, the observations are heteroscedastic because their variances depend on the explanatory variables, as expressed by equation (5.5). However, the residuals Rij and Uhj are assumed to be homoscedastic, i.e., to have constant variances.

In Chapter 8 it was explained that the hierarchical linear model can also represent models in which the level-one residuals have variances depending linearly or quadratically on an explanatory variable, say X. Such a model can be specified by the technical device of giving this variable X a random slope at level one. Similarly, giving a level-two variable Z a random slope at level two leads to models in which the level-two random intercept variance depends on Z. Also random slope variances can be made to depend on some variable Z.
Neglecting such types of heteroscedasticity may lead to incorrect hypothesis tests for variables which are associated with the variables responsible for this heteroscedasticity (X and Z in this paragraph). Checking for this type of heteroscedasticity is straightforward using the methods of Chapter 8. However, this requires that variables are available that are thought to be possibly associated with the residual variances.

A different method, described in Bryk and Raudenbush (1992, Chapter 9), can be used to detect heteroscedasticity in the form of between-group differences in the level-one residual variance, without a specific connection to some explanatory variable. It is based on the estimated least-squares residuals within each group, further called the OLS residuals. This method is only applicable if many (or all) groups are considerably larger than the number of explanatory variables. Only level-one explanatory variables are considered; the level-two variables are disregarded. What follows applies to the groups for which nj − r − 1 is not too small, say, 10 or more, where r is the number of level-one explanatory variables. For each of these groups separately an ordinary least squares regression is carried out with the level-one variables as explanatory variables. Denote by σ̂j² the resulting estimated residual variance for group j and by dfj = nj − r − 1 the corresponding number of degrees of freedom. The weighted average of the logarithms,

ls̄ = ( Σj dfj ln σ̂j² ) / ( Σj dfj ),  (9.2)

must be calculated.
If the hierarchical linear model is well specified, this weighted average ls̄ will be close to the logarithm of the maximum likelihood estimate of σ². From the group-dependent residual variance σ̂j², a standardized residual dispersion measure can be calculated using the formula

dj = √(dfj / 2) ( ln σ̂j² − ls̄ ).  (9.3)

If the level-one model is well specified and the population level-one residual variance is the same in all groups, then the distribution of the values dj is close to the standard normal distribution. The sum of squares,

H = Σj dj²,  (9.4)

can be used to test the constancy of the level-one residual variances. Its null distribution is chi-squared with N − 1 degrees of freedom, where N is the number of groups included in the summation.

If the within-group degrees of freedom dfj are less than 10 for many or all groups, the null distribution of H is not chi-squared. Since the null distribution depends only on the values of dfj and not on any of the unknown parameters, it is feasible to obtain this null distribution by straightforward computer simulation. This can be carried out as follows: generate independent random variables Vj according to chi-squared distributions with dfj degrees of freedom, calculate sj² = Vj / dfj, and apply equations (9.2), (9.3), and (9.4). The resulting value of H is one random draw from the correct null distribution. Repeating this, say, 1,000 times gives a random sample from the null distribution with which one can compare the observed value from the real data set.

If this test yields a significant result, one can inspect the individual dj values to investigate the pattern of heteroscedasticity. For example, it is possible that the heteroscedasticity is due to a few unusual level-two units for which dj has a large absolute value.

Example 9.1 Level-one heteroscedasticity.
The example of pupils' language performance used, e.g., in Chapters 4 and 5 is considered again.
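The computation of H and the simulation of its null distribution can be sketched in a few lines; the lists `s2` (the per-group OLS residual variances σ̂j²) and `df` (the degrees of freedom dfj) are assumed to have been computed beforehand by per-group least squares:

```python
import math
import random

def H_statistic(s2, df):
    """Equations (9.2)-(9.4): the weighted log-average ls_bar, the
    standardized dispersions d_j, and their sum of squares H."""
    ls_bar = sum(d * math.log(v) for v, d in zip(s2, df)) / sum(df)
    d_vals = [math.sqrt(d / 2.0) * (math.log(v) - ls_bar) for v, d in zip(s2, df)]
    return sum(x * x for x in d_vals), d_vals

def simulated_null(df, n_rep=1000, seed=42):
    """Draws from the null distribution of H: s2_j = V_j / df_j with
    V_j ~ chi-squared(df_j); chi2(k) equals Gamma(k/2, scale 2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_rep):
        s2 = [rng.gammavariate(d / 2.0, 2.0) / d for d in df]
        draws.append(H_statistic(s2, df)[0])
    return draws
```

The simulated p-value is the fraction of draws at least as large as the observed H; when all dfj are 10 or more, this agrees closely with the chi-squared approximation on N − 1 degrees of freedom.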
We investigate whether there is evidence of level-one heteroscedasticity, where the explanatory variables at level one are IQ, SES, and gender (this is the same level-one model as in Table 8.1). Those groups were used for which the residual degrees of freedom are at least 10. There were 86 such groups. The sum of squared standardized residual dispersions defined in (9.4) is H = 79.165, a chi-squared value with d.f. = 85, p = 0.66. Hence this test does not give evidence of heteroscedasticity.

All dj values were smaller in absolute value than 2.3. This is quite reasonable for a sample of 86 standard normal deviates, and therefore confirms the conclusion that there is no evidence that some groups have a larger within-group residual variance than others.

An advantage of this test is that it is based only on the specification of the within-groups regression model. The level-two variables and the level-two random effects play no role at all, so what is checked here is purely the level-one specification. However, the mentioned null distributions of the dj and of H do depend on the normality of the level-one residuals. A more heavy-tailed distribution for these residuals will in itself also lead to higher values of H, even if the residuals do have constant variance. Therefore, if H leads to a significant result, one should investigate the possible pattern of heteroscedasticity by inspecting the dj values, but one should also inspect the distribution of the OLS within-group residuals for normality.

9.4.2 What to do in case of heteroscedasticity

If there is evidence for heteroscedasticity, it may be possible to find variables accounting for the differing values of the level-one residual variance. These could be level-one as well as level-two variables. Sometimes such variables can be proposed on the basis of theoretical considerations.
In addition, plots of dj versus relevant level-two variables, or plots of squared unstandardized residuals (see Section 9.5), can be informative for suggesting such variables. When there is a conjecture that the non-constant residual variance is associated with a certain variable, one can apply the methods of Chapter 8 to test whether this is indeed the case, and fit a heteroscedastic model.

In some cases, a better approach to deal with heteroscedasticity is to apply a non-linear transformation to the dependent variable, e.g., a square root or logarithmic transformation. This can be useful, e.g., when the dependent variable is highly skewed. How to choose transformations of the dependent variable in single-level models is discussed in Atkinson (1985). The use of the Box–Cox transformation family for multilevel models is discussed by Hodges (1998, p. 506). When there is heteroscedasticity and the dependent variable has a small number of categories, another option is to use the multilevel ordered logit model for multiple ordered categories (Section 14.4) or to dichotomize the variable and apply multilevel logistic regression (Section 14.2).

9.5 Inspection of level-one residuals

A plethora of methods have been developed for the inspection of residuals in ordinary least squares regression; see, e.g., Atkinson (1985) and Cook and Weisberg (1982, 1994). Inspection of residuals can be used, e.g., to find outlying cases that have an unduly high influence on the results of the statistical analysis, to check the specification of the fixed part of the model, to suggest transformations of the dependent or the explanatory variables, or to point to heteroscedasticity.

These methods may be applied to the hierarchical linear model, but some changes are necessary because of the more complex nature of the hierarchical linear model and the fact that there are several types of residuals. For example, in a random intercept model there is a level-one and also a level-two residual.
Various methods of residual inspection for multilevel models were proposed by Hilden-Minton (1995). He noted that a problem in residual analysis for multilevel models is that the observations depend on the level-one and level-two residuals jointly, whereas for model checking it is desirable to consider these residuals separately. It turns out that level-one residuals can be estimated so that they are unconfounded by the level-two residuals, but the other way around is impossible.

The within-group OLS residuals discussed in Section 9.4.1 can serve as such estimated level-one residuals, since they do not depend on the level-two random effects. Two ways of inspecting them are the following. (Groups with a small number of units may have been excluded from the calculation, so the total number of residuals, M, may be less than the original total sample size.)

1. Plot the OLS residuals against the level-one explanatory variables. With a large number of level-one units such a plot easily becomes an uninformative cloud of points, in which case it is advisable to make a smoothed version by taking moving averages: order the residuals according to the values of the explanatory variable X under consideration, denote them by r1, …, rM (units with the same value of X may be ordered arbitrarily), and compute

r̄i = (1/(2K)) ( r(i−K+1) + … + r(i+K) )   for i = K, …, M − K.

The value of K will depend on data and sample size; e.g., if the total number of residuals M is at least 1,500, one could take moving averages of 2K = 100 values. One may add to the plot horizontal lines at plus or minus twice the standard error of the mean of 2K values, i.e., at ±2√(σ̂²/(2K)), where σ̂² is the variance of the OLS residuals. This indicates roughly that values r̄i outside this band, i.e., |r̄i| > 2√(σ̂²/(2K)), may be considered to be relatively large.
2.
Make a normal probability plot of the standardized OLS residuals to check the assumption of a normal distribution. This is done by plotting the values (zi, r̂i), where r̂i is the standardized OLS residual and zi the corresponding normal score (i.e., the expected value from the standard normal distribution according to the rank of r̂i). Especially when this shows that the residual distribution has longer tails than the normal distribution, there is a danger of parameter estimates being unduly influenced by outlying level-one units.

When a data exploration is carried out along these lines, it may suggest model improvements which then can be tested. One should realize that these tests are suggested by the data: if the tests use the same data that were used to suggest them, which is usual, this will lead to chance capitalization and inflated probabilities of errors of the first kind. The resulting improvements are convincing only if they are very significant (e.g., p < 0.01 or p < 0.001). If the data set is big enough, it is preferable to employ cross-validation.

Example 9.2 Level-one residual inspection.
We continue Example 9.1, in which the data set also used in Chapters 4 and 5 is considered again, the explanatory variables at level one being IQ, SES, and gender. Within-group OLS residuals were calculated for all groups with at least 10 within-group residual degrees of freedom.

The variables IQ and SES both have a limited number of categories. For each category, the mean residual was calculated, and the standard error of this mean was calculated in the usual way. The mean residuals are plotted in Figure 9.1 for the categories containing 10 or more pupils. The vertical lines indicate the intervals bounded by the mean residual plus or minus twice the standard error of the mean. For IQ the mean residuals exhibit a clearer pattern than for SES. The left-hand figure suggests a curvilinear function of IQ which has a local minimum for IQ between −2 and 0 and a local maximum for IQ somewhere around 2.

Figure 9.1 Mean level-one OLS residuals (with bars extending to twice the standard error of the mean) as functions of IQ (left) and SES (right).

There are few pupils with IQ values less than −3 or greater than +4, so in this IQ range the error bars are very wide and not very informative. Thus the figures point toward a non-linear effect of IQ, having a local minimum for a negative IQ value and a local maximum for a positive IQ value. Examples of such functions are a third-degree polynomial and a quadratic spline with two nodes (cf. Section 12.2.2). The first option was explored by adding IQ² and IQ³ to the fixed part. The second option was explored by adding IQ⁺ and IQ⁻, as defined in (8.2), to the fixed part. The second option gave a much better model improvement and therefore was selected. When IQ⁺ and IQ⁻ are added to Model 1 of Table 8.1, the deviance goes down by 45.8 (d.f. = 2, p < 0.00001). This is strongly significant, so this non-linear effect of IQ is convincing even though it was not hypothesized beforehand but suggested by the data.

The mean residuals for the resulting model are graphed as functions of IQ and SES in Figure 9.2.

Figure 9.2 Mean level-one OLS residuals (with bars extending to twice the standard error of the mean) as functions of IQ (left) and SES (right), for the model with the non-linear effect of IQ.

These plots do not exhibit a remaining non-linear effect of IQ. For SES one could be tempted to see a curvilinear pattern, but explorations with a third power of SES and with some quadratic splines did not produce any convincing non-linear effects.

A normal probability plot of the residuals for the model that includes the non-linear effect of IQ is given in Figure 9.3.
The distribution looks quite normal except for the very low values, where the residuals are somewhat more strongly negative (i.e., larger in absolute value) than expected. However, this deviation from normality is rather small.

Figure 9.3 Normal probability plot of standardized level-one OLS residuals.

9.6 Residuals and influence at level two

Estimated level-two residuals are always confounded with the estimated level-one residuals. Therefore one should check the specification of the level-one model before moving on to checking the specification at level two.

9.6.1 Empirical Bayes residuals

The empirical Bayes estimates (also called posterior means) of the level-two random effects, treated in Section 4.7, are the 'estimates' of the level-two random variables Uhj (h = 0, …, p) in the model specification (9.1). These can be used as estimated level-two residuals. They can be standardized by dividing by their standard deviations. Similarly to the level-one residuals, one may plot the unstandardized level-two residuals as functions of relevant level-two variables, and make normal probability plots of the standardized level-two residuals. Smoothing the plots will be less necessary because of the usually much smaller number of level-two units. If the plots contain outliers, one may inspect the corresponding level-two units to check whether anything unusual can be found. Checking empirical Bayes residuals is discussed more extensively in Langford and Lewis (1998, Section 2.4).

Example 9.3 Level-two residual inspection.
Example 9.2 is now continued with the inspection of the level-two residuals. The model includes the effects of IQ (with the non-linear terms), SES, and gender, and a random slope of IQ. Level-two empirical Bayes residuals were calculated for the intercept and for the slope of IQ. Figure 9.4 plots the standardized residuals as functions of class size, a level-two variable not included in the model.

Figure 9.4 Level-two residuals as functions of class size; left: residuals for the intercept, right: residuals for the IQ slope.

These plots do not show any clear pattern. This supports the specification of the model, in which class size and its interaction with IQ are omitted. Normal probability plots of the standardized level-two residuals were also made; these do not give an indication of important deviations from normality.

9.6.2 Influence of level-two units

The diagnostic value of residuals can be supplemented by so-called influence diagnostics, which give an indication of the effect of certain parts of the data on the obtained parameter estimates. In this section we present diagnostics to investigate the influence of level-two units. These diagnostics show how strongly the parameter estimates are affected if unit j is deleted from the data set. This section is based to a large extent on Lesaffre and Verbeke (1998), although we do not precisely follow all their recommendations. Other influence measures for multilevel models may be found in Langford and Lewis (1998, Section 2.3). For the definition of the diagnostics we shall need matrix notation, but readers who do not know this notation can base their understanding on the verbal explanations.

Denote by γ = (γ0, …, γr) the vector of all parameters of the fixed part, consisting of the general intercept and all regression coefficients. Further denote by γ̂ the vector of estimates produced when using all data, and by Σ̂F the covariance matrix of this vector of estimates (so the standard errors of the elements of γ̂ are the square roots of the diagonal elements of Σ̂F). Denote the parameter estimate obtained if level-two unit j is deleted from the data by γ̂(−j). Then unit j has a large influence if γ̂(−j) differs much from γ̂.
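The standardized comparison of γ̂ and γ̂(−j) that is developed next can be sketched directly; the quadratic form is divided by the number of fixed parameters, and a plain Gaussian-elimination solver stands in for the matrix inverse (all names are ours):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def influence_fixed(gamma_full, gamma_without_j, cov_fixed):
    """Cook-like diagnostic for the fixed part:
    (1/(r+1)) * d' * inv(cov_fixed) * d, where d is the difference
    between the estimates with and without level-two unit j."""
    d = [a - b for a, b in zip(gamma_full, gamma_without_j)]
    return sum(di * xi for di, xi in zip(d, solve(cov_fixed, d))) / len(d)
```

With one fixed parameter, a deletion shift of 0.7 standard errors gives a value of 0.49, matching the interpretation given in the text for a diagnostic near 0.5.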
This difference can be measured on the basis of the covariance matrix Σ̂F, because this matrix indicates the uncertainty that exists anyway in the elements of γ̂. Unit j has a large impact on the parameter estimates if, for one or more of the individual regression coefficients γh, the difference between γ̂h and γ̂h(−j) is not much smaller, or even larger, than the standard error of γ̂h. Given that the vector γ has a total of r + 1 elements, a standardized measure of the difference between the estimated fixed effects for the entire data set and the estimates for the data set excluding unit j is given by

CjF = (1/(r+1)) (γ̂ − γ̂(−j))′ Σ̂F⁻¹ (γ̂ − γ̂(−j)).

This can be interpreted as the average squared deviation between the estimates with and those without unit j, where the deviations are measured proportionally to the standard errors (and account is taken of the correlations between the parameter estimates). For example, if there is only one fixed parameter and the value of CjF for some unit is 0.5, then leaving out this unit changes the parameter estimate by a deviation equal to √0.5 ≈ 0.7 times its standard error, which is quite appreciable.

This influence diagnostic is a direct analogue of Cook's distance (treated, e.g., in Cook and Weisberg, 1982, and Atkinson, 1985) for linear regression analysis. If the random part at level two is empty, then CjF is equal to Cook's distance.

It may be quite time-consuming to calculate the estimates γ̂(−j) for each group j. Since the focus here is on the diagnostic value of the influence statistic rather than on the precise estimation for the data set without group j, an approximation can be used instead. It was proposed by Pregibon (1981), in a different context, to substitute the one-step estimator, which starts from the estimate obtained for the whole data set and performs a single step of the estimation algorithm. (The one-step estimates can be conveniently calculated within software based on the Fisher scoring or (R)IGLS algorithms.) Denoting this one-step estimator by γ̂1(−j), we obtain the influence diagnostic

CjF = (1/(r+1)) (γ̂ − γ̂1(−j))′ Σ̂F⁻¹ (γ̂ − γ̂1(−j)).  (9.5)

A similar influence diagnostic can be defined for the influence of group j on the estimates of the random part. Denote by η the vector of all parameters of the random part and by q the number of its elements. If the covariance matrix of the level-two random effects is unrestricted and there are p random slopes, this vector contains q = (p+1)(p+2)/2 + 1 parameters, including the level-one residual variance. Now define η̂ as the estimate of the random part parameters, Σ̂R as the covariance matrix of this estimate (which has on its diagonal the squared standard errors of the elements of η̂), and η̂1(−j) as the one-step estimate when deleting level-two unit j. The influence of unit j on the parameters of the random part is measured by

CjR = (1/q) (η̂ − η̂1(−j))′ Σ̂R⁻¹ (η̂ − η̂1(−j)).  (9.6)

One can also consider the combined influence of group j on the parameters of the fixed and those of the random part of the model. Since the estimates of the fixed and random parts are approximately uncorrelated (Longford, 1987), such a combined influence diagnostic can be defined simply as the weighted average of the two previously defined diagnostics,

Cj = (1/(r+1+q)) ( (r+1) CjF + q CjR ).  (9.7)

Lesaffre and Verbeke (1998) proposed influence diagnostics which are closely related to those proposed here.

Whether group j has a large influence on the parameter estimates for the whole data set depends on two things. In the first place, the 'leverage' of the group: the potential of this group to influence the parameters, determined by its size and the values of the explanatory variables in it. Groups with a large sample size and with strongly dispersed values of the explanatory variables have a high leverage. In the second place, the extent to which this group fits the model as estimated from (or suggested by) the other groups, which is closely related to the size of the residuals in this group. A poorly fitting group that has low leverage, e.g., because of a small size, will not strongly affect
Assumptions of the hierarchical linear model

the parameter estimates. Vice versa, a group with high leverage will not strongly affect the parameter estimates if it has residuals very close to 0. If the model fits well and the explanatory variables are approximately randomly distributed across the groups, then the expected value of the diagnostics (9.5), (9.6), and (9.7) is roughly proportional to the group size $n_j$. A plot of these diagnostics as a function of $n_j$ will reveal whether some of the groups influence the parameter estimates more strongly than should be expected on the basis of group size.

The fit of the level-two units to the model can be measured by the standardized multivariate residual for each unit. This measure was also proposed by Lesaffre and Verbeke (1998), who call it the 'squared length of the residual'. It is defined as follows. The predicted value for observation $Y_{ij}$ on the basis of the fixed part is given by

$$\hat{Y}_{ij} = \hat{\gamma}_0 + \sum_h \hat{\gamma}_h\, x_{hij}\,.$$

The multivariate residual for group j is the vector of deviations between the observations and these predicted values,

$$D_j = \bigl(Y_{ij} - \hat{Y}_{ij}\bigr)_{i=1,\dots,n_j}\,. \qquad (9.8)$$

This multivariate residual can be standardized on the basis of the covariance matrix of the vector of all observations in group j. For the hierarchical linear model with one random slope, the elements of this covariance matrix are given by (5.5) and (5.6). Denote, for a general model specification, the covariance matrix of the observations in group j by $\Sigma(Y_j) = \mathrm{Cov}(Y_j)$. This covariance matrix is a function of the parameters of the random part of the model. Substituting estimated parameters yields the estimated covariance matrix, $\hat{\Sigma}(Y_j)$. Now the standardized multivariate residual is defined as

$$S_j^2 = D_j'\,\bigl(\hat{\Sigma}(Y_j)\bigr)^{-1}\,D_j\,. \qquad (9.9)$$

This definition can also be applied to models with heteroscedasticity at level one (see Chapter 8).
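A minimal sketch of the computation (9.8)–(9.9) in Python, for an assumed random-intercept covariance structure (the function name and the numbers are ours):

```python
import numpy as np

def standardized_residual(y_j, yhat_j, Sigma_j):
    # Squared standardized multivariate residual S_j^2 of (9.9): the
    # deviations (9.8) of group j's observations from their fixed-part
    # predictions, standardized by the estimated covariance matrix of
    # the observations in group j.
    d = np.asarray(y_j, float) - np.asarray(yhat_j, float)
    return float(d @ np.linalg.inv(np.asarray(Sigma_j, float)) @ d)

# Random-intercept structure: intercept variance 0.5 between groups,
# residual variance 1.0 within, for a group of n_j = 4 observations.
n_j = 4
Sigma = 0.5 * np.ones((n_j, n_j)) + 1.0 * np.eye(n_j)
S2 = standardized_residual([1.2, 0.8, 1.1, 0.9], [1.0] * n_j, Sigma)
```

For a correctly specified model, this value would be compared with a chi-squared distribution, as explained next.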
The standardized residual can be interpreted as the sum of squares of the multivariate residual for group j after transforming this multivariate residual to a vector of uncorrelated elements each with unit variance. If the model is correctly specified and the number of groups is large enough for the parameter estimates to be quite precise, then $S_j^2$ has a chi-squared distribution with $n_j$ degrees of freedom. This distribution can be used to test the value of $S_j^2$ and thus to investigate whether the data of group j fit the hierarchical linear model as defined by the total data set. (This proposal was made by Waternaux, Laird, and Ware, 1989.) Some caution should be exercised here, because the test is applied to all groups simultaneously: even when the model is correctly specified, it may be expected that $S_j^2$ has a seemingly significantly large value, just by chance, for one or two groups. One can apply here the Bonferroni correction for multiple testing, which means that the deviation of a group is considered significant only if the associated p-value of the standardized multivariate residual is less than 0.05/N, where N is the total number of groups.

When checking the model specification, one should worry mainly about groups that combine an influence diagnostic $C_j$ that is large in comparison with the other groups with a poor fit as measured by the standardized multivariate residual $S_j^2$. For such groups one should investigate the data separately: for example, there may be data errors, or such a group may not really belong to the intended population. This investigation may, of course, also lead to a model improvement.

Example 9.4 Influence of level-two units.
We continue checking the model for the language test score which was also investigated in Example 9.3. Recall that this is Model 4 of Table 8.2, but without the level-one heteroscedasticity. The twenty largest influence diagnostics (9.7) are presented in Table 9.1, together with the p-values of the standardized multivariate residuals (9.9).

Table 9.1 Twenty largest influence diagnostics.
[Table 9.1: for each of the twenty schools, the school number, the group size $n_j$, the influence diagnostic $C_j$, and the p-value of the standardized multivariate residual; the values are not legible in this reproduction.]

The school with the highest influence has a good fit (p = 0.78). Schools 15 and 108 fit very poorly, and schools 18 and 17 have a moderately bad fit. The Bonferroni-corrected bound for the smallest p-value is 0.05/131 = 0.0004. The two poorest fitting schools have p-values smaller than this bound, which suggests that these schools deviate from the model. This led us to investigate a more elaborate model. As intelligence is the most important predictor of the language test, we tried to model the effect of IQ in more detail.

Table 9.2 Model with a more detailed effect of IQ.

[Table 9.2: fixed effect estimates and standard errors, and the random part (level-two intercept variance, IQ slope variance, intercept–IQ slope covariance, level-one variance) and the deviance; most values are not legible in this reproduction.]

An important variable in the data set, not used in the earlier examples, is performal IQ. (The IQ variable used up to now is a measure of verbal IQ, a natural first IQ dimension for explaining language achievement. Performal IQ is an important second IQ dimension.) Performal IQ in this data set has mean 0 and standard deviation 2.20. Exploration of some interactions led to considering also the interaction of verbal IQ with gender and with SES. Including these fixed effects in the model led to the parameter estimates in Table 9.2. These estimates show that performal IQ has an effect additional to verbal IQ, and that verbal IQ has a weaker effect for children with a higher socio-economic status and also for girls. The twenty largest influence diagnostics for this model are presented in Table 9.3.

The three largest influence diagnostics have become markedly smaller compared to Table 9.1. However, the fit of school 108 has further deteriorated, and is significantly bad even when taking into consideration (by the Bonferroni correction) that this is the poorest fitting school in the whole collection of schools.
This poor fit is not too alarming, however, as the sample size for this school is small ($n_j = 9$) and its influence diagnostic is not very large ($C_j = 0.018$) compared to the $C_j$ values of other schools. (The same applies, by the way, to the diagnostics presented in Table 9.1.)

Table 9.3 Twenty largest influence diagnostics for extended model.

[Table 9.3: school number, group size $n_j$, influence diagnostic $C_j$, and p-value for each of the twenty schools; the values are not legible in this reproduction.]

A look at the data for school 108 reveals that the predicted values $\hat{Y}_{ij}$ for this school are quite homogeneous and close to the average, but two of the nine outcome values $Y_{ij}$ are very low and have, correspondingly, very low residuals, −21.8 and −28.6. When these two pupils are deleted from the data set, the parameter estimates hardly change, but the lowest p-value for the standardized multivariate residuals becomes 0.0014, which is not significant when the Bonferroni correction is applied. Concluding, the investigation of the influence statistics led to an improved fit due to the inclusion of performal IQ and the interactions of IQ with SES and with gender, and to the detection of two deviant cases in the data set. These two cases, out of a total of 2,287, however, did not have any noticeable influence on the parameter estimates.

9.7 More general distributional assumptions

If it is suspected that the normality and homoscedasticity assumptions of the hierarchical linear model are not satisfied, one could employ methods based on less restrictive model assumptions. For example, one could assume that the residuals have a distribution with heavier tails than the normal distribution, or that residual variances vary randomly between level-two units. This is an area of active research.
E.g., Verbeke and Lesaffre (1997) propose a 'sandwich' modification to make the standard errors of the normal theory estimators applicable also to non-normally distributed random effects; Richardson (1997) derives robust estimators; Seltzer (1993), Seltzer, Wong, and Bryk (1996), and Kasim and Raudenbush (1998) propose estimators based on Gibbs sampling. Several multilevel computer programs provide robust ('sandwich') standard errors, which can be used if the random effects are not normally distributed and the sample size is fairly large.

10 Designing Multilevel Studies

Up to now it was assumed that the researcher wishes to test interesting theories on hierarchically structured systems (or on phenomena that can be thought of as having a hierarchical structure, such as repeated data) on available data. Or that multilevel data exist, and that one wishes to explore the structure of the data. This, of course, is the other way around. Normally a theory (or a practical problem that has to be investigated) will direct the design of the study and the data to be collected. This chapter focuses on one aspect of this research design, namely, the sample sizes. Sample size questions in multilevel studies were treated also by Snijders and Bosker (1993), Mok (1995), and Cohen (1998). Another aspect, the allocation of treatments to subjects or groups, and the gain in precision obtained by including a covariate, is discussed in Raudenbush (1997). Hedeker, Gibbons, and Waternaux (1999) present methods for sample size determination for longitudinal data analysed by multilevel methods.

This chapter presents some methods to choose sample sizes that will yield a high power for testing, or (equivalently) small standard errors for estimating, certain parameters in two-level designs, given financial and practical constraints.
A problem in the practical application of these methods is that sample sizes which are optimal, e.g., for testing some cross-level interaction effect, are not necessarily optimal, e.g., for estimating the intraclass correlation. The fact that optimality depends on one's objectives, however, is a general problem of life that cannot be solved by this textbook. If one wishes to design a good multilevel study, it is advisable to determine first the primary objective of the study, express this objective in a tested or estimated parameter, and then choose sample sizes for which this parameter can be estimated with a small standard error, given financial, statistical, and other practical constraints. Sometimes it is possible to check, in addition, whether also for some other parameters (corresponding to secondary objectives) these sample sizes yield acceptably low standard errors.

A relevant general remark is that the sample size at the highest level is usually the most restrictive element in the design. For example, a two-level design with 10 groups, i.e., a macro-level sample size of 10, is at least as uncomfortable as a single-level design with a sample size of 10. Requirements on the sample size at the highest level, for a hierarchical linear model with q explanatory variables at this level, are at least as stringent as requirements on the sample size in a single-level design with q explanatory variables.

10.1 Some introductory notes on power

When a researcher is designing a multi-stage sampling scheme, e.g., to assess the effects of schools on the achievement of students, or to test the hypothesis that citizens in impoverished neighborhoods are more often victims of crime than other citizens, important decisions must be made with respect to the sample sizes at the various levels.
For the two-level design in the first example the question may be phrased as follows: should one investigate many schools with few students per school, or few schools with many students per school? Or, for the second example: should we sample many neighborhoods with only few citizens per neighborhood, or many citizens per neighborhood and only few neighborhoods? In both cases we assume, of course, that there are budgetary constraints for the research to be conducted. To phrase this question more generally and more precisely: how should researchers choose sample sizes at the macro- and micro-level in order to ensure a desired level of power, given a relevant (hypothesized) effect size and a chosen significance level? The average micro-level sample size per macro-level unit will be denoted by n and the macro-level sample size by N. In practice the size of the macro-level units will usually be variable (if it were only for unintentionally missing data), but for calculations of desired sample sizes it normally is adequate to use approximations based on the assumption of constant 'group' sizes.

A general introduction to power analysis can be found in the standard work by Cohen (1988) or, for a quick introduction, Cohen's (1992) power primer. The basic idea is that we would like to find support for a research hypothesis (H1) stating that a certain effect exists, and therefore we test a null hypothesis about the absence of this effect (H0) using a sample from the population of interest. The significance level α represents the risk of mistakenly rejecting H0. This mistake is known as a Type I error. Vice versa, β denotes the risk of disappointingly not rejecting H0 in the case that the effect does exist in the population. This mistake is known as a Type II error. The statistical power of a significance test is the probability of rejecting H0, given the effect size in the population, the significance level α, and the sample size and study design.
Power is therefore given by 1 − β. As a rule of thumb, Cohen suggests that power is moderate when it is 0.50 and high when it is at least 0.80. Power increases as α increases, and also as the sample size and/or the effect size increase. The effect size can be conceived as the researcher's idea about 'the degree to which the null hypothesis is believed to be false' (Cohen, 1992, p. 156). We suppose that the effect size is expressed by some parameter γ that can be estimated with a certain standard error, denoted by S.E.($\hat{\gamma}$). Bear in mind that the size of the standard error is a monotone decreasing function of the sample size: the larger the sample size, the smaller the standard error! In most single-level designs, the standard error of estimation is inversely proportional (or roughly so) to the square root of the sample size.

The relation between effect size, power, significance level, and sample size can be presented in one formula. This formula is an approximation that is valid for practical use when the test in question is a one-sided test for γ with a reasonably large number of degrees of freedom, say, df ≥ 10. Recall that the test statistic for this test can be expressed by $t = \hat{\gamma}/\mathrm{S.E.}(\hat{\gamma})$. The formula is

$$\frac{\text{effect size}}{\text{standard error}} \approx z_{1-\alpha} + z_{1-\beta}\,, \qquad (10.1)$$

where $z_{1-\alpha}$ and $z_{1-\beta}$ are the z-scores (values from the standard normal distribution) associated with the indicated cumulative probability values. For instance, if α is chosen at 0.05 and 1 − β at 0.80 (so that β = 0.20), and an effect size of 0.50 is what we expect, then we can derive that we are searching for a minimum sample size that satisfies

$$\text{standard error} \leq \frac{0.50}{1.645 + 0.84} \approx 0.20\,.$$

Formula (10.1) contains four 'unknowns'.
This means that if three of these are given, then we can compute the fourth. In most applications that we have in mind, the significance level α is given and the effect size that is hypothetically considered (or guessed) has a given value; then either the standard error is also known and the power 1 − β is calculated, or the intended power is known and the standard error is calculated. Given the standard error, we can then try to calculate the required sample sizes. For many types of design one can choose the sample size necessary to achieve a certain level of power on the basis of Cohen's work. For multilevel designs, however, there are two kinds of sample sizes: the sample size of the micro-units within each macro-unit, n, and the sample size of the macro-units, N, with Nn being the total sample size for the micro-units.

10.2 Estimating a population mean

The most simple case of a multilevel study occurs when one wishes to estimate a population mean for a certain variable of interest (e.g., income, age, literacy), and one is willing to use the fact that the respondents are regionally clustered. This makes sense, since if one is interviewing persons it is a lot cheaper to sample, let us say, 100 regions and then interview 10 persons per region, than to randomly sample 1,000 persons who may live scattered all over a country. In educational assessment studies it is of course more cost-efficient to sample a number of schools and then to take a subsample of students within each school, than to sample students completely at random (ignoring their being clustered in schools).

Cochran (1977, Chapter 9) provides formulae to calculate desired sample sizes in case of such two-stage sampling. On p. 242, he defines the design effect for a two-stage sample, which is the factor by which the variance of an estimate (which is the square of the standard error of this estimate) is increased because of using a two-stage sample rather than a simple random sample.
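Both computations just described are small enough to sketch in Python: the maximal standard error implied by formula (10.1), and Cochran's design effect, which for estimating a mean works out to the familiar $1 + (n-1)\rho_I$, with $\rho_I$ the intraclass correlation (the function names and example numbers are ours):

```python
from statistics import NormalDist

def max_standard_error(effect_size, alpha=0.05, power=0.80):
    # Largest standard error at which a one-sided test of the hypothesized
    # effect size still attains the desired power, by formula (10.1):
    # effect size / s.e. >= z_{1-alpha} + z_{1-beta}.
    z = NormalDist().inv_cdf
    return effect_size / (z(1.0 - alpha) + z(power))

def design_effect(n, rho):
    # Cochran's design effect for a two-stage sample: the factor by which
    # the variance of an estimated mean is inflated, relative to a simple
    # random sample of the same total size, when n micro-units are sampled
    # per macro-unit and the intraclass correlation is rho.
    return 1.0 + (n - 1) * rho

# The text's example: alpha = 0.05, power 0.80, effect size 0.50
# requires a standard error of at most about 0.20.
se_max = max_standard_error(0.50)

# Interviewing 10 persons per region with rho = 0.1 nearly doubles the
# variance relative to simple random sampling of the same total size.
deff = design_effect(10, 0.1)
```

Dividing the total micro-level sample size by the design effect gives the 'effective' sample size of the clustered design.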
A two-level model with a crossed random factor

The random neighborhood effect is rewritten as

$$W_{f(i,j)} = \sum_{f=1}^{F} W_f\, b_{f,ij}\,. \qquad (11.2)$$

This complicated looking notation is chosen because it does away with the function f(i, j), which cannot be handled directly in the multilevel framework. Equation (11.2) reduces the crossed random effect to random slopes of the dummy variables $b_1$ to $b_F$. However, these random slopes are not nested in units at level one or two. The trick now is the creation of an extra level, the third, consisting of only one unit (encompassing therefore the entire data set). This may well be called a dummy level or pseudo-level. The dummy variables $b_f$ are given uncorrelated random slopes at this third level, and the slope variances must satisfy the restriction

$$\mathrm{var}(W_1) = \mathrm{var}(W_2) = \dots = \mathrm{var}(W_F)\,. \qquad (11.3)$$

These ingredients are sufficient to model the crossed random effects within the framework of the hierarchical linear model.

It is clear which requirements are made of multilevel software for modeling random effects in this way. An extra level with only one unit must be allowed, and it must be possible to make equality restrictions between random slopes. This is possible, e.g., in MLn/MLwiN.¹ A practical requirement is that the software can accommodate the required number of F random slopes. In practice, this will set limits to the feasibility of incorporating crossed random effects. In model (11.1), the roles of schools j and neighborhoods f could be interchanged: pupils then would be nested within neighborhoods, and these would be crossed with schools. Both orientations are equivalent. Which one to choose is a matter of convenience. It will usually be most efficient for multilevel software to choose the factor with the larger number of units as the nesting factor, and the factor with the fewer units as the crossed factor.

This method can be extended to situations where pupils may have lived part of the time in one, and part of the time in another, neighborhood (or, generally, may have belonged to more than one category of the crossed random factor).
This is done by defining the variable $b_f$ not by 0–1 values as above, but by positive values summing to 1, indicating the fraction of time during which the pupil belonged to neighborhood f; see Hill and Goldstein (1998).

¹ This program also contains the special command SETX for defining a model with crossed random effects.

Crossed random coefficients

Example 11.1 Sustained primary school effects.
We continue with the Belgian data set of Opdenakker and Van Damme (1997), which was used also in the examples in Sections 4.8 and 5.5. This data set also contains information on the specific primary school attended by the students before they went to secondary education. In total 376 different primary schools were attended by the 3,752 students, whereas a much smaller number of secondary schools were involved. In order to find out whether the primary school attended has a lasting effect on mathematics achievement in secondary school, the primary school was included as a crossed random factor, with the secondary schools as the nesting factor. Model 1 contains only the random effects of secondary and primary schools; in Model 2, student characteristics and family background variables are added to the fixed part. The results are presented in Table 11.1.

Table 11.1 Two cross-classified models for math achievement.

[Table 11.1: fixed effects, the variance components for the secondary school level, the crossed primary school random effect, and the student level, with standard errors, and the deviance, for Models 1 and 2; the values are largely illegible in this reproduction.]

From the results for Model 1 it can be seen that most of the variance in mathematics achievement resides at the student level; a considerable part is associated with the secondary schools, and only a small percentage of the variance is associated with the primary schools, the crossed random factor. In Model 2, where the student-level explanatory variables are added, the total unexplained variance is reduced considerably.
The unexplained variance at the level of the secondary schools decreased sharply, which indicates that the secondary schools in this Belgian sample are highly selective, some schools having the more advanced and privileged groups of students (high IQ, high pretest score, high SES, etc.), whereas others serve more disadvantaged groups. Still, however, the secondary schools differ markedly in their value added, since 10 percent of the residual variance still resides at the secondary school level. The fact that the variance associated with secondary schools is so much less in Model 2 than in Model 1 leads to the conclusion that some secondary schools 'pick' the better students from the primary schools, whereas other secondary schools have to deal with the other students. For the students it seems not to matter much for their mathematics ability which specific primary school they have attended: the unmeasured primary school effect is rather small.

11.2 Crossed random effects in three-level models

A crossed random effect in a three-level model can occur in two ways. An example of the first is pupils nested in classes nested in schools, with neighborhoods as a crossed effect. In this case the extra factor is crossed with the level-three units: neighborhoods are crossed with schools. The random effect of neighborhoods is modeled just like in the two-level case: an extra 'dummy' level is created, but now it is the fourth level; dummy variables are defined that indicate, for each neighborhood, whether the pupil lives in this neighborhood; these dummy variables get random slopes at level four, with the restriction of equal slope variances.
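Constructing the 0–1 dummy variables used throughout this chapter is mechanical. A sketch (the function name is ours) that returns one column per neighborhood, with exactly one 1 per pupil, in the spirit of definition (11.2):

```python
import numpy as np

def crossed_factor_dummies(neighborhood_of_pupil):
    # One 0-1 dummy column b_f per neighborhood; row i has a 1 in the
    # column of the neighborhood in which pupil i lives. These columns
    # are the variables that receive the equality-constrained random
    # slopes at the dummy (pseudo) level.
    labels = sorted(set(neighborhood_of_pupil))
    col = {f: k for k, f in enumerate(labels)}
    B = np.zeros((len(neighborhood_of_pupil), len(labels)))
    for i, f in enumerate(neighborhood_of_pupil):
        B[i, col[f]] = 1.0
    return B, labels

B, labels = crossed_factor_dummies(["a", "b", "a", "c"])
```

For the fractional coding of Hill and Goldstein (1998), the 1.0 entries would simply be replaced by time fractions summing to 1 per row.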
The second kind of crossed random effect in a three-level model is exemplified by pupils nested in schools nested in towns, again with neighborhoods as crossed effects. Here the extra factor is crossed with the level-two units and nested in the level-three units: neighborhoods are crossed with schools and nested in towns. If pupils are indicated by i, schools by j, and towns by k, the effect of neighborhood f in town k should be denoted $W_{f,k}$. The usual assumptions are made again (normal distributions with mean 0 and a common variance, independence). Again, dummy variables $b_f$ need to be created. They are defined by

$$b_{f,ijk} = \begin{cases} 1 & \text{if pupil } i \text{ in school } j \text{ in town } k \text{ lives in neighborhood } f, \\ 0 & \text{if this pupil lives in a different neighborhood.} \end{cases} \qquad (11.4)$$

How many of these dummy variables are there? Denote by $F_k$ the number of neighborhoods in town k, and their maximum by

$$F_{\max} = \max_k F_k\,. \qquad (11.5)$$

The number of dummy variables then is $F_{\max}$. Since neighborhoods are nested within towns, there is no need here to create an extra level to accommodate the crossed random effect. The dummy variables $b_f$, for $f = 1,\dots,F_{\max}$, must have uncorrelated random slopes at level three, with slope variances constrained to be equal. Each single town is modeled just like the whole data set of Section 11.1. In other words, just as the data set in the present section is a juxtaposition of towns, the model is the juxtaposition of the models described in Section 11.1.

Since the number of random slopes in this model is not the total number of neighborhoods in the data set, but the maximum of the number of neighborhoods in each town, this model economizes on the number of random slopes, compared to Section 11.1. Suppose now that one has a data set of pupils in schools and neighborhoods, these being nested in towns, but one is not interested in town effects or there are no random town effects.
Then it still can be advisable to use towns as a third level, because this leads to a much smaller number of random slopes and thereby to a model that is much more efficient given the constraints of existing software. It might even be impossible to implement the large number of random slopes, unless one uses the towns as a nesting factor.

11.3 Correlated random coefficients of crossed factors

The principle of modeling crossed random effects, explained in Section 11.1, can be extended in many ways. This section treats three such variations with correlated random coefficients of crossed factors.

11.3.1 Random slopes in a crossed design

The neighborhoods in Section 11.1 could have random slopes in addition to their random intercepts. The notation of that section is continued here. If neighborhood f has a random intercept, and also a random slope for the variable X (denoted here as a level-one variable, but allowed also to be a level-two variable), the total contribution of neighborhood f to the random part is

$$W_{0f} + W_{1f}\, x\,. \qquad (11.6)$$

For the neighborhood of pupil i in school j, this is

$$W_{0,f(i,j)} + W_{1,f(i,j)}\, x_{ij}\,. \qquad (11.7)$$

With the dummy variables $b_f$, this is rewritten as

$$\sum_{f=1}^{F} \bigl( W_{0f}\, b_{f,ij} + W_{1f}\, b_{f,ij}\, x_{ij} \bigr)\,. \qquad (11.8)$$

This shows that, in addition to the dummy variables $b_f$, we now need the product variables $b_f X$ with values

$$b_{f,ij}\, x_{ij}\,. \qquad (11.9)$$

To implement the random part (11.8), one gives random slopes to the variables $b_f$ and $b_f X$. For different values of f, these random slopes are uncorrelated. The slopes of $b_f$ and $b_f X$ for each single value of f, however, must be allowed to be correlated. The assumption (11.3) about equal slope variances now must be extended to

$$\begin{aligned} \mathrm{var}(W_{01}) &= \mathrm{var}(W_{02}) = \dots = \mathrm{var}(W_{0F})\,,\\ \mathrm{var}(W_{11}) &= \mathrm{var}(W_{12}) = \dots = \mathrm{var}(W_{1F})\,,\\ \mathrm{cov}(W_{01}, W_{11}) &= \mathrm{cov}(W_{02}, W_{12}) = \dots = \mathrm{cov}(W_{0F}, W_{1F})\,. \end{aligned} \qquad (11.10)$$

The interpretation of the random slope of the crossed factor is similar to the interpretation in other random slope models.

11.3.2 Multiple roles

The crossed factor may occur in more than one role. As an example, consider secondary school students and their teachers: each teacher teaches a group of students, and examines a subset of the students, not all of whom need to have been taught by this same teacher. Each teacher then has a random effect in the teaching role as well as in the examining role, and for a given teacher these two effects may be correlated. It will be no surprise that, in this case, two sets of dummy variables are needed. For the teaching roles there are dummy variables $b_1,\dots,b_F$ with

$$b_{f,ij} = \begin{cases} 1 & \text{if student } i \text{ was taught by person } f, \\ 0 & \text{if this student was taught by another person.} \end{cases} \qquad (11.11)$$

Dummy variables $c_1,\dots,c_F$ for the examining roles are defined by

$$c_{f,ij} = \begin{cases} 1 & \text{if student } i \text{ was examined by person } f, \\ 0 & \text{otherwise.} \end{cases} \qquad (11.12)$$

The total contribution of the crossed factor,

$$W_{0,f(i,j)} + W_{1,g(i,j)}\,, \qquad (11.13)$$

where f(i, j) denotes the teacher and g(i, j) the examiner of student i in school j, can now be written as

$$\sum_{f=1}^{F} \bigl( W_{0f}\, b_{f,ij} + W_{1f}\, c_{f,ij} \bigr)\,. \qquad (11.14)$$

The dummy variables $b_f$ and $c_f$ have random slopes that are correlated for the same f and uncorrelated for different f. Restrictions must be made on the variances and covariances of these random slopes, namely, exactly those given also in (11.10).

11.3.3 Social networks

Social network analysis studies relations between actors, e.g., liking or friendship between persons. The most usual kind of relationship considered in social network analysis is binary (e.g., being a friend versus not being a friend), cf. Wasserman and Faust (1994). Suppose here that $Y_{fg}$ indicates a characteristic (e.g., the strength) of the directed relationship from actor f to actor g, measured on a continuous scale, and that measurements of such relations are available for the pairs in a set of F actors. For such data the following model was discussed by Snijders and Kenny (1999).
The model reads

$$Y_{fg} = \mu + A_f + B_g + U_{fg} + R_{fg}\,, \qquad (11.15)$$

where $A_f$ is the random effect of actor f in the role of sender, $B_g$ is the random effect of actor g in the role of receiver, $U_{fg}$ is the dyadic reciprocity effect, and $R_{fg}$ is the residual. All effects are assumed to be independent, except that $U_{fg} = U_{gf}$, and that the effects $A_f$ and $B_f$ of the same actor may be correlated. Like in the preceding subsection, each actor has two effects in this formulation, corresponding to the two roles of sender and receiver.

To formulate this as a hierarchical linear model, the dyads (unordered pairs of actors) are taken as the level-two units and the two directed relations within a dyad as the level-one units. The dyads are numbered arbitrarily by $j = 1,\dots,F(F-1)/2$, and the two directed relations within dyad j are indexed by i = 1, 2, with outcomes denoted by $Z_{ij}$. Dummy variables indicate the sender and the receiver of each relationship: $s_{fij} = 1$ if person f is the sender of the relationship expressed by variable $Z_{ij}$, and 0 otherwise; $r_{fij} = 1$ if f is the receiver of relationship $Z_{ij}$, and 0 otherwise. Note that this implies that $s_{f1j} = r_{f2j}$ and $s_{f2j} = r_{f1j}$.

Recasting the reciprocity effect $U_{fg} = U_{gf}$ for dyad j in the notation $U_j$, and recasting the residual $R_{fg}$ as $R_{ij}$, model (11.15) can be reformulated as

$$Z_{ij} = \mu + \sum_{f=1}^{F} \bigl( A_f\, s_{fij} + B_f\, r_{fij} \bigr) + U_j + R_{ij}\,. \qquad (11.16)$$

In other words, the sender, or outgoingness, effect is the random slope of the dummy variables $s_f$; the receiver, or popularity, effect is the random slope of the dummy variables $r_f$; and the reciprocity effect is the random intercept at level two. It is again necessary to make restrictions on the variances and covariances of the random effects. Covariances between the effects for different persons f and g are assumed to be zero. Further, the slopes for all persons have the same variances and covariances:

$$\begin{aligned} \mathrm{var}(A_1) &= \mathrm{var}(A_2) = \dots = \mathrm{var}(A_F)\,,\\ \mathrm{var}(B_1) &= \mathrm{var}(B_2) = \dots = \mathrm{var}(B_F)\,,\\ \mathrm{cov}(A_1, B_1) &= \mathrm{cov}(A_2, B_2) = \dots = \mathrm{cov}(A_F, B_F)\,. \end{aligned} \qquad (11.17)$$

It is clear that, if variables associated with the sender, the receiver, or the (directed) pair (f, g) are available, these can be used as variables in the fixed part of (11.16), cf. Snijders and Kenny (1999). There may, of course, be more effects in the social network than the effects expressed in (11.16), e.g., transitivity effects ('a friend of my friend is my friend') or subgroup effects. If such effects are important and cannot be modeled by fixed effects of available covariates, the models of this section are not adequate.

Example 11.2 Communication between high school teachers.
In a study of relations between high school teachers, Heyl (1996) asked teachers in 17 schools about their communication on a number of topics. There were 195 responding teachers, nested within schools, and forming 1,044 dyads. Of the many dimensions of contacts and the various covariates investigated, we present here the example of communication about individual pupils and the effect of the teachers' gender on this communication. Frequency of communication between teachers was reported by the teachers on a 5-point scale, ranging from 1 = 'less than 4 times per year' to 5 = '(almost) daily'.

The model is as indicated above, but the schools are also a nesting factor. The schools define level three, and the random effects of senders and receivers are nested within schools. This means that there is no (fourth) dummy level; the number of dummy variables for senders and receivers is the maximum number of teachers in a single school (cf. formula (11.5)), and these dummy variables have random effects at level three. The parameter estimates of a model without explanatory variables (the 'empty model') are reported as Model 1 in Table 11.2.

Table 11.2 Estimates for two social network models.

[Table 11.2: fixed effects (intercept, sender's gender, receiver's gender, similar gender F–F) and variance components (school intercept, sender variance, receiver variance, sender–receiver covariance, reciprocity, residual) with standard errors, and the deviance, for Models 1 and 2; most values are illegible in this reproduction. Legible Model 1 entries include a sender variance of 0.47, a receiver variance of 0.16, a sender–receiver covariance of 0.12 (S.E. 0.03), and a residual variance of 0.33.]

Except for the school intercept variance, all variance components are large, many times larger than their standard errors. The largest variance component is that of the senders: teachers differ strongly in the frequency of communication which they report. The reciprocity component also is important: when teacher f reports much communication with teacher g, then it is likely that, conversely, teacher g also reports much communication with f. The correlation between sender and receiver effects is $\rho(A_f, B_f) = 0.12/\sqrt{0.47 \times 0.16} = 0.44$, a considerable value. This means that teachers who report that they communicate much with others are reported about by others in a corresponding way.

The model with fixed effects of teacher's gender is reported as Model 2. The sender's gender, the receiver's gender, and the similarity of these two all are potentially relevant explanatory variables. These are represented by dummy variables, where the male category is the reference, and where the similarity variable is 1 only if sender as well as receiver are female, and 0 otherwise. The parameter estimates indicate that only the similarity effect is significant: the combination of two female teachers leads to a considerably greater frequency of communication than the three other gender combinations. This gender effect explains part of the variances of the sender, the receiver, and the reciprocity effects.

Undirected relations

Social network data sometimes are symmetric by nature. Continuing the example above, this is the case if the liking between persons f and g is expressed by an outside observer who indicates the value of the mutual liking between the persons; or if, instead of liking, $Y_{fg}$ expresses the objective amount of joint activity of the two persons. In this case the distinction between sender and receiver vanishes, together with the reciprocity effect.
What remains of model (11.15) is

  Y_fg = mu + A_f + A_g + R_fg .    (11.18)

Since there is no reciprocity effect, it is superfluous to use a second level for the reciprocity effect; a two-level model suffices, like in the case of multiple roles. The pairs (f, g) for f < g are arbitrarily renumbered as i = 1, ..., F(F-1)/2, and the observations Y_fg recast as Z_i. Define the dummy variables d_fi for f = 1, ..., F by

  d_fi = 1  if person f is involved in pair i,
  d_fi = 0  if person f is not involved in this pair.    (11.19)

This definition leads to the reformulation of (11.18) as

  Z_i = mu + SUM_f d_fi A_f + R_i .    (11.20)

The person effect A_f is the random slope of the corresponding dummy variable, and the constraint is that these person effects are uncorrelated and have a common variance.

12 Longitudinal Data

Longitudinal data, in which the same variable is measured repeatedly over time for the same individuals, have a nested structure: the measurements are nested within the individuals. Such data can therefore be analyzed with the hierarchical linear model, the measurement occasions being the level-one units and the individuals the level-two units.

One of the advantages of the hierarchical linear model approach to repeated measures deserves to be mentioned here. It is the possibility to deal with unbalanced data structures, i.e., repeated measures data with fixed measurement occasions where the data for some (or all) individuals are incomplete, or longitudinal data where some or even all individuals are measured at different sets of time points.

The subject of repeated measures analysis is too vast to be treated in a single chapter. For a more extensive treatment of this topic we refer to, e.g., Maxwell and Delaney (1990) or Crowder and Hand (1990). This chapter only explains the basic hierarchical linear model formulation of models for repeated measures. In economics this type of model is discussed mainly under the heading of panel data, e.g., in Baltagi (1995) and Chow (1984).

This chapter is about the two-level structure of measurements within individuals. When the individuals in their turn are nested in groups, the data have a three-level structure: longitudinal measurements nested within individuals nested within groups.
Models for such data structures can be obtained by adding the group level as a third level to the models of this chapter. Such three-level models are not explicitly treated in this chapter. The three-level extension of the fully multivariate model of Section 12.1.3 is the same as the multivariate multilevel model of Chapter 13.

12.1 Fixed occasions

In fixed occasion designs, there is a fixed set t = 1, ..., m of measurement occasions. This could be, e.g., in a study of some educational or therapeutic program, an intake test, pretest, mid-program test, post-test, and follow-up test (m = 5). Another example is a study of attitude change in early adulthood, with attitudes measured shortly after each birthday from the 18th to the 25th year of age. If data are complete, each individual has provided information on all these occasions. It is quite common, however, that data are incomplete. When they are, we assume that the absent data are missing at random, i.e., the fact that they are missing does not itself provide relevant information about the studied phenomena.

The measurement occasions are denoted t = 1, ..., m, but each individual may have a smaller number of measurements because of missing data. Y_ti denotes the measurement for individual i at occasion t. It is allowed that, for any individual, some measurements are missing. Even individuals with only one measurement do not need to be deleted from the dataset, although they contribute, of course, only little information.

Note that, differently from the other chapters, the level-one units now are the measurement occasions indexed by t, while the level-two units are the individuals indexed by i.

The models treated in this section differ primarily with respect to the random part. There are three kinds of motives for choosing between these models.
The first motive is that if one works with a random part that is not an adequate description of the dependence between the m measurements (i.e., it does not satisfactorily represent the covariance matrix of these measurements), then the standard errors for estimated coefficients in the fixed part are not reliable. Hence, the tests for these coefficients also are unreliable. This motive points to the least restrictive model, i.e., the fully multivariate model, as the preferred one.

The second motive is the interpretation of the random part. Simpler models often allow a nicer and easier interpretation than more complicated models. This point of view favors the more restrictive models, e.g., the random intercept (compound symmetry) and random slope models.

The third motive is the possibility to get results at all, and the desirability of standard errors that are not unnecessarily large (while not having a bias at the same time). If the amount of data is relatively small, the estimation algorithms may not converge for models that have a large number of parameters, such as the fully multivariate model. Or even if the algorithm does converge, having a bad 'data-to-parameter ratio' may lead to large standard errors. This suggests that, if one has a relatively small data set, one should not entertain models with too many parameters.

Concluding, one normally should use the simplest model for the random part (i.e., the model having the least parameters and being the most restrictive) that still yields a good fit to the data. The model for the random part can be selected with the aid of deviance tests (see Section 6.2).

12.1.1 The compound symmetry model

The classical model for repeated measures, called the compound symmetry model (e.g., Maxwell and Delaney, 1990), is the same as the random intercept model of Chapter 4.
In this model, each measurement is decomposed into the population mean at the given occasion plus an individual-dependent deviation and a residual. Denoting the population mean at measurement occasion t by mu_t, the model is

  Y_ti = mu_t + U_i + R_ti .    (12.1)

The individual-dependent deviations are modeled by the U_i; the U_i and R_ti are independent, normally distributed random variables with expectation 0 and variances tau0^2 for the U_i and sigma^2 for the R_ti.

Note that the fixed part of this model does not contain a constant term, but a separate mean mu_t for each occasion. This can be expressed by the following formula. Let d_ht be m dummy variables defined for h = 1, ..., m by

  d_ht = 1  if t = h,
  d_ht = 0  if t <> h.    (12.2)

Then the fixed part mu_t in (12.1) can be written as mu_t = SUM_h mu_h d_ht, and the compound symmetry model can be formulated as

  Y_ti = SUM_h mu_h d_ht + U_i + R_ti ,    (12.3)

the usual form of the fixed part of a hierarchical linear model.

Example 12.1 Development of teachers' evaluations.
Brekelmans and Créton (1995) made a study of the development over time of evaluations of teachers by their pupils. Starting from the first year of their teaching career, teachers were evaluated on their interpersonal behavior in the classroom. This happened repeatedly, at intervals of about one year. In this example, results are presented about the 'proximity' dimension, representing the degree of cooperation or closeness between a teacher and his or her students. The higher the proximity score of a teacher, the more cooperation is perceived by his or her students.

There are four measurement occasions: after 0, 1, 2, and 3 years of experience. Thus, the time variable t assumes the values 0 through 3. A total of 81 teachers was studied. The number of observations at the 4 moments decreased from 46 at t = 0 to 32 at t = 3. The non-response at the various moments may be considered to be random.

First, two models are considered with a random intercept only, i.e., with a compound symmetry structure for the covariance matrix. The first is the empty model of Chapter 4. In this model it is assumed that the 4 measurement occasions have the same population mean. The second model is model (12.1), which allows the means to vary freely over time. In this case, writing out (12.3) results in the model

  Y_ti = mu_1 d_1t + mu_2 d_2t + mu_3 d_3t + mu_4 d_4t + U_i + R_ti .

Applying definition (12.2) implies that the expected score for an individual, e.g., at time t = 2, is

  E Y_2i = mu_1 x 0 + mu_2 x 0 + mu_3 x 1 + mu_4 x 0 = mu_3 .

Table 12.1 Estimates for random intercept models.

[Table 12.1 is only partly legible in this reproduction. Model 1 is the empty model with one common mean; Model 2 fits a separate mean mu_1, ..., mu_4 for each occasion. The legible values are, for Model 2, the level-two variance tau0^2 = 0.121 and the level-one variance sigma^2 = 0.074, and the deviances: 123.39 for Model 1 and 118.35 for Model 2.]

The results are in Table 12.1. It is dangerous to trust results for the compound symmetry model before its assumptions have been tested, because standard errors of fixed effects may be incorrect if one uses a model with a random part that has an unsatisfactory fit. Later we will fit more complicated models and show how the assumption of compound symmetry can be tested.

The results suggest that individual (level-two) variation is more important than random differences between occasions (level-one variation). Further, the mean seems to increase from time t = 0 to time t = 1 and then to decrease again. However, the deviance test for the difference in mean between the four time points is not significant: chi-squared = 123.39 - 118.35 = 5.04, d.f. = 3, p > 0.10.

Just like in our treatment of the random intercept model in Chapter 4, any number of relevant explanatory variables can be included in the fixed part. Often there are individual-dependent explanatory variables Z_k (k = 1, ..., p), e.g., traits or background characteristics. If these variables are categorical, they can be represented by dummy variables. Such individual-dependent variables are level-two variables in the multilevel approach, and they are called between-subjects variables in the terminology of repeated measures analysis. In addition, there often are one or more numerical variables describing the measurement occasion.
Denote such a variable by s(t); this can be, e.g., a measure for the time elapsed between the occasions, or the rank order of the occasion. Such a variable is called a within-subjects variable. In addition to the between-subjects and within-subjects variables having main effects, they may have interaction effects with each other. These within-between interactions are a kind of cross-level interactions in the multilevel terminology. Substantive interest in repeated measures analysis often focuses on these interactions.

When the main effect parameter of z_k is denoted gamma_k and the interaction effect parameter between z_k and s(t) is denoted eta_k, the model is given by

  Y_ti = SUM_k gamma_k z_ki + SUM_k eta_k z_ki s(t) + SUM_h mu_h d_ht + U_i + R_ti .    (12.4)

The fixed part is an extension of (12.1), or of the equivalent form (12.3); the random part is still the same. Therefore this still is a compound symmetry model. Inclusion of this kind of variables in the fixed part suggests, however, that the random part could also contain random slopes of time variables such as s(t). Therefore we defer giving an example of the fixed part (12.4) until after the treatment of such random slopes.

Classical analysis of variance methods are available to estimate and test parameters of the compound symmetry model if all data are complete (e.g., Maxwell and Delaney, 1990). The hierarchical linear model formulation of this model, and the algorithms and software available for it, permit the analysis also of incomplete data without any additional complications.

Covariance matrix

In the fixed occasion design one can talk about the complete data vector (Y_1i, ..., Y_mi). Even if there would be no subject at all with complete data, the complete data vector would still make sense from a conceptual point of view. The compound symmetry model (12.1), (12.3), or (12.4) implies that, for the complete data vector, all variances are equal and also all covariances are equal. The expression for the covariance matrix of the complete data vector, conditional on the explanatory variables, is the m x m matrix

            | tau0^2 + sigma^2   tau0^2             ...   tau0^2            |
  Sigma(Y) =| tau0^2             tau0^2 + sigma^2   ...   tau0^2            |    (12.5)
            |   ...                ...              ...     ...             |
            | tau0^2             tau0^2             ...   tau0^2 + sigma^2  |
This matrix is referred to as the compound symmetry covariance matrix. In this covariance matrix, all residual variances are equal and all residual within-subject correlations are equal to

  rho_I = rho(Y_ti, Y_si) = tau0^2 / (tau0^2 + sigma^2)    (t <> s),    (12.6)

the residual intraclass correlation which we encountered already in Chapter 4.

The compound symmetry model is a very restrictive model, and often an unlikely one. For example, if measurements are ordered in time, the correlation often is larger between nearby measurements than between measurements that are far apart. (Only for m = 2 this condition is not so very restrictive. In this case, formula (12.5) only means that the two measurements have the same variance and a positive correlation.)

Example 12.2 Covariance matrix for teachers' evaluations.
In Table 12.1 the estimates sigma^2 = 0.074 and tau0^2 = 0.121 were obtained. This implies that the measurement variances are 0.121 + 0.074 = 0.195 and the within-subjects correlations are rho_I = 0.121/0.195 = 0.62. This number, however, is conditional on the validity of the compound symmetry model; we shall see below that the compound symmetry model does not fit very well to these data.

12.1.2 Random slopes

There are various ways in which the assumption of compound symmetry (which states that the covariance matrix has constant variances and also constant covariances, cf. (12.5)) can be relaxed. In the hierarchical linear model framework, the simplest way is to include one or more random slopes in the model. This makes sense if there is some meaningful dimension, such as time, underlying the measurement occasions. It will be assumed here that the index t used to denote the measurement occasions is a meaningful numerical variable, and that it is relevant to consider the regression of Y on t. This variable will be referred to as 'time'. It is easy to modify the notation so that some other numerical function of the measurement occasion t gets a random slope. It is assumed further that there is some meaningful reference value for t, denoted t0.
This could refer, e.g., to one of the time points, such as the first. The choice of t0 affects only the parameter interpretation, not the fit of the model. Since the focus now is on the random rather than on the fixed part, the precise formulation of the fixed part is left out of the formulae.

The model with a random intercept and a random slope for t is given by

  Y_ti = fixed part + U_0i + U_1i (t - t0) + R_ti .    (12.7)

This model means that the rates of increase have a random, individual-dependent component U_1i, in addition to the individual-dependent random deviations U_0i, which affect all values Y_ti in the same way. The random effect of time can also be described as a random time-by-individual interaction. The value t0 is subtracted from t in order to let the intercept variance refer not to the (possibly meaningless) value t = 0 but to the reference point t = t0, cf. p. 69. The variables (U_0i, U_1i) are assumed to have a joint bivariate normal distribution with expectations 0, variances tau0^2 and tau1^2, and covariance tau01.

The variances and covariances of the measurements Y_ti, conditional on the explanatory variables, are now given by

  var(Y_ti) = tau0^2 + 2 tau01 (t - t0) + tau1^2 (t - t0)^2 + sigma^2 ,    (12.8)
  cov(Y_ti, Y_si) = tau0^2 + tau01 {(t - t0) + (s - t0)} + tau1^2 (t - t0)(s - t0) ,

where t <> s. These formulae express the fact that the variances and covariances of the outcome variables are variable over time. This is called heteroscedasticity. The variance is minimal at t = t0 - tau01/tau1^2 (if t would be allowed to assume any value). Further, the correlation between different measurements depends on their spacing (as well as on their position).

Extensions to more than one random slope are obvious; e.g., a second random slope could be given to the squared value (t - t0)^2. In this way, one can perform a polynomial trend analysis to improve the fit of the random part. This means that one fits random slopes for a number of powers of (t - t0) to obtain a model that has a good fit to the data and where unexplained differences between individuals are represented as random individual-dependent regressions of Y on (t - t0), (t - t0)^2, (t - t0)^3, etc. Other functions than polynomials can also be used, e.g., splines (see Section 12.2.2). Polynomial trend analysis is discussed also in Maas and Snijders (1999).

Example 12.3 Random slope of time in teachers' evaluations.
Continuing the earlier example of teachers' evaluations, note that the observations are spaced a year apart, and a natural variable for the occasions is s(t) = t, the occasion number. This variable is a time dimension which counts the number of years of experience and will be referred to as experience. A random slope of experience now is added (Model 3). The reference value for the time dimension, denoted t0 in formula (12.7), is taken as t0 = 0, corresponding to the first measurement occasion, where the teacher does not yet have any experience.

It is investigated now whether the teacher's gender can explain part of the differences between teachers in average level and in the rate of change. Thus, gender is used as a between-subjects variable. A dummy variable z is used, with value 0 for men and 1 for women. The effect of gender on the rate of change, i.e., the interaction effect of gender with experience, is represented by the product of z with t. The resulting model is an extension of a model of the type of (12.4) with a random slope,

  Y_ti = mu_t + gamma_1 z_i + gamma_2 z_i t + U_0i + U_1i t + R_ti .    (12.9)

Parameter gamma_1 is the main effect for gender, while gamma_2 is the interaction effect between gender and experience.

The results are in Table 12.2. Comparing the deviances of Models 2 and 3 shows that the random slope of experience is quite significant: chi-squared = 118.35 - 102.66 = 15.69, d.f. = 2, p < 0.001. This implies that the compound symmetry model, Model 2 of Table 12.1, is not an adequate model for these data. The effects of gender and the gender x experience interaction, however, are not significant, as can be concluded from the t-ratios -0.133/0.169 = -0.79 and 0.089/0.057 = 1.56.
Table 12.2 Estimates for random slope models.

[Table 12.2 is only partly legible in this reproduction. Model 3 is Model 2 of Table 12.1 plus a random slope of experience; Model 4 is Model 3 plus the fixed effects of gender and gender x experience. The legible values are, for Model 3: intercept variance 0.233 (S.E. 0.056), slope variance 0.017 (S.E. 0.006), intercept-slope covariance -0.052 (S.E. 0.017), level-one residual variance 0.048, and deviance 102.66; for Model 4: slope variance 0.017 and intercept-slope covariance -0.053.]

The fitted covariance matrix for the complete data vector under Model 3 has elements given by (12.8). Filling in the estimated parameters tau0^2, tau01, tau1^2, and sigma^2 from Table 12.2 into (12.8) yields the covariance matrix

            | 0.281  0.181  0.129  0.077 |
  Sigma(Y) =| 0.181  0.194  0.111  0.076 |    (12.10)
            | 0.129  0.111  0.141  0.075 |
            | 0.077  0.076  0.075  0.122 |

and the correlation matrix

            | 1.000  0.775  0.648  0.416 |
            | 0.775  1.000  0.671  0.494 |
            | 0.648  0.671  1.000  0.572 |
            | 0.416  0.494  0.572  1.000 |

These matrices show that the variance becomes smaller as experience grows and that correlations are smaller between more widely separated time points. However, these values are conditional on the validity of the model with one random slope. We return to these data below, and will then investigate the adequacy of this model by testing it against the fully multivariate model.

12.1.3 The fully multivariate model

What is the use of restrictions, such as compound symmetry, on the covariance matrix of a vector of longitudinal measurements? There was a time (say, before 1970) when the compound symmetry model was used for repeated measures because, in practice, it was impossible to get results for other models. In that time, also a complete data matrix was required. These limitations were overcome gradually between 1970 and 1990.

One argument for using models with restrictions on the covariance matrix is that the number of parameters should be kept small, to refrain from overfitting and to avoid convergence problems in the calculation of the estimates.
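As a numerical check on Example 12.3 (not part of the original text): substituting the Model 3 estimates into (12.8), with t0 = 0, reproduces the fitted covariance and correlation matrices. A short Python sketch:

```python
import numpy as np

# Model 3 estimates (Table 12.2): tau0^2, tau01, tau1^2, sigma^2; reference value t0 = 0
tau0_sq, tau01, tau1_sq, sigma_sq = 0.233, -0.052, 0.017, 0.048
t = np.arange(4)  # occasions t = 0, 1, 2, 3

# Formula (12.8): element (t, s) equals tau0^2 + tau01*(t + s) + tau1^2 * t * s,
# with sigma^2 added on the diagonal
Sigma = tau0_sq + tau01 * (t[:, None] + t[None, :]) + tau1_sq * np.outer(t, t) \
        + sigma_sq * np.eye(4)
sd = np.sqrt(np.diag(Sigma))
Corr = Sigma / np.outer(sd, sd)
print(Sigma.round(3))
print(Corr.round(3))
```

The printed matrices agree with the values shown above, e.g., var at t = 0 is 0.233 + 0.048 = 0.281 and the (0, 1) correlation is 0.181/sqrt(0.281 x 0.194) = 0.775.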
Another, more appealing, argument is that sometimes the parameters of the models with restricted covariance matrices have nice interpretations. This is the case, e.g., for the random slope variance of model (12.7). But when there is enough data, one can also fit a model without restrictions on the covariance matrix, the fully multivariate model. This also provides a benchmark to assess the goodness of fit of the models that do have restrictions on the covariance matrix.

Some insight into the extent to which a given model constrains the covariance matrix is obtained by looking at the number of parameters. The covariance matrix of an m-dimensional vector has m(m + 1)/2 free parameters (m variances and m(m - 1)/2 covariances). The compound symmetry model has only 2 parameters. The model with one random slope has 4 parameters; the model with q random slopes has {(q + 1)(q + 2)/2} + 1 parameters, namely, (q + 1)(q + 2)/2 parameters for the random part at level two and 1 parameter for the variance of the random residual. This shows that using some random slopes will quickly increase the number of parameters and thereby lead to a better fitting covariance matrix. The maximum number of random slopes in a conventional random slope model for the fixed occasions design is q = m - 2, because with q = m - 1 there is one parameter too many.

This suggests how to achieve a perfect fit for the covariance matrix. The fully multivariate model is formulated as a model with a random intercept and m - 1 random slopes at level two, and without a random part at level one. Alternatively, the random part at level two may consist of m random slopes and no random intercept. It is clear that, when all variances and covariances between m random slopes are free parameters, the number of parameters in the random part of the model is indeed m(m + 1)/2.
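These parameter counts are easily tabulated; a small Python sketch (q = 0, a random intercept only, corresponds to the compound symmetry model with its 2 parameters):

```python
# Random-part parameter counts for m fixed occasions (see the formulas in the text):
# fully multivariate model: m(m+1)/2 free variances and covariances;
# model with q random slopes: (q+1)(q+2)/2 covariance parameters + 1 residual variance.
def full_multivariate(m: int) -> int:
    return m * (m + 1) // 2

def random_slopes(q: int) -> int:
    return (q + 1) * (q + 2) // 2 + 1

m = 4
print(full_multivariate(m))                    # 10 for m = 4
print([random_slopes(q) for q in range(m)])    # 2, 4, 7, 11: q = m - 1 already overshoots
```

For m = 4, q = m - 2 = 2 slopes give 7 parameters, still below the 10 of the saturated covariance matrix, while q = m - 1 = 3 would give 11, one parameter too many.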
The fully multivariate model is little more than a tautology,

  Y_ti = fixed part + U_ti .    (12.11)

This model is reformulated more recognizably as a hierarchical linear model by the use of dummy variables indicating the measurement occasions. These dummies were defined already in (12.2):

  d_ht = 1  if t = h,
  d_ht = 0  if t <> h.    (12.2)

This leads to the formulation

  Y_ti = fixed part + SUM_h U_hi d_ht .    (12.12)

The variables U_hi for h = 1, ..., m are random at level two, with expectations 0 and an unconstrained covariance matrix. (This means that all variances and all covariances must be freely estimated from the data.) This model does not have a random part at level one! It follows immediately from (12.11) that the covariance matrix of the complete data vector, conditional on the explanatory variables, is identical to the covariance matrix of (U_1i, ..., U_mi). This model for the random part is saturated in the sense that it yields a perfect fit for the covariance matrix.

The multivariate model of this section can be applied also to data that do not have a longitudinal nature. This means that the m variables can be completely different, measured on different scales, provided that they have (approximately) a multivariate normal distribution. Thus, this model and the corresponding multilevel software offer a way for the multivariate analysis of incomplete data (provided that missingness is at random as defined, e.g., in Little and Rubin, 1987).

Example 12.4 Incomplete paired data.
For the comparison of the means of two variables measured for the same individuals, the paired-samples t-test is a standard procedure. The fully multivariate model provides the possibility to carry out a similar test if the data are incomplete. This is an example of the most simple multilevel data structure: all level-two units (individuals) contain either 1 or 2 level-one units (measurements).
Such a data structure may be called an incomplete pretest–post-test design if one measurement was taken before, and the other after, some kind of intervention.

In the example of the teachers' evaluations, consider the test of the null hypothesis that the average evaluation on the proximity scale is equal at the two time points t = 0 and t = 1. Of the 49 respondents for whom a measurement on at least one of these moments is available, there are 35 with complete data. In the hierarchical linear model approach to multivariate analysis, all these data can be used. The REML estimation method is used (cf. Section 4.6), because this will reproduce exactly the conventional t-test if the data are complete (i.e., all individuals have measurements for both variables).

The model fitted is

  Y_ti = mu + gamma_1 d_t + U_ti ,

where the dummy variable d_t equals 1 or 0, respectively, depending on whether or not t = 1. The null hypothesis that the means are identical for the two time points can be represented by 'gamma_1 = 0'.

Table 12.3 Estimates for incomplete paired data.

  Fixed Effect              Coefficient   S.E.
  mu       Constant term       0.577     0.081
  gamma_1  Effect time 1       0.151     0.066
  Deviance                      5.05

The results are in Table 12.3. The estimated covariance matrix of the complete data vector is

  | 0.309  0.168 |
  | 0.168  0.184 |

The test of the equality of the two means, which is the test of gamma_1, is significant (t = 0.151/0.066 = 2.30, two-sided p < 0.05).

Example 12.5 Fully multivariate model for teachers' evaluations.
Continuing the example of teachers' evaluations, we now fit the fully multivariate model to the data for the four time points t = 0, 1, 2, 3. First we consider the model that may be called the two-level multivariate empty model, of which the fixed part contains only the dummies for the effects of the measurement occasions. This is an extension of Model 3 of Table 12.2. The estimates of the fixed effects are presented in Table 12.4.

Table 12.4 The empty multivariate model for the teachers' evaluations.

  Fixed Effect               Coefficient   S.E.
  mu_1  Mean at time 0          0.587
  mu_2  Mean at time 1          0.717     0.066
  mu_3  Mean at time 2          0.650     0.052
  mu_4  Mean at time 3          0.487     0.057
  Deviance                      91.10

The estimated covariance matrix of the complete data vector is

  | 0.303  0.173  0.111  0.112 |
  | 0.173  0.196  0.098  0.118 |    (12.13)
  | 0.111  0.098  0.120  0.109 |
  | 0.112  0.118  0.109  0.141 |

These estimates are the REML (residual maximum likelihood) estimates (cf. Section 4.6) of the parameters of a multivariate normal distribution with incomplete data. They differ slightly from the estimates that would be obtained in the usual way, where means and variances are calculated from available data with pairwise deletion of missing values. The REML estimates are more efficient in the sense of having smaller standard errors.

An eyeball comparison with the fitted covariance matrix for the model with one random slope, given in (12.10), suggests that the differences are minor. The deviance difference is chi-squared = 102.66 - 91.10 = 11.56, d.f. = 6 (the covariance matrix of Model 3 has 4 free parameters, this covariance matrix has 10), with 0.05 < p < 0.10. Thus, the fully multivariate model results in a fit that is almost, but not quite, significantly better than the fit of the model with one random slope. Covariance matrix (12.13) suggests that the main difference between the measurements is that the first measurement (t = 0) has a larger variance (0.303) than the later measurements. The second measurement also has a somewhat larger variance (0.196) than the last two measurements (0.120 and 0.141, respectively).

The fully multivariate model may have a slightly better fit than the random slope model, but an advantage of the random slope model is the clearer interpretation of the random part in terms of between-subject differences. The rate of change in the models of Table 12.2 has a random component U_1i with standard deviation sqrt(0.017) = 0.13. This reflects substantial unexplained differences between the subjects, and suggests that it is worthwhile to look for within-between interaction effects.
Such an interpretation is not directly obvious from the results of the fully multivariate model.

As a next step, the effect of the teachers' self-perception on the same proximity scale is studied. This is a changing explanatory variable, measured at the same moments as the pupils' evaluations which constitute the dependent variable. The estimates of the fixed effects are presented in Table 12.5.

Table 12.5 The multivariate model with effect of self-evaluation.

[Table 12.5 is largely illegible in this reproduction; the legible entry is the coefficient of the self-evaluation, 0.469 (S.E. 0.070).]

The estimated residual covariance matrix of the complete data vector is

  | 0.228  0.115  0.070  0.044 |
  | 0.115  0.120  0.049  0.057 |    (12.14)
  | 0.070  0.049  0.092  0.061 |
  | 0.044  0.057  0.061  0.084 |

The effect of the self-evaluation is strongly significant (t = 0.469/0.070 = 6.70, p < 0.001), which indicates that teachers and pupils have at least some agreement in their perceptions of the teacher's degree of cooperation. The time effects in this model are controlled for the effect of the self-evaluation. The residual effect of time, controlling for the teachers' own perceptions, is highest at time t = 1.

Concluding remarks about the fully multivariate model

There is nothing special about this formulation as a hierarchical linear model of a fully multivariate model for repeated measures with a fixed occasion design. It is just a mathematical formula. What is special is that available algorithms and software for multilevel analysis accept this formulation, even without a random part at level one, and can calculate ML or REML parameter estimates. In this way, multilevel software can compute ML or REML estimates for multivariate normal distributions with incomplete data, and also for multivariate regression models with incomplete data and with sets of explanatory variables that are different for the different dependent variables. Methods for such models have been in existence for a longer period (see Little and Rubin, 1987), but software was not readily available before the development of multilevel software.
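As a small illustration of this use of mixed-model software (a sketch with Python's statsmodels, not the packages discussed in this book; the data, names, and values are invented): an incomplete paired comparison in the spirit of Example 12.4. With m = 2 the compound symmetry restriction amounts only to equal variances, so a random intercept fit yields a REML analogue of the paired-samples t-test while also using the incomplete pairs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
u = rng.normal(0.0, 0.4, n)               # shared individual effect
y0 = 0.60 + u + rng.normal(0.0, 0.3, n)   # pretest
y1 = 0.75 + u + rng.normal(0.0, 0.3, n)   # post-test; true mean difference 0.15

rows = []
for i in range(n):
    if rng.random() < 0.9:                # some pretests missing at random
        rows.append({"ind": i, "time": 0, "y": y0[i]})
    if rng.random() < 0.8:                # some post-tests missing at random
        rows.append({"ind": i, "time": 1, "y": y1[i]})
df = pd.DataFrame(rows)

# REML random intercept fit; the coefficient of `time` estimates the mean
# difference, using complete and incomplete pairs alike
result = smf.mixedlm("y ~ time", df, groups=df["ind"]).fit(reml=True)
diff, se = result.fe_params["time"], result.bse["time"]
print(round(diff, 3), round(se, 3))
```

With complete pairs this reproduces the conventional paired t-test; individuals with only one measurement still contribute information on the means and variances.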
The use of the dummy variables (12.2) has the nice feature that the covariance matrix obtained for the random slopes is exactly the covariance matrix for the complete data vector. However, one could also give the random slopes to other variables. The only requirement is that there are random slopes for m linearly independent variables, depending only on the measurement occasion and not on the individual. For example, one could use powers of (t - t0), where t0 may be any meaningful reference point, e.g., the average of all values for t. This means that one uses the variables 1 = (t - t0)^0, (t - t0), ..., (t - t0)^(m-1), still with model specification (12.12). Since the first 'variable' is constant, this effectively means that one uses a random intercept and m - 1 random slopes.

Each model for the random part with a restricted covariance matrix is a submodel of the fully multivariate model. Therefore the fit of such a restricted model can be tested by comparing it to the fully multivariate model by means of a likelihood ratio (deviance) test (see Chapter 6).

For complete data, an alternative to the hierarchical linear model exists in the form of multivariate analysis of variance (MANOVA) and multivariate regression analysis. This is documented in many textbooks (e.g., Maxwell and Delaney, 1990; Stevens, 1996) and implemented in standard software such as SPSS and SAS. The advantage of these methods is the fact that the tests are exact, whereas the tests in the hierarchical linear model formulation are approximate. For incomplete multivariate data, however, exact methods are not available. Maas and Snijders (1999) (also see Snijders and Maas, 1996) elaborate the correspondence between the MANOVA approach and the hierarchical linear model approach.

12.1.4 Multivariate regression analysis

Because of the unrestricted multivariate nature of the fully multivariate model, it is not required that the outcome variables Y_ti are repeated measurements of conceptually the same variable. This model is applicable to any set of multivariate measurements for which a multivariate normal distribution is an adequate model. Thus, the multilevel approach yields estimates and tests for normally distributed multivariate data with randomly missing observations (also see Maas and Snijders, 1999).

The fully multivariate model is also the basis for multivariate regression analysis, but then the focus usually is on the fixed part. There are supposed to be individual-dependent explanatory variables Z_1 to Z_p. If the regression coefficients of the m dependent variables on the Z_k are allowed all to be different, the regression coefficient of outcome variable Y_h on explanatory variable Z_k being denoted gamma_hk, then the multivariate regression model with possibly incomplete data can be formulated as

  Y_ti = mu_t + SUM_h SUM_k gamma_hk d_ht z_ki + SUM_h U_hi d_ht .    (12.15)

This shows that the occasion dummies are fundamental, not only because they have the random slopes, but also because in the fixed part all variables are multiplied by these dummies. If some of the Z_k variables are not to be used for all dependent variables, then the corresponding cross-product terms d_ht z_ki can be dropped from (12.15).

12.1.5 Explained variance

Explained variance for longitudinal data in fixed occasion designs can be defined analogously to explained variance for grouped data as defined in Section 7.2. The proportion of explained variance at level one can be defined as the proportional reduction in prediction error for individual measurements, averaged over the m measurement occasions.
A natural baseline model is provided by a model which can be represented as

  Y_ti = μ_t + random part,

i.e., the fixed part depends on the measurement occasion but not on other explanatory variables, and the random part is chosen so as to provide a good fit to the data.

The proportion of explained variance at level one, R1², then is the proportional reduction in the average residual variance over the m occasions, when going from the baseline model to the model containing the explanatory variables.

The proportion of explained variance at level two, denoted R2², can be defined as the proportional reduction in prediction error for the average over the m measurements. Thus, it is given by the proportional reduction in the residual variance of the average, (1/m) Σ_{t=1}^{m} Y_ti.

If the compound symmetry model is adequate, its random part can be used for the baseline model. In this case the usual formulas for R1² and R2² are applicable, plugging in n = m in the definition of R2². If the compound symmetry model is not adequate, one should use the baseline model with the multivariate random part. This yields the baseline model

  Y_ti = μ_t + Σ_{s=1}^{m} U_si d_sti ,

which is the fully multivariate model without covariates, cf. (12.11) and (12.12).

When this baseline model is used, the proportions of explained variance can be related to the fitted complete data covariance matrix, Σ̂(Y). The value of R1² is the proportional reduction in the sum of diagonal values of this matrix when going from the baseline model to the model including the explanatory variables. For the explained variance at level two, note that the variance of the average of a vector of random variables is equal to the average of all elements of the covariance matrix. Therefore, R2² is equal to the proportional reduction in the sum of all values of the fitted complete data covariance matrix.

Example 12.6 Explained variance for teachers' evaluations.
Recall that the dependent variable in the examples of this chapter is the evaluation of the teacher by the pupils on the proximity scale.
We continue this example, now computing the proportion of variance explained by the teacher's own evaluation of him- or herself on this same scale, and the interaction of the self-evaluation with experience.

First suppose that the compound symmetry model is used. The baseline model then is Model 2 in Table 12.1. The variance per measurement is 0.121 + 0.072 = 0.193. The variance of the average over the m = 4 measurements is 0.121 + 0.072/4 = 0.139.

Table 12.6 Estimates for model with self-evaluation.

Fixed Effect                    Coefficient   S.E.
Effect time 1                      0.385      0.044
Effect time 2                      0.2…       …
Effect time 3                      0.2…       …
Effect of self-evaluation          0.5…       …

Random Effect            Parameter    Estimate   S.E.
Level-two (i.e., individual) variance:   var(U_i)    0.069    0.016
Level-one (i.e., occasion) variance:     var(R_ti)   0.066    …

Deviance   24.26

The results of the compound symmetry model with the effect of self-evaluation are in Table 12.6. In comparison with Model 1 of Table 12.1, inclusion of the fixed effect of self-evaluation has led especially to a smaller variance at the individual level. The residual variance per measurement is 0.069 + 0.066 = 0.135. The variance of the average over the m = 4 measurements is 0.069 + 0.066/4 = 0.085. Thus, the proportion of variance explained for level one is R1² = 1 − (0.135/0.193) = 0.30, while the proportion of variance explained for level two is R2² = 1 − (0.085/0.139) = 0.39.

Next suppose that the fully multivariate model is used. The estimated covariance matrix in the fully multivariate model is given in (12.13); the residual covariance matrix when controlling for the effect of self-evaluation is (12.14). The sum of diagonal values is 0.760 for the first covariance matrix and 0.524 for the second. Hence, calculated on the basis of the fully multivariate model, R1² = 1 − (0.524/0.760) = 0.31. The sum of all values is 2.202 for the first and 1.316 for the second matrix. This leads to R2² = 1 − (1.316/2.202) = 0.40. Comparing this to the values obtained above shows that for the calculations of R²
it does not make much difference which random part is being used. The calculations using the fully multivariate model are more reliable, but in most cases the simpler calculations using the compound symmetry ('random intercept') model will lead to almost the same values for R².

12.2 Variable occasion designs

In data collection designs with variable measurement occasions, there is no such thing as a complete data vector. The data are ordered according to some underlying dimension, e.g., time, and for each individual data are recorded at some set of time points which is not necessarily related to the time points at which the other individuals are observed. For example, body lengths are recorded at a number of moments during childhood and adolescence, the moments being determined by convenience rather than strict planning.

The notation can be the same as in the preceding section: for individual i, the dependent variable Y_ti is measured at m_i occasions, and t denotes the time of measurement. This 'time' variable can refer to age, clock time, etc., but also to some other dimension such as one or more spatial dimensions, the concentration of a poison, etc.

The number of measurements per individual, m_i, can be anything. It is not a problem that some individuals contribute only one observation. This number must not in itself be informative about the studied process, however. Therefore it is not allowed that the observation times t are defined as the moments when some event occurs (e.g., change of job); such data collection designs should be investigated by means of event history models. Greater numbers m_i give more information about intra-individual differences, of course, and with larger average m_i one will be able to fit models with a more complicated, and more precise, random part.
Because of the unbalanced nature of the data set, the random intercept and slope models are easily applicable, but the other models of the preceding section either have no direct analogue, or an analogue that is considerably more complicated. This section is restricted to random slope models, and follows the approach of Snijders (1996), where further elaboration and background material may be found. There exist more introductions to multilevel models for longitudinal data with variable occasion designs, e.g., Raudenbush (1995).

12.2.1 Populations of curves

An attractive way to view repeated measures in a variable occasion design, ordered according to an underlying dimension t (referred to as time), is as observations on a population of curves. (This approach can be extended to two- or more-dimensional ordering principles.) The variable Y for individual i follows a development represented by the function F_i(t), and the population of interest is the population of curves F_i. The observations yield snapshots of this function on a finite set of time points, with superimposed residuals R_ti, which represent incidental deviations and measurement error:

  Y_ti = F_i(t) + R_ti    (t = 1, ..., m_i).    (12.16)

Statistical modeling consists of determining an adequate class of functions F_i and investigating how these functions depend on explanatory variables.

12.2.2 Random functions

It makes sense here to consider models with functions of t as the only explanatory variables. This can be regarded as a model for random functions that has the role of a baseline model comparable to the role of the empty model in Chapter 4. Modeling can proceed by first determining an adequate random function model and subsequently incorporating the individual-based explanatory variables.

One could start by fitting the empty model, i.e., the random intercept model without any explanatory variables, but this is such a trivial model when one is modeling curves that it may also be skipped.
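To make the population-of-curves view of (12.16) concrete, the following sketch simulates such data, taking linear functions F_i (the simplest choice, treated next) and a variable number of irregular measurement occasions per individual. The numerical values are loosely inspired by the growth example of this section but are purely illustrative assumptions, not quantities from the text.

```python
import random

random.seed(42)

def simulate_curves(n_individuals=200, t0=5.0):
    """Simulate observations on a population of curves per (12.16),
    Y_ti = F_i(t) + R_ti  with  F_i(t) = beta0_i + beta1_i * (t - t0),
    at a variable number of irregular time points per individual.
    All numbers (means, SDs, ranges) are illustrative assumptions."""
    data = []  # rows: (individual, t, y)
    for i in range(n_individuals):
        beta0 = random.gauss(96.3, 4.5)   # intercept at t = t0
        beta1 = random.gauss(5.5, 1.3)    # slope: growth per unit of time
        m_i = random.randint(1, 8)        # number of occasions varies freely
        for _ in range(m_i):
            t = random.uniform(5.0, 10.0)                    # irregular times
            y = beta0 + beta1 * (t - t0) + random.gauss(0.0, 0.9)  # adds R_ti
            data.append((i, t, y))
    return data

data = simulate_curves()
counts = {}
for i, t, y in data:
    counts[i] = counts.get(i, 0) + 1
print(len(data), min(counts.values()), max(counts.values()))
```

Note how nothing forces the individuals to share time points, and how m_i may even be 1; this is exactly the data layout for which the random slope models below are designed.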
The next simplest model is a linear function,

  F_i(t) = β_0i + β_1i (t − t0) .

The value t0 is subtracted for the same reasons as in model (12.7) for the fixed occasion design. It can be some reference value within the range of observed values for t. If the β_hi (h = 0, 1) are split into their population average γ_h0 and the individual deviation U_hi = β_hi − γ_h0, similar to (6.2), this gives the following model for the observations:

  Y_ti = γ_00 + γ_10 (t − t0) + U_0i + U_1i (t − t0) + R_ti .    (12.17)

The population of curves now is characterized by the bivariate distribution of the intercepts β_0i at time t = t0 and the slopes β_1i. The average intercept at t = t0 is γ_00 and the average slope is γ_10. The intercepts have variance τ_0² = var(U_0i), the slopes have variance τ_1² = var(U_1i), and the intercept-slope covariance is τ_01 = cov(U_0i, U_1i). In addition, the measurements exhibit random deviations from the curve with variance σ² = var(R_ti). This level-one variance represents deviations from linearity together with measurement inaccuracy, and can be used to assess the fit of linear functions to the data.

Even when this model fits only moderately well, it can have a meaningful interpretation, because the estimated slope and slope variance still will give an impression of the average increase of Y per unit of time.

Example 12.7 Retarded growth.
In several examples in this section we present results for a data set of children with retarded growth. More information about this research can be found in Rekers-Mombarg et al. (1997). The data set concerns children who visited a paediatrician-endocrinologist because of growth retardation. Length measurements are available at irregular ages varying between 0 and 30 years, the moments being determined by convenience rather than strict planning. In the present example we consider the measurements for ages from 5.0 to 10.0 years. The linear growth model represented by (12.17) is fitted to the data. For this age period there are data for 335 children available, establishing a total of 1,888 length measurements.
Length is measured in centimeters (cm). Age is measured in years, and the reference age is chosen as t0 = 5 years, so the intercept refers to length at 5 years of age. Parameter estimates for model (12.17) are presented in Table 12.7.¹

¹ The notation with the τ parameters has nothing to do with the τ parameter used earlier in this chapter in (12.4), but is consistent with the notation in Chapter 5.

Table 12.7 Linear growth model for 5-10-year-old children with retarded growth.

Fixed Effect        Coefficient   S.E.
Intercept              96.3       …
Age                     5.53      0.08

Random Effect: Parameter                      Estimate   S.E.
Level-two (i.e., individual) random effects:
  τ0²  Intercept variance                      19.73     1.81
  τ1²  Slope variance for age                   1.65     0.16
  τ01  Intercept-slope covariance              −3.28     0.46
Level-one (i.e., occasion) variance:
  σ²   Residual variance                        0.88     …

Deviance   7299.87

The level-one standard deviation is so low (σ̂ = 0.9 cm) that it can be concluded that the fit of the linear growth model for the period from 5 to 10 years is quite adequate. Deviations from linearity will usually be smaller than 2 cm (which is slightly more than two standard errors). This is notwithstanding the fact that, given the large sample size, it is possible that the fit can be improved significantly from a statistical point of view by including non-linear growth terms.

The intercept parameters show that at 5 years, these children have an average length of 96.3 cm with a variance of τ0² + σ² = 19.73 + 0.88 = 20.61 and an associated standard deviation of √20.61 = 4.5 cm. The growth per year is γ10 + U_1i, which has an estimated mean of 5.5 cm and standard deviation of √1.65 = 1.3 cm. This implies that about 95 percent of these children have growth rates, averaged over this five-year period, between 5.5 − 2 × 1.3 = 2.9 and 5.5 + 2 × 1.3 = 8.1 cm per year. The slope-intercept covariance is negative: children who are relatively short at 5 years grow relatively fast between 5 and 10 years.

Polynomial functions

To obtain a good fit, however, one can try and fit more complicated random parts. One possibility is to use a polynomial random part.
This means that one or more powers of (t − t0) are given a random slope. This corresponds to polynomials for the function F_i,

  F_i(t) = β_0i + β_1i (t − t0) + β_2i (t − t0)² + ... + β_ri (t − t0)^r ,    (12.18)

where r is a suitable number, called the degree of the polynomial. For example, one may test the null hypothesis that individual curves are linear by testing the null hypothesis 'r = 1' against the alternative hypothesis that the curves are quadratic, 'r = 2'. Any function for which the value is determined at m points can be represented exactly by a polynomial of degree m − 1. Therefore, in the fixed occasion model with m measurement occasions, it makes no sense to consider polynomials of degree higher than m − 1.

Usually the data contain more information about inter-individual differences (corresponding to the fixed part) than about intra-individual differences (the random part). For some of the coefficients β_hi in (12.18), there will be empirical evidence that they are non-zero, but not that they are variable across individuals. Therefore, as in Section 5.2.2, one can have a fixed part that is more complicated than the random part and give random slopes only to the lower powers of (t − t0). This yields the model

  Y_ti = γ_00 + Σ_{h=1}^{r} γ_h0 (t − t0)^h + U_0i + Σ_{h=1}^{p} U_hi (t − t0)^h + R_ti ,    (12.19)

which has the same structure as (5.4). For h = p + 1, ..., r, parameter γ_h0 is the value of the coefficient β_hi, constant over all individuals i. For h = 1, ..., p, the coefficients β_hi are individual-dependent, with population average γ_h0 and individual deviations U_hi = β_hi − γ_h0. All variances and covariances of the random effects U_hi are estimated from the data.
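The structure of (12.19), a fixed polynomial of degree r with individual deviations only on the powers up to p ≤ r, can be sketched as follows. The helper function and the numbers fed into it are illustrative assumptions, not estimates from the text.

```python
def curve_value(t, t0, gamma, U):
    """Individual curve implied by (12.19), without the residual R_ti:
    a fixed polynomial of degree r = len(gamma) - 1 plus individual
    deviations U[0..p] (p = len(U) - 1 <= r) on the lower powers.
    gamma[h] plays the role of gamma_h0, U[h] of U_hi (illustrative)."""
    x = t - t0
    value = 0.0
    for h, g in enumerate(gamma):
        value += g * x ** h          # fixed part: sum_h gamma_h0 (t - t0)^h
    for h, u in enumerate(U):
        value += u * x ** h          # random part: sum_h U_hi (t - t0)^h
    return value

# An individual whose quadratic coefficient is fixed (p = 1 < r = 2):
gamma = [96.3, 5.5, 0.1]   # degree r = 2 in the fixed part
U = [2.0, -0.4]            # random intercept and one random slope (p = 1)
print(curve_value(7.0, 5.0, gamma, U))
```

At t = t0 only the intercept terms survive, so curve_value(t0, ...) returns gamma[0] + U[0], which is the sense in which the intercept 'refers to' the reference time.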
The mean curve for the population is given by

  E F_i(t) = γ_00 + Σ_{h=1}^{r} γ_h0 (t − t0)^h .    (12.20)

Numerical difficulties appear less often in the estimation of models of this kind when t0 has a value in the middle of the range of t values in the data set than when t0 is outside this range or at one of its extremes. Therefore, when convergence problems occur, it is advisable to try and work with a value of t0 close to the average or median value of t. Changing the value of t0 only amounts to a new parametrization, i.e., a different t0 leads to different parameters γ for which, however, formula (12.20) constitutes the same function, and the deviance of the model (given that the software will calculate it) also is the same.

The number of random slopes, p, is not larger than r and may be considerably smaller. To give a rough indication, r + 1 may not be larger than the total number of distinct time points, i.e., the number of different values of t in the entire data set for which observations exist; also it should not be larger than a small fraction, say, 10 percent, of the total number of observations Σ_i m_i. On the other hand, p rarely will be much larger than the maximum number of observations per individual, max_i m_i.²

Example 12.8 Polynomial growth model for children with retarded growth.
We continue the preceding example of children with retarded growth, for which length measurements are considered in the period between the ages of 5 and

² In the fixed occasion design, the number of random effects, including the random intercept, cannot be larger than the number of measurement occasions. In the variable occasion design this strict upper bound does not figure, because the variability of time points of observations can lead to richer information about intra-individual variations.
But if one obtains a model with clearly more random slopes than the maximum of all m_i, this may mean that one has made an unfortunate choice of the functions of time (polynomial or other) that constitute the random part at level two, and it may be advisable to try and find another, and smaller, set of functions of t for the random part with an equally good fit to the data.

10 years. In fitting polynomial models, convergence problems occurred when the reference age t0 was chosen as 5 years, but not when it was chosen as the midpoint of the range, 7.5 years. A cubic model, i.e., a model with polynomials of the third degree, turned out to yield a much better fit than the linear model of Table 12.7. Parameter estimates are shown in Table 12.8 for the cubic model with t0 = 7.5, so the intercept parameters refer to children of this age.

Table 12.8 Cubic growth model for 5-10-year-old children with retarded growth.

Fixed Effect        Coefficient   S.E.
Intercept             110.40      …
(t − t0)                5.25      …
(t − t0)²              −0.007     …
(t − t0)³               0.009     …

Random Effect: Variances                      Estimate   S.E.
Level-two (i.e., individual) random effects:
  τ0²  Intercept variance                       …         …
  τ1²  Slope variance (t − t0)                  2.7       …
  τ2²  Slope variance (t − t0)²                 0.28      …
  τ3²  Slope variance (t − t0)³                 0.085     0.009
Level-one (i.e., occasion) variance:
  σ²   Residual variance                        0.47      …

Deviance   6803.75

For the level-two random slopes (U_0i, U_1i, U_2i, U_3i) the estimated correlation matrix is

    1.0    0.17   0.27   0.08
    0.17   1.0    0.1…  −0.84
    0.27   0.1…   1.0   −0.38
    0.08  −0.84  −0.38   1.0

The fit is much better than that of the linear model (deviance difference 496.12 for 9 degrees of freedom). The random effect of the cubic term is significant: the model with fixed effects up to the power r = 3 and with p = 2 random slopes, not shown in the table, has deviance 6824.87, so the deviance difference for the random slope of (t − t0)³ is 21.12 with 4 degrees of freedom.

The mean curve for the population (cf. (12.20)) is given here by

  E F_i(t) = 110.40 + 5.25 (t − 7.5) − 0.007 (t − 7.5)²
+ 0.009 (t − 7.5)³ .

This deviates hardly, and not significantly, from a straight line. However, the individual growth curves do differ from straight lines, but the pattern of variation implied by the level-two covariance matrix is quite complex. We return to this data set in the next example.

Other functions

There is nothing sacred about polynomial functions. They are convenient, reasonably flexible, and any reasonably smooth function can be approximated by a polynomial if only you are prepared to use a polynomial of a sufficiently high degree. One argument for using other classes of functions is that some function shapes are approximated more parsimoniously by other functions than by polynomials. Another argument is that polynomials are wobbly: when the value of a polynomial function F(t) is changed a bit at one value of t, this may require coefficient changes that make the function change a lot at other values of t. In other words, the fitted value at any given value of t can depend strongly on observations for quite distant values of t. This kind of sensitivity often is undesirable.

One can use other functions instead of polynomials. If the functions used are called f_1(t), ..., f_p(t), then instead of (12.19) the random function is modeled as

  F_i(t) = β_0i + Σ_{h=1}^{p} β_hi f_h(t)    (12.21)

and the observations as

  Y_ti = γ_00 + Σ_{h=1}^{r} γ_h0 f_h(t) + U_0i + Σ_{h=1}^{p} U_hi f_h(t) + R_ti .    (12.22)

What is a suitable class of functions depends on the phenomenon under study and the data at hand. Random functions that can be represented by (12.22) are particularly convenient because this representation is a linear function of the parameters γ and the random effects U. Therefore, (12.22) defines an instance of the hierarchical linear model. Below we treat various classes of functions; the choice among them can be based on the deviance but also on the level-one residual variance σ², which indicates the size of the deviations of individual data points with respect to the fitted population of functions.
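The linear-in-parameters form of (12.21)-(12.22) can be made explicit in a few lines: given any list of basis functions f_h, an individual curve is just the intercept plus a weighted sum of the f_h(t), with weights γ_h0 + U_hi. The sketch below assumes, for simplicity, that every fixed coefficient also has a random slope (r = p); the two basis functions chosen are purely hypothetical, picked only to show that the construction is not tied to polynomials.

```python
import math

def random_function(gamma, U, basis):
    """Systematic part of (12.22) for one individual:
    F(t) = (gamma[0] + U[0]) + sum_h (gamma[h] + U[h]) * f_h(t),
    assuming r = p for simplicity. basis is a list of functions f_h;
    gamma and U have length len(basis) + 1 (leading entries = intercepts).
    Illustrative sketch, not a fitting routine."""
    def F(t):
        value = gamma[0] + U[0]
        for h, f in enumerate(basis, start=1):
            value += (gamma[h] + U[h]) * f(t)
        return value
    return F

# Hypothetical basis: a linear term plus a damped oscillation.
basis = [lambda t: t - 5.0, lambda t: math.exp(-t) * math.sin(t)]
F_i = random_function([100.0, 5.0, 2.0], [1.5, -0.3, 0.0], basis)
print(F_i(5.0))
```

Because F depends linearly on γ and U, swapping polynomials for splines (or anything else) changes only the basis list, not the estimation machinery: this is why (12.22) stays within the hierarchical linear model.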
Sometimes, however, theory or data point to function classes where the statistical parameters enter in a non-linear fashion. In this chapter we restrict attention to linear models, i.e., models that can be represented as (12.22). Readers who are interested in non-linear models are referred to more specialized literature such as Davidian and Giltinan (1995) and Hand and Crowder (1996, Chapter 8).

Piecewise linear functions

A class of functions which is flexible, easy to comprehend, and for which the fitted function values have a very restricted sensitivity to observations made at other values of t, is the class of continuous piecewise linear functions. These are continuous functions whose slopes may change discontinuously at a number of values of t called nodes, but which are linear (and hence have constant slopes) between these nodes. A disadvantage is their angular appearance.

The basic piecewise linear function is linear on a given interval (t1, t2) and constant outside this interval, as defined by

  f(t) = t1 for t ≤ t1;  t for t1 < t < t2;  t2 for t ≥ t2 .    (12.23)

Spline functions

A more flexible class is the class of spline functions. A spline of degree p is a function that is a polynomial of degree p on each interval between adjacent nodes, and of which the function value and the first p − 1 derivatives are continuous everywhere, also at the nodes. For p > 1 this yields functions which are smooth and do not have the kinky appearance of piecewise linear functions. An example of a quadratic spline (i.e., a spline with p = 2) was given in Chapter 8 by equation (8.3) and Figure 8.1. This equation and figure represent a function of IQ which is a quadratic for IQ ≤ 0 and also for IQ ≥ 0, but which has different coefficients for these two domains. The point 0 here is the node. The coefficients are such that the function and its derivative are continuous also in the node, as can be seen from the graph. Therefore it is a spline. Cubic splines (p = 3) also are often used. We present here only a sketchy introduction to the use of splines. For a more elaborate introduction to the use of spline functions in (single-level) regression models, see Seber and Wild (1989, Section 9.5).

Suppose that one is investigating the development of some characteristic over the age of 12 to 17 years.
Within each of the intervals 12 to 15 years and 15 to 17 years, the development curves might be approximately quadratic (this could be checked by a polynomial trend analysis for the data for these intervals separately), while they are smooth but not quadratic over the entire range from 12 to 17. In such a case it would be worthwhile to try a quadratic spline (p = 2) with one node, at 15 years. Defining t1 = 15, the basic functions can be taken as

  f1(t) = t                                            (linear function)
  f2(t) = t² for t ≤ t1;  t1 (2t − t1) for t > t1      (quadratic left of t1)    (12.24)
  f3(t) = 0 for t ≤ t1;  (t − t1)² for t > t1          (quadratic right of t1) .

The functions f2 and f3 are quadratic functions left of t1 and right of t1, respectively, and they are continuous and have continuous derivatives. That the functions are continuous and have continuous derivatives even in the node can be verified by elementary calculus or by drawing a figure. The individual development functions are modeled as

  F_i(t) = β_0i + β_1i f1(t) + β_2i f2(t) + β_3i f3(t) .    (12.25)

If β_2i = β_3i, the curve for individual i is exactly quadratic. The freedom to have these two coefficients differ from each other makes it possible to represent functions that look very different from quadratic functions; e.g., if these coefficients have opposite signs then the function will be concave on one side of t1 and convex on the other side. Equation (8.3) and Figure 8.1 provide an example of exactly such a function, where t is replaced by SES and the node is the point SES = 0.

The treatment of this model within the hierarchical linear model approach is completely analogous to the treatment of polynomial models. The functions f1, f2, and f3 constitute the fixed part of the model as well as the random part of the model at level two. If there is no evidence for individual differences with respect to the coefficients β_2i and/or β_3i, then these could be deleted from the random part. Formula (12.25) shows that a quadratic spline with one node has one parameter more than a quadratic function (4 instead of 3).
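The continuity claims for (12.24) can also be checked numerically rather than by calculus. A minimal sketch (node at 15, step sizes chosen ad hoc) compares values and two-sided numerical derivatives just left and just right of the node:

```python
def f2(t, t1=15.0):
    # quadratic left of the node, linear right of it, cf. (12.24)
    return t * t if t <= t1 else t1 * (2.0 * t - t1)

def f3(t, t1=15.0):
    # zero left of the node, quadratic right of it
    return 0.0 if t <= t1 else (t - t1) ** 2

def deriv(f, t, h=1e-6):
    # two-sided numerical derivative; adequate for this smoothness check
    return (f(t + h) - f(t - h)) / (2.0 * h)

t1 = 15.0
# function values agree across the node ...
print(abs(f2(t1 - 1e-9) - f2(t1 + 1e-9)) < 1e-6)
# ... and so do the slopes just left and just right of it
print(abs(deriv(f2, t1 - 1e-3) - deriv(f2, t1 + 1e-3)) < 1e-2)
print(abs(deriv(f3, t1 - 1e-3) - deriv(f3, t1 + 1e-3)) < 1e-2)
```

All three checks print True: f2 has slope 2t ≈ 30 on the left and constant slope 2·t1 = 30 on the right, and f3 starts from slope 0 on both sides, which is exactly what 'continuous with continuous derivative in the node' requires.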
Each node added will further increase the number of parameters of the function by one. There is considerable freedom of choice in defining the basic functions, subject to the restriction that they are quadratic on each interval between adjacent nodes, and are continuous with continuous derivatives in the nodes. For two nodes, t1 and t2, a possible choice is the following. This representation employs a reference value t0 that is an arbitrary (convenient or meaningful) value. It is advisable to use a t0 within the range of observation times in the data. The basic functions are

  f1(t) = t − t0                                       (linear function)
  f2(t) = (t − t0)²                                    (quadratic function)
  f3(t) = (t − t1)² for t ≤ t1;  0 for t > t1          (quadratic left of t1)    (12.26)
  f4(t) = 0 for t ≤ t2;  (t − t2)² for t > t2          (quadratic right of t2) .

When these four functions are used in the representation

  F_i(t) = β_0i + β_1i f1(t) + β_2i f2(t) + β_3i f3(t) + β_4i f4(t) ,

coefficient β_2i is the quadratic coefficient in the interval between t1 and t2, while β_3i and β_4i are the changes in the quadratic coefficient that occur when time t passes the nodes t1 or t2, respectively. The quadratic coefficient for t < t1 is β_2i + β_3i, and for t > t2 it is β_2i + β_4i.

The simplest cubic spline (p = 3) has one node. If the reference point t0 is equal to the node, then the basic functions are

  f1(t) = t − t0                                       (linear function)
  f2(t) = (t − t0)²                                    (quadratic function)
  f3(t) = (t − t0)³ for t ≤ t0;  0 for t > t0          (cubic left of t0)        (12.27)
  f4(t) = 0 for t ≤ t0;  (t − t0)³ for t > t0          (cubic right of t0) .

For more than two nodes, and an arbitrary degree p of the polynomials, the basic spline functions may be chosen as follows, for nodes denoted by t1, ..., tK:

  f_h(t) = (t − t0)^h        (h = 1, ..., p)
  f_{p+k}(t) = 0 for t ≤ t_k;  (t − t_k)^p for t > t_k    (k = 1, ..., K) .    (12.28)

The choice of nodes is important to obtain a good fitting approximation. Since the spline functions are non-linear functions of the nodes, formal optimization of the node placement is more complicated than fitting spline functions with given nodes, and the interested reader is referred to the literature on spline functions for the further treatment of node placement.
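The truncated-power construction (12.28) is short enough to implement directly. The sketch below uses the right-truncated pieces of (12.28); the left-truncated variants appearing in (12.26) and (12.27) are entirely analogous. The particular degree, nodes, and reference value are illustrative choices, not values from the examples.

```python
def spline_basis(p, nodes, t0):
    """Basis functions of (12.28): the powers (t - t0)^h for h = 1..p,
    plus one truncated power (t - t_k)^p per node (zero left of the node).
    Returns a list of functions of t. Sketch of the construction only."""
    fs = [lambda t, h=h: (t - t0) ** h for h in range(1, p + 1)]
    for tk in nodes:
        fs.append(lambda t, tk=tk: 0.0 if t <= tk else (t - tk) ** p)
    return fs

# quadratic spline basis (p = 2) with nodes at 13 and 15, reference age 14
fs = spline_basis(2, [13.0, 15.0], 14.0)
print([round(f(16.0), 2) for f in fs])   # → [2.0, 4.0, 9.0, 1.0]
```

Note the default-argument trick (`h=h`, `tk=tk`) in the lambdas: without it all closures would share the loop variable, a classic Python pitfall.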
If the individuals provide enough observations (i.e., m_i is large enough), plotting observations for individuals, together with some trial and error, can lead to a good placement of the nodes.

Example 12.10 Cubic spline model for retarded growth, 12-17 years.
In this example we again consider the length measurements of children with retarded growth studied by Rekers-Mombarg et al. (1997), but now the focus is on the age range from 12 to 17 years. After deletion of cases with missing data, in this age range there are a total of 1,941 measurements, which were taken for a sample of 321 children.

Some preliminary model fits showed that quadratic splines provide a better fit than piecewise linear functions, and cubic splines fit even better. A reasonable model is obtained by having one node at the age of 15.0 years. The basic functions accordingly are defined by (12.27) with t0 = 15.0. All these functions are 0 for t = 15 years, so the intercept refers to length at this age. The parameter estimates are in Table 12.10.

Table 12.10 Cubic spline growth model for 12-17-year-old children with retarded growth.

Fixed Effect            Coefficient   S.E.
Intercept                 150.00      …
f1 = t − 15                 6.43      0.18
f2 = (t − 15)²              0.125     …
f3 = (t − 15)³₋             0.038     …
f4 = (t − 15)³₊            −0.529     …

Random Effect: Variances                      Estimate   S.E.
Level-two (i.e., individual) random effects:
  Intercept variance                            …         …
  Slope variance f1                             …         …
  Slope variance f2                             …         …
  Slope variance f3                             …         …
  Slope variance f4                             …         …
Level-one (i.e., occasion) variance:
  σ²   Residual variance                        0.29      …

Deviance   …

The correlation matrix of the level-two random effects (U_0i, ..., U_4i) is estimated as

     1.0    0.26  −0.31   0.32   0.01
     0.26   1.0    0.45  −0.08  −0.82
    −0.31   0.45   1.0   −0.89  −0.71
     0.32  −0.08  −0.89   1.0    0.40
     0.01  −0.82  −0.71   0.40   1.0

Testing the random effects (not reported here) shows that the functions f1 to f4 all have significant random effects. The level-one residual standard deviation is only σ̂ = √0.29 = 0.54 cm, which demonstrates that this family of functions fits rather closely to the length measurements. Notation is made more transparent by defining

  (t − 15)³₋ = (t − 15)³ for t < 15;  0 for t ≥ 15 ,
  (t − 15)³₊ = 0 for t < 15;  (t − 15)³ for t ≥ 15 .

Thus, we denote f3(t) = (t − 15)³₋ and f4(t) = (t − 15)³₊.
The mean length curve can be obtained by filling in the estimated fixed coefficients, which yields

  E F_i(t) = 150.00 + 6.43 (t − 15) + 0.125 (t − 15)² + 0.038 (t − 15)³₋ − 0.529 (t − 15)³₊ .

This function can be differentiated to yield the mean growth rate. Note that d f3(t)/dt = 3 (t − 15)²₋ and d f4(t)/dt = 3 (t − 15)²₊. This implies that the mean growth rate is estimated as

  mean growth rate = 6.43 + 0.25 (t − 15) + 0.114 (t − 15)²₋ − 1.587 (t − 15)²₊ .

For example, for the minimum of the age range considered, t = 12 years, this is 6.43 − (0.25 × 3) + (0.114 × 9) + 0 = 6.71 cm/year. This is a little bit larger than the mean growth rate found in the preceding examples for ages from 5 to 10 years. For the maximum age in this range, t = 17 years, on the other hand, the average growth rate is 6.43 + (0.25 × 2) + 0 − (1.587 × 4) = 0.58 cm/year, indicating that growth has almost stopped at this age.

The results of this model are illustrated more clearly by a graph. Figure 12.1 presents the average growth curve and a sample of 15 random curves from the population defined by Table 12.10 and the given correlation matrix. The average growth curve does not deviate noticeably from a linear curve for ages below 16 years. It levels off after 16 years. Some of the randomly drawn growth curves are decreasing in the upper part of the range. This is an obvious impossibility for the real growth curve, and indicates that the model is not completely satisfactory for the upper part of this age range. This may be related to the fact that the number of measurements is rather low at ages over 16 years.

[Figure 12.1 here; vertical axis: length (cm), 120 to 170; horizontal axis: t (age in years).]
Figure 12.1 Average growth curve and 15 random growth curves for 12-17-year-olds for the cubic spline model.

12.2.3 Explaining the functions

Whether one has been using polynomial, spline, or other functions, the random function model can be represented by (12.21), repeated here:

  F_i(t) = β_0i + Σ_{h=1}^{p} β_hi f_h(t) .    (12.21)

The same can now be done as in Chapter 5 (see p.
78): the individual-dependent coefficients β_hi can be explained with individual-level (i.e., level-two) variables. Suppose there are q individual-level variables, and denote these variables by Z_1 to Z_q. The inter-individual model to explain the coefficients β_0i to β_pi then is

  β_hi = γ_h0 + γ_h1 z_1i + ... + γ_hq z_qi + U_hi .    (12.29)

Substitution of (12.29) in (12.21) yields

  Y_ti = γ_00 + Σ_{h=1}^{p} γ_h0 f_h(t) + Σ_{k=1}^{q} γ_0k z_ki + Σ_{h=1}^{p} Σ_{k=1}^{q} γ_hk z_ki f_h(t)
         + U_0i + Σ_{h=1}^{p} U_hi f_h(t) + R_ti .    (12.30)

We see that cross-level interactions here are interactions between individual-dependent variables and functions of time. The same approach can be followed as in Chapter 5. The reader may note that this model selection approach, where first a random function model is constructed and then the individual-based coefficients are approached as dependent variables in regression models at level two, is just what was described in Section 6.4.1 as 'working upward from level one'.

Example 12.11 Explaining growth by gender and length of parents.
We continue the analysis of the retarded growth data of the 12-17-year-olds, and now include the child's gender and the mean length of the child's parents. Leaving out children with a missing value for mother's or father's length left 321 children with a total of 1,941 measurements. Gender is coded as +1 for girls and −1 for boys, so that the other parameters give values which are averages over the sexes. Parents' length is defined as the average length of father and mother minus 165 cm (this value is approximately the mean of the lengths of the parents). For parents' length, the main effect and the interaction effect with f1 (age minus 15) were included. For gender, the main effect was included and also the interactions with f1 and f2. These choices were made on the basis of preliminary model fits.
The resulting parameter estimates are in Table 12.11.

Table 12.11 Growth variability of 12-17-year-old children explained by gender and parents' length.

Fixed Effect                  Coefficient   S.E.
Intercept                       150.0…      …
f1 = t − 15                       6.4…      …
f2 = (t − 15)²                    0.1…      …
f3 = (t − 15)³₋                   0.0…      …
f4 = (t − 15)³₊                  −0.5…      …
Gender                            …         …
Gender × f1                      −1.266     …
Gender × f2                      −0.362     …
Parents' length                   0.268     …
Parents' length × f1              0.030     …

Random Effect: Variances                      Estimate   S.E.
Level-two (i.e., individual) random effects:
  Intercept variance                            …         …
  Slope variance f1                             …         …
  Slope variance f2                             …         …
  Slope variance f3                             0.132     0.020
  Slope variance f4                             …         …
Level-one (i.e., occasion) variance:
  σ²   Residual variance                        0.29      …

Deviance   …

The correlation matrix of the level-two random effects (U_0i, ..., U_4i) is estimated as

     1.0    0.22  −0.38   0.35   0.07
     0.22   1.0    0.38  −0.05  −0.81
    −0.38   0.38   1.0   −0.91  −0.75
     0.35  −0.05  −0.91   1.0    0.48
     0.07  −0.81  −0.75   0.48   1.0

The fixed effect of gender is not significant, which means that at 15 years there is not a significant difference in length between boys and girls in this population with retarded growth. However, the interaction effects of gender with the age functions f1 and f2 show that girls and boys do grow in different patterns during adolescence. Ignoring the non-significant main effect, the coding of gender implies that the average length difference between girls and boys is given by

  2 (−1.266 f1(t) − 0.362 f2(t)) = −2.532 (t − 15) − 0.724 (t − 15)² .

The girl-boy difference in average growth rate, which is the derivative of this function, is equal to −2.532 − 1.448 (t − 15). This shows that from about the age of 13 years (more precisely, for t > 13.25), girls grow on average slower than boys, while they grow faster before this age.

Parents' length has a strong main effect: for each cm extra length of the parents, the children are on average 0.268 cm longer at 15 years of age. Moreover, for every cm extra of the parents, on average the children grow faster by 0.03 cm/year.

The level-two variances have decreased compared to Table 12.10. The residual variance at level one remained the same, which is natural since the included effects explain differences between curves and do not yield better fitting curves.
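The crossing age of 13.25 years quoted in the example follows directly from setting the girl-boy growth-rate difference to zero; a small check, using only the interaction coefficients reported above:

```python
def girl_boy_difference(t):
    """Average girl minus boy length difference implied by the gender-by-age
    interactions (gender coded +1 for girls, -1 for boys, so the difference
    is twice the interaction coefficients):
    2 * (-1.266 (t - 15) - 0.362 (t - 15)^2)."""
    return -2.532 * (t - 15.0) - 0.724 * (t - 15.0) ** 2

def difference_growth_rate(t):
    # derivative of the difference curve
    return -2.532 - 1.448 * (t - 15.0)

# age at which the girl-boy growth-rate difference changes sign:
t_cross = 15.0 - 2.532 / 1.448
print(round(t_cross, 2))   # → 13.25
```

Before t_cross the derivative is positive (girls gaining on boys), after it negative, matching the statement in the example; at t = 15 the length difference itself is zero, consistent with the non-significant gender main effect.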
The deviance went down by 113.88 points (d.f. = 5, p < 0.0001).

12.2.4 Changing covariates

Individual-level variables like the Z_k of the preceding section are referred to as constant covariates because they are constant over time. It is also possible that changing covariates are available, like changing social or economic circumstances, performance on tests, mood variables, etc. Fixed effects of such variables can be added to the model without problems, but the neat forward model selection approach is disturbed because a changing covariate normally will not be a linear combination of the functions f_h in (12.21). Depending on, e.g., the primacy of the changing covariates in the research question, one can employ the changing covariates in the fixed part right from the start (i.e., add this fixed effect to (12.19) or (12.22)), or incorporate them in the model at the point where also the constant covariates are considered (add the fixed effect to (12.30)).

Example 12.12 Cortisol levels in infants.
This example reanalyses data collected by De Weerth (1998, Chapter 3) in a study about stress in infants. In this example the focus is not on discovering the shape of the development curves, but on testing a hypothesized effect, while taking into account the longitudinal design of the study. Because of the complexity of the data analysis, we shall combine techniques described in various of the preceding chapters.
The purpose of the study was to investigate experimentally the effect of a stressful event on experienced stress as measured by the cortisol hormone, and whether this effect is stronger in certain hypothesized 'regression weeks'. The experimental subjects were 18 normal infants with gestational ages ranging from 17 to 37 weeks. Gestational age is age counted from the due day on which the baby should have been born after a complete pregnancy. Each infant was visited repeatedly, the number of visits per infant ranging from 8 to 13, with a total of 222 visits.
The infants were divided randomly into an experimental group (14 subjects) and a control group (4 subjects). Cortisol was measured from saliva samples. The researcher collected a saliva sample at the start of the session; then a play session between the mother and the infant followed. For the experimental group, during this session the mother suddenly put the infant down and left the room; in the control group, the mother stayed with the child. After the session, a saliva sample was taken again. This provided a pretest and a post-test measure of the infant's cortisol level. In this example we consider only the effect of the experiment, not the effect of the 'regression weeks'. Further information about the study and about the underlying theories can be found in Chapter 3 of De Weerth (1998).
The post-test cortisol level was the dependent variable. Explanatory variables were the pretest cortisol level, the group (coded Z = 1 for infants in the experimental group and Z = 0 for the control group), and gestational age. Gestational age is the time variable. In order to avoid very small coefficients, the unit is chosen as 10 weeks, so that t varies between 1.7 and 3.7. A preliminary inspection of the joint distribution of pretest and post-test showed that cortisol levels had a rather skewed distribution (as is usual for cortisol measurements). This skewness was reduced satisfactorily by using the square root of the cortisol level, both for the pretest and for the post-test. These variables are denoted X and Y, respectively. The pretest variable, X, is a changing covariate.
The law of initial value proposed by Wilder in 1956 implies that an inverse relation is expected between the basal cortisol level and the subsequent response to a stressor (cf. De Weerth, 1998, p. 41).
An infant with a low basal level of cortisol accordingly will react to a stressor by a cortisol increase, whereas an infant with a high basal cortisol level will react to a stressor by a cortisol decrease. The basal cortisol level is measured here by the pretest. The average of X (the square root of the pretest cortisol value) was 2.80. This implies that infants with a value of X below about 2.80 are expected to react to stress by a relatively high value of Y, whereas infants with a value of X above about 2.80 are expected to react to stress by a relatively low value of Y. The play session itself may also lead to a change in cortisol value, so there may be a systematic difference between X and Y. Therefore, the stress reaction was investigated by testing whether the experimental group has a more strongly negative regression coefficient of Y on X - 2.80 than the control group; in other words, the research hypothesis is that there is a negative interaction between Z and X - 2.80 in their effect on Y.
It appeared that a reasonable first model for the square root of the post-test cortisol level, if the difference between the experimental and control group is not yet taken into account, is the model where the changing covariate, defined as X - 2.80, has a fixed effect, while time (i.e., gestational age) has a linear fixed as well as random effect:

   Y_ti = γ_00 + γ_10 t + γ_20 (x_ti - 2.80) + U_0i + U_1i t + R_ti .

The results are given as Model 1 in Table 12.12. This model can be regarded as the null hypothesis model, against which we shall test the hypothesis of the stressor effect.
The random effect of gestational age is significant. (The model without this effect, not shown in the table, has deviance 289.46, so that the deviance comparison yields χ² = 8.89, d.f. = 2, p < 0.01; this test uses the halved p-value from the chi-squared distribution, see Section 6.2.1.) The pretest has a clearly significant effect.

Table 12.12 Estimates for two models for cortisol data. [Most entries of this table are illegible in the scan. Model 1 contains the fixed effects of gestational age and of X - 2.80, with a random intercept and a random slope of gestational age; its deviance is 280.57. Model 2 adds the main effect of the experimental group Z and the interaction Z × (X - 2.80); its deviance is 276.42. The accompanying discussion of the Model 2 estimates is also largely illegible.]

For checking the fit, the standardized level-two residuals were computed; the largest value was S² = 34.76 (d.f. = 13, p = 0.0008). Even with a Bonferroni correction for the 18 infants this is a small probability (18 × 0.0008 = 0.014), which pointed to a model misspecification. Subsequent data analysis showed that having been asleep (represented by a dummy variable equal to 1 if the infant had been asleep in the half hour preceding the session) had an effect both in the fixed part and on the residual variance at level one (see Chapter 8). Including these two effects led to the estimates in Table 12.13.

Table 12.13 Estimates for a model controlling for having been asleep (Model 3). [Most entries of this table are illegible in the scan; the model extends Model 2 with a fixed effect of sleeping and a sleeping effect in the level-one variance function; its deviance is 251.31.]

(Comparing the deviances of Models 2 and 3 yields χ² = 25.11, d.f. = 2, p < 0.0001.) The estimated level-one variance is 0.130 for children who had not slept and (using formula (8.1)) 0.362 for children who had slept. But the stressor effect (the Z × (X - 2.80) interaction) now does not reach significance.
The most significant standardized level-two residual, defined as in (9.9), for this model obtains for the fourth child (j = 4), with p = 0.0049. Although this is a small p-value, the Bonferroni correction now leads to a significance probability of 18 × 0.0049 = 0.088, which is not alarmingly low. The fit of this model therefore seems satisfactory, and the estimates do not support the hypothesized stressor effect. However, it should be remarked that the estimated stressor effect does have the predicted negative sign, and the number of infants is so small that the power may have been low. It can be concluded that it is very important to control for the infant having slept shortly before the play session; in this data set this makes the difference between a significant and a non-significant effect. Having slept not only leads to a higher post-test cortisol value, controlling for the pretest value, but also to a higher residual variance at the occasion level.

12.3 Autocorrelated residuals

Of the relevant extensions of the hierarchical linear model, we briefly want to mention the hierarchical linear model with autocorrelated residuals. In the fixed occasion design the assumption that the level-one residuals R_ti are independent can be relaxed and replaced by the assumption of first-order autocorrelation,

   R_ti = ρ R_{t-1,i} + √(1 - ρ²) R'_ti ,   (12.31)

where the variables R'_ti are independent identically distributed variables. The parameter ρ is called the autocorrelation coefficient. This model represents a type of dependence between adjacent observations which quickly dies out between observations that are further apart. The correlation between the residuals is

   ρ(R_ti, R_si) = ρ^|t-s| .   (12.32)

For variable occasion designs, level-one residuals with the correlation structure (12.32) also are called autocorrelated residuals, although they cannot be constructed by the relations (12.31). These models are discussed by Diggle, Liang, and Zeger (1994, Section 5.2), Goldstein, Healy, and Rasbash (1994), and Hedeker and Gibbons (1996b).
The extent to which over-time dependence can be modeled by random slopes, or rather by autocorrelated residuals, or a combination of both, depends on the phenomenon being modeled. This issue usually will have to be decided empirically; the multilevel computer programs MIXREG (see Hedeker and Gibbons, 1996b), MLwiN (see Yang et al., 1998), and HLM version 5, as well as the general statistical programs SAS (specifically, Proc Mixed) and BMDP (specifically, module 5V) can be used for this purpose. These programs are discussed in Chapter 15.

13 Multivariate Multilevel Models

This chapter is devoted to the multivariate version of the hierarchical linear model treated in Chapters 4 and 5. The term 'multivariate' refers here to the dependent variable: there are assumed to be two or more dependent variables. The model of this chapter is a three-level model, with variables as level-one units, individuals as level-two units, and groups as level-three units. It may also be regarded as an extension of the fully multivariate model, treated in Section 12.1.3, to the case where individuals are nested in groups. For the corresponding 'two-level' model the reader is referred to Section 12.1.3. In the present chapter, the dependent variables are not necessarily supposed to be longitudinal measurements of the same variable (although they could be).
The nesting structure of Chapter 12, measurements nested within individuals, can be combined with the nesting structure considered earlier, individuals within groups. This leads to three levels: measurements within individuals within groups.
As an example of multivariate multilevel data, think of pupils (level-two units) i in classes (level-three units) j, with m variables, Y_1 to Y_m, being measured for each pupil if data are complete. The measurements are the level-one units and could refer to, e.g.,
test scores in various scholastic subjects, but also variables measured on completely unrelated and incommensurable scales such as attitude measurements. The dependent variable is denoted Y_hij: the measurement on the h'th variable for individual i in group j. It is not necessary that for each individual i in each group j an observation of each of the m variables is available. (It must be assumed, however, that missingness is at random, i.e., the fact that a measurement is not available is not related to the value that this measurement would have taken, if it had been observed; if a model with explanatory variables is considered, then the availability of a measurement should be unrelated to its residual.) Just like in Chapter 12, the complete data vector can be defined as the vector of data for the individual, possibly hypothetical, who does have observations on all variables. This complete data vector is denoted by (Y_1ij, ..., Y_mij).
It is possible to analyse all m dependent variables separately. However, there are several reasons why it may be sensible to analyse the data jointly, i.e., as multivariate data.
1. Conclusions can be drawn about the correlations between the dependent variables, notably, the extent to which the correlations depend on the individual and on the group level. Such conclusions follow from the partitioning of the covariances between the dependent variables over the levels of analysis.
2. The tests of specific effects for single dependent variables are more powerful in the multivariate analysis. This will be visible in the form of smaller standard errors. The additional power is negligible if the dependent variables are only weakly correlated, but may be considerable if the dependent variables are strongly correlated while at the same time the data are very incomplete, i.e., the average number of measurements available per individual is considerably less than m.
3. Testing whether the effect of an explanatory variable on dependent variable Y_1 is larger than its effect on Y_2, when the data on Y_1 and Y_2 were observed (totally or partially) on the same individuals, is possible only by means of a multivariate analysis.
4. If one wishes to carry out a single test of the joint effect of an explanatory variable on several dependent variables, then also a multivariate analysis is required. Such a single test can be useful, e.g., to avoid the danger of chance capitalization which is inherent to carrying out a separate test for each dependent variable.
A multivariate analysis is more complicated than separate analyses for each dependent variable. Therefore, when one wishes to analyse several dependent variables, the greater complexity of the multivariate analysis will have to be balanced against the reasons listed above. Often it is advisable to start by analysing the data for each dependent variable separately.

13.1 The multivariate random intercept model

Suppose there are covariates X_1 to X_p, which may be individual-dependent or group-dependent. The random intercept model for dependent variable Y_h is expressed by the formula

   Y_hij = γ_0h + γ_1h x_1ij + γ_2h x_2ij + ... + γ_ph x_pij + U_hj + R_hij .   (13.1)

In words: for the h'th dependent variable, the intercept is γ_0h, the regression coefficient on X_1 is γ_1h, the coefficient on X_2 is γ_2h, etc., the random part of the intercept in group j is U_hj, and the residual is R_hij. This is just a random intercept model like (4.5) and (4.8). Since the variables Y_1 to Y_m are measured on the same individuals, however, their dependence can be taken into account.
In other words, the U's and R's are regarded as components of vectors,

   R_ij = (R_1ij, ..., R_mij)   and   U_j = (U_1j, ..., U_mj) .

Instead of residual variances at levels one and two, there are now residual covariance matrices,

   Σ = cov(R_ij)   and   T = cov(U_j) .

The covariance matrix of the complete observations, conditional on the explanatory variables, is the sum of these,

   var(Y_ij) = Σ + T .   (13.2)

To represent the multivariate data in the multilevel approach, three nesting levels are used. The first level is that of the dependent variables indexed by h = 1, ..., m, the second level is that of the individuals i = 1, ..., n_j, and the third level is that of the groups, j = 1, ..., N. So each measurement of a dependent variable on some individual is represented by a separate line in the data matrix, containing the values of i, j, h, Y_hij, and those of the explanatory variables.
The multivariate model is formulated as a hierarchical linear model using the same trick as in Section 12.1.3. Dummy variables d_1 to d_m are used to indicate the dependent variable, just like in formula (12.2). Dummy variable d_h is 1 or 0, depending on whether the data line refers to dependent variable Y_h or to one of the other dependent variables. Formally, this is expressed by

   d_hs = 1 if s = h ,   d_hs = 0 if s ≠ h .   (13.3)

With these dummies, the random intercept models (13.1) for the m dependent variables can be integrated into one three-level hierarchical linear model by the expression

   Y_hij = Σ_s γ_0s d_hs + Σ_s Σ_k γ_ks d_hs x_kij + Σ_s U_sj d_hs + Σ_s R_sij d_hs .   (13.4)

All variables (including the constant!) are multiplied by the dummy variables. Note that the definition of the dummy variables implies that in the sums over s = 1, ..., m only the term for s = h gives a contribution and all other terms disappear.
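In practice, the dummy-variable formulation (13.4) corresponds to a simple restructuring of the data matrix: each individual contributes one data line per observed dependent variable. A toy sketch (invented numbers; the column names are illustrative, not a software convention):

```python
# Restructure "wide" records (one row per pupil, m = 2 test scores) into the
# "long" level-one lines used by the three-level formulation (13.4).

wide = [  # (school j, pupil i, language score, arithmetic score, IQ)
    (1, 1, 35.0, 20.0, 2.1),
    (1, 2, 28.0, None, -0.4),  # arithmetic score missing for this pupil
]

long_rows = []
for j, i, y_lang, y_arith, iq in wide:
    for h, y in enumerate((y_lang, y_arith), start=1):
        if y is None:
            continue  # a missing measurement simply yields no data line
        d1, d2 = (1, 0) if h == 1 else (0, 1)
        # explanatory variables enter multiplied by the dummies, as in (13.4):
        long_rows.append({"school": j, "pupil": i, "var": h, "y": y,
                          "d1": d1, "d2": d2,
                          "d1_x_iq": d1 * iq, "d2_x_iq": d2 * iq})

print(len(long_rows))  # → 3  (four potential lines, one dropped as missing)
```

This is also how incomplete data are handled: the pupil with a missing arithmetic score still contributes a language line, so no case needs to be deleted.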
So this formula is just a complicated way of rewriting formula (13.1). The purpose of this formula is that it can be used to obtain a multivariate hierarchical linear model. The variable-dependent random residuals R_sij in this formula are random slopes at level two of the dummy variables, R_sij being the random slope of d_s, and the random intercepts U_sj become the random slopes at level three of the dummy variables. There is no random part at level one.
This model can be further specified, e.g., by omitting some of the variables X_k from the explanation of some of the dependent variables Y_h. This amounts to dropping some of the terms γ_ks d_hs x_kij from (13.4). Another possibility is to include variable-specific covariates, in analogy to the changing covariates of Section 12.2.4. An example of this is a study of school pupils' performance on several subjects (these are the multiple dependent variables), using the pupils' motivation for each of the subjects separately as explanatory variables.

The multivariate empty model
The multivariate empty model is the multivariate model without explanatory variables. For m = 2 variables, this is just the model of Section 3.6.1. Writing out formulae (13.1) and (13.4) without explanatory variables (i.e., with p = 0) yields the following formula, which is the specification of the three-level multivariate empty model:

   Y_hij = γ_0h + U_hj + R_hij = Σ_s γ_0s d_hs + Σ_s U_sj d_hs + Σ_s R_sij d_hs .   (13.5)

This empty model can be used to decompose the raw variances and covariances over the two levels. When referring to the multivariate empty model, the covariance matrix Σ = cov(R_ij) may be called the within-group covariance matrix while T = cov(U_j) may be called the between-group covariance matrix. In the terminology of Chapter 3, these are population (and not observed) covariance matrices. In Chapter 3 we saw that, if the group sizes n_j all are equal to n, then the covariance matrix between the group means is

   T + (1/n) Σ   (13.6)

(cf. Equation (3.9)).
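Given estimates of Σ and T, the implied correlations at the two levels, the total correlation following from (13.2), and the correlation between group means following from (13.6) can be computed directly. A sketch with illustrative numbers (chosen to be of the same order as in the example below):

```python
import math

# Illustrative two-variable components: T is the between-group covariance
# matrix, Sigma the within-group one (entries stored by index pair).
tau = {(1, 1): 19.00, (2, 2): 12.76, (1, 2): 14.54}
sigma = {(1, 1): 65.69, (2, 2): 32.25, (1, 2): 28.62}

# Correlations of the group-level and individual-level random parts:
rho_group = tau[1, 2] / math.sqrt(tau[1, 1] * tau[2, 2])
rho_within = sigma[1, 2] / math.sqrt(sigma[1, 1] * sigma[2, 2])

# Total correlation between observations, since var(Y_ij) = Sigma + T:
rho_total = (tau[1, 2] + sigma[1, 2]) / math.sqrt(
    (tau[1, 1] + sigma[1, 1]) * (tau[2, 2] + sigma[2, 2]))

def rho_means(n):
    """Correlation between group means for groups of size n, from T + Sigma/n."""
    return (tau[1, 2] + sigma[1, 2] / n) / math.sqrt(
        (tau[1, 1] + sigma[1, 1] / n) * (tau[2, 2] + sigma[2, 2] / n))

print(round(rho_group, 2), round(rho_within, 2),
      round(rho_total, 2), round(rho_means(30), 2))
```

With components like these, the group-level correlation is much higher than the within-group one, the pattern discussed in the example that follows.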
Section 3.6.1 presented a relatively simple way to estimate the within- and between-group correlations from the intraclass coefficients and the observed total and within-group correlations. Another way is to fit the multivariate empty model (13.5) using multilevel software by the ML or the REML method. For large sample sizes these methods will provide virtually the same results, but for small sample sizes the ML and REML methods will provide more precise estimators. The gain in precision will be especially large if the correlations are high while there are relatively many missing data (i.e., there are many individuals who provide fewer than m dependent variables, but missingness still is at random).

Example 13.1 Language and arithmetic scores in elementary schools.
The example used throughout Chapters 4 and 5 is elaborated by analysing not only the scores on the language test but also those on the arithmetic test. So there are m = 2 dependent variables.

Table 13.1 Parameter estimates for multivariate empty model. [Several entries of this table are illegible in the scan. The random part contains the between-schools covariance matrix, with variances of about 19.00 (language) and 12.76 (arithmetic) and covariance τ_12 = cov(U_1j, U_2j) = 14.54, and the within-schools covariance matrix, with variances of about 65.69 and 32.25 and covariance σ_12 = cov(R_1ij, R_2ij) = 28.62.]

First the multivariate empty model, represented by (13.5), is fitted. The results are in Table 13.1. The results for the language test may be compared with the univariate empty model for the language test, presented in Table 4.1 of Chapter 4. The parameter estimates are slightly different, which is a consequence of the fact that the estimation procedure now is multivariate. The extra result of the multivariate approach is the estimated covariance at level three, cov(U_1j, U_2j) = 14.54, and at level two, cov(R_1ij, R_2ij) = 28.62. The corresponding population correlation coefficients at the school level and the pupil level are, respectively,

   ρ(U_1j, U_2j) = 14.54 / √(19.00 × 12.76) = 0.93 ,
   ρ(R_1ij, R_2ij) = 28.62 / √(65.69 × 32.25) = 0.62 .

This shows that especially the random school effects for language and arithmetic are very strongly correlated. For the correlations between observed variables, these estimates yield a correlation between individuals of (cf. (13.2))

   ρ(Y_1ij, Y_2ij) = (14.54 + 28.62) / √((19.00 + 65.69)(12.76 + 32.25)) = 0.70

and, for groups of a hypothetical size n = 30, a correlation between group means of (cf. (13.6))

   (14.54 + 28.62/30) / √((19.00 + 65.69/30)(12.76 + 32.25/30)) = 0.90 .

Explanatory variables included are IQ, the group mean of IQ, and group size. As in the examples in Chapters 4 and 5, the IQ measurement is the verbal IQ from the ISI test, and the average value of 23.1 is subtracted from group size. The correspondence with formulae (13.1) and (13.4) is that X_1 is IQ, X_2 is the group mean of IQ, X_3 is group size, and X_4 = X_1 X_3 represents the interaction between IQ and group size. The results are in Table 13.2.

Table 13.2 Parameter estimates for multivariate model for language and arithmetic tests. [Most entries of this table are illegible in the scan; the deviance of this model is 28,540.7.]

Calculating test statistics for the fixed effects shows that, except for the main effect of group size on either dependent variable, all effects are significant at the 0.05 significance level. The residual correlations are ρ(U_1j, U_2j) = 0.85 at the school level and ρ(R_1ij, R_2ij) = 0.47 at the pupil level. This shows that taking the explanatory variables into account has led to somewhat smaller, but still substantial, residual correlations. Especially the school-level residual correlation is large. This suggests that, also when controlling for IQ, mean IQ, and group size, the factors at school level that determine language and arithmetic proficiency are the same. Such factors could be associated with school policy but also with aggregated pupil characteristics not taken into account here, such as average socio-economic status.
When the interaction effect of group size with IQ is to be tested for both dependent variables simultaneously, this can be carried out by fitting the model from which these interaction effects are excluded. In formula (13.4) these are the effects γ_41 and γ_42 of d_1 X_4 and d_2 X_4. The model from which these effects are excluded has a deviance of 28,549.1, which is 8.4 higher than that of the model of Table 13.2. In a chi-squared distribution with d.f. = 2, this is a significant result (p < 0.05).

13.2 Multivariate random slope models

The notation is quite complex already, and therefore only the case of one random slope is treated here. More random slopes are, in principle, a straightforward extension. Suppose that variable X_1 has a random slope for the various dependent variables. Denote for the h'th dependent variable the random intercept by U_0hj and the random slope of X_1 by U_1hj. The model for the h'th dependent variable then is

   Y_hij = γ_0h + γ_1h x_1ij + ... + γ_ph x_pij + U_0hj + U_1hj x_1ij + R_hij .   (13.7)

The three-level formulation of this model is

   Y_hij = Σ_s γ_0s d_hs + Σ_s Σ_k γ_ks d_hs x_kij + Σ_s R_sij d_hs
           + Σ_s U_0sj d_hs + Σ_s U_1sj d_hs x_1ij .   (13.8)

This means that again there is no random part at level one, there are m random slopes at level two (of variables d_1 to d_m) and 2m random slopes at level three (of variables d_1 to d_m, and of the product variables d_1 X_1 to d_m X_1).
With this kind of model, an obvious further step is to try and model the random intercepts and slopes by group-dependent variables, like in Chapter 5.

14 Discrete Dependent Variables

Up to now, it was assumed in this book that the dependent variable has a continuous distribution and that the residuals at all levels (U_0j, R_ij, etc.) have normal distributions. This provides a satisfactory approximation for many data sets. However, there also are many situations where the dependent variable is 'so discrete' that the assumption of a continuous distribution could lead to erroneous conclusions. This chapter treats the most frequently occurring kinds of discrete variables: dichotomous variables (with only two values), ordered categorical variables (with a small number of ordered categories, e.g., 1, 2, 3, 4), and counts (with natural numbers as values: 0, 1, 2, 3, etc.).

14.1 Hierarchical generalized linear models

Important instances of discrete dependent variables are dichotomous variables (e.g., success vs. failure of whatever kind) and counts (e.g., in the study of some kind of event, the number of events happening in a predetermined time period). It is (usually) unwise to apply linear regression methods to such variables, for two reasons.
The first reason is that the range of such a dependent variable is restricted, and the usual linear regression model might take its fitted value outside this allowed range. For example, dichotomous variables can be represented as having the values 0 and 1. A fitted value of, say, 0.7, still can be interpreted as a probability of 0.7 for the outcome 1 and a probability of 0.3 for the outcome 0. But what about a fitted value of -0.23 or 1.08? A meaningful model for outcomes that have only the values 0 or 1 should not allow fitted values that are negative or greater than 1. Similarly, a meaningful model for count data should not lead to negative fitted values.
The second reason is of a more technical nature, and is the fact that for discrete variables there is often some natural relation between the mean and the variance of the distribution. For example, for a dichotomous variable Y that has probability p for outcome 1 and probability 1 - p for outcome 0, the mean is

   E Y = p

and the variance is

   var(Y) = p (1 - p) .   (14.1)

Thus, the variance is not a free parameter but is determined already by the mean. In terms of multilevel modeling, this leads to a relation between the parameters in the fixed part and the parameters of the random part.
This has led to the development of regression-like models that are more complicated than the usual multiple linear regression model and that take account of the non-normal distribution of the dependent variable, its restricted range, and the relation between mean and variance. The best-known method of this kind is logistic regression, a regression-like model for dichotomous data. Poisson regression is a similar model for count data. In the statistical literature, such models are known as generalized linear models, cf. McCullagh and Nelder (1989) or Long (1997). The present chapter gives an introduction to multilevel versions of some generalized linear models; these multilevel versions are aptly called hierarchical generalized linear models. Much research and also software development is still being carried out to develop and implement statistical procedures for these models, so the introduction and overview provided in this chapter are necessarily limited.

14.2 Introduction to multilevel logistic regression

Logistic regression (treated, e.g., in Hosmer and Lemeshow, 1989; Long, 1997; McCullagh and Nelder, 1989; Ryan, 1997) is a kind of regression analysis for dichotomous, or binary, outcome variables, i.e., outcome variables with two possible values such as 'pass / fail', 'yes / no', or 'like / dislike'.
The reader is advised to study an introductory text on logistic regression if he or she is not already acquainted with this technique.

14.2.1 Heterogeneous proportions

The basic data structure of two-level logistic regression is a collection of N groups ('units at level two'), with in group j (j = 1, ..., N) a random sample of n_j level-one units ('individuals'). The outcome variable is dichotomous and denoted by Y_ij for level-one unit i in group j. The two outcomes are supposed to be coded 0 and 1: 0 for 'failure', 1 for 'success'. The total sample size is denoted by M = Σ_j n_j. If one does not (yet) take explanatory variables into account, the probability of success is constant in each group. The success probability in group j is denoted P_j. In a random coefficient model (cf. Section 4.2.1), the groups are considered as being taken from a population of groups and the success probabilities in the groups, P_j, are regarded as random variables defined in this population. The dichotomous outcome can be represented as the sum of this probability and a residual,

   Y_ij = P_j + R_ij .   (14.2)

In words: the outcome for individual i in group j, which is either 0 or 1, is expressed as the sum of the probability (average proportion of successes) in this group plus some individual-dependent residual. This residual has (like all residuals) mean zero, but for these dichotomous variables it has the peculiar property that it can assume only the values -P_j and 1 - P_j, since Y_ij must be 0 or 1. A further peculiar property is the fact that, given the value of the probability P_j, the variance of the residual is

   var(R_ij) = P_j (1 - P_j) ,   (14.3)

in accordance with formula (14.1).
Equation (14.2) is the dichotomous analogue of the empty (or unconditional) model defined for continuous outcomes in equations (3.1) and (4.6). Section 3.3.1 remains valid for dichotomous outcome variables, with P_j taking the place of the group mean, except for one subtle distinction. In the empty model for continuous outcome variables it was assumed that the level-one residual variance was constant. This is not adequate here because, in view of formula (14.3), the groups have different within-group variances. Therefore the parameter σ² must be interpreted here as the average residual variance, i.e., the average of (14.3) in the population of all groups. With this modification in the interpretation, the formulae of Section 3.3.1 still are valid. For example, the intraclass correlation coefficient is still defined by (3.2) and can be estimated by (3.12). Another definition of the intraclass correlation is also possible, however, as is mentioned below in Section 14.3.2.
Since the outcome variable is coded 0 and 1, the group average

   Ȳ_j = (1/n_j) Σ_i Y_ij   (14.4)

now is the proportion of successes in group j. This is an estimate for the group-dependent probability P_j. Similarly, the overall average

   Ȳ = (1/M) Σ_j n_j Ȳ_j   (14.5)

here is the overall proportion of successes.

Testing heterogeneity of proportions
To test whether there are indeed systematic differences between the groups, the well-known chi-squared test can be used. The test statistic of the chi-squared test for a contingency table is often given in the familiar form Σ (O - E)²/E, where O is the observed and E the expected count in a cell of the contingency table. In this case it can also be written as

   X² = Σ_{j=1}^{N} n_j (Ȳ_j - Ȳ)² / ( Ȳ (1 - Ȳ) ) .   (14.6)

It can be tested in the chi-squared distribution with N - 1 degrees of freedom. This chi-squared distribution is an approximation, valid if the expected numbers of successes and of failures in each group, n_j Ȳ and n_j (1 - Ȳ), respectively, all are at least 1 while 80 percent of them are at least 5 (cf. Agresti, 1990, p. 246). This condition will not always be satisfied, and the chi-squared test then may be seriously in error. For a large number of groups the null distribution of X² then can be approximated by a normal distribution with the correct mean and variance, cf. Haldane (1940) and McCullagh and Nelder (1989, p. 244); or an exact permutation test may be used.
Another test of heterogeneity of proportions was proposed by Commenges and Jacqmin (1994). The test statistic is

   T = ( Σ_j n_j² (Ȳ_j - Ȳ)² - M Ȳ (1 - Ȳ) ) / ( Ȳ (1 - Ȳ) √( 2 Σ_j n_j (n_j - 1) ) ) .   (14.7)

Large values of this statistic are an indication of heterogeneous proportions. This statistic can be tested in a standard normal distribution. The fact that the numerator contains a weight of n_j² whereas the chi-squared test uses the weight n_j shows that these two tests combine the groups in different ways. When the group sizes n_j are different, it is possible that the two tests lead to different outcomes.
The theoretical advantage of test (14.7) over the chi-squared test is that it has somewhat higher power to test randomness of the observations against the empty model treated below, i.e., against the alternative hypothesis represented by (14.10) with τ₀² > 0. The practical advantage is that it can be applied whenever there are many groups, even with small group sizes, provided that no single group dominates. A rule of thumb for the application of this test is that there should be at least N = 10 groups, the biggest group should not have a relative share larger than n_j/M = 0.10, and the ratio of the largest group size to the 10th largest group size should not be more than 10.

Estimation of between- and within-groups variance
The true variance between the group-dependent probabilities, i.e., the population value of var(P_j), can be estimated by formula (3.11),

   τ̂² = S²_between - S²_within / ñ ,

where ñ is defined as in (3.7). For dichotomous outcome variables, the observed between-groups variance is closely related to the chi-squared test statistic (14.6). They are connected by the formula

   S²_between = Ȳ (1 - Ȳ) X² / ( ñ (N - 1) ) .

The within-groups variance in the dichotomous case is a function of the group averages, viz.,

   S²_within = ( 1 / (M - N) ) Σ_j n_j Ȳ_j (1 - Ȳ_j) .
They are connected by the formula

    S²_between = ( X² / ( ñ (N − 1) ) ) Ȳ.. (1 − Ȳ..) .

The within-groups variance in the dichotomous case is a function of the group averages, viz.,

    S²_within = (1/(M − N)) Σ_j n_j Ȳ.j (1 − Ȳ.j) .

Example 14.1 Experience with cohabitation in Norway.
This example is about the influence of age and geographical region on whether inhabitants of Norway have experience with living together with a partner without being married, using data collected in the ISSP-1994 survey (International Social Survey Programme, 1994) on Family and Changing Gender Roles. The dependent variable is whether the respondent ever lived together with a partner without being married ('experience with cohabitation'). After deleting cases with missing or inconsistent answers, there were 2,079 respondents classified into 19 geographical regions in Norway. The number of respondents per region ranged from 35 (in Finnmark) to 235 (in Oslo). It was expected that cultural differences will lead to differences between regions and between age groups in the proportion of individuals having experience with cohabitation. In the whole data set, 43.4 percent of the respondents reported that they had ever cohabited with a partner.

A two-level structure is used with the region as the second-level unit. This is based on the idea that there may be differences between regions that are not captured by the explanatory variables and hence may be regarded as unexplained variability within the set of all regions. Figure 14.1 represents for each of the 19 regions the proportion of people in the sample who ever lived together without being married.

[Figure 14.1 Proportion of experience with cohabitation, by region.]

Are these proportions, ranging from 0.31 to 0.55, very different? The chi-squared test (14.6) for equality of these 19 proportions yields X² = 35.40, d.f. = 18, p < 0.01. Thus, there is evidence for heterogeneity between the regions.
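As a numerical companion to the two heterogeneity tests, the following sketch (plain Python; the group sizes and success counts are made-up illustrative data) computes the chi-squared statistic (14.6) and a Commenges-Jacqmin-type statistic. The latter is coded here as T = (Σ n_j²(Ȳ.j − Ȳ..)² − Ȳ..(1 − Ȳ..) Σ n_j) / (Ȳ..(1 − Ȳ..) √(2 Σ n_j(n_j − 1))), a reconstruction that should be checked against (14.7) in the original text before serious use.

```python
import math

def heterogeneity_tests(n, successes):
    """Chi-squared statistic (14.6) and a Commenges-Jacqmin-type statistic
    for heterogeneity of group proportions; n[j] is the size of group j,
    successes[j] the number of 1-outcomes in group j."""
    M = sum(n)
    p_groups = [s / nj for s, nj in zip(successes, n)]
    p = sum(successes) / M                       # overall proportion
    # X^2 = sum_j n_j (Ybar.j - Ybar..)^2 / (Ybar.. (1 - Ybar..)), df = N - 1
    x2 = sum(nj * (pj - p) ** 2 for nj, pj in zip(n, p_groups)) / (p * (1 - p))
    # Commenges-Jacqmin-type statistic, referred to a standard normal
    num = (sum(nj ** 2 * (pj - p) ** 2 for nj, pj in zip(n, p_groups))
           - p * (1 - p) * M)
    den = p * (1 - p) * math.sqrt(2 * sum(nj * (nj - 1) for nj in n))
    return x2, num / den

# illustrative data: four groups with unequal sizes
x2, t = heterogeneity_tests([50, 60, 40, 50], [20, 35, 10, 25])
```

Equal observed proportions give X² = 0, and increasingly different proportions inflate both statistics.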
The estimated true variance between the region-dependent proportions, calculated from (3.11), is τ̂² = 0.0022; thus the estimated true between-region standard deviation is τ̂ = √0.0022 = 0.047. With an average probability of 0.434, this standard deviation is not very large but not quite negligible either.

14.2.3 The logit function: Log-odds

It can be relevant to include explanatory variables in models for dichotomous outcome variables. For example, in the example above, the individual's age will undoubtedly have a major influence on experience with cohabitation. When explanatory variables are included to model probabilities, a problem is that probabilities are restricted to the domain between 0 and 1, whereas a linear effect of an explanatory variable could take the fitted value outside this interval. This was mentioned already in the introduction.

Instead of the probability of some event, one may consider the odds: the ratio of the probability of success to the probability of failure. When the probability is p, the odds are p/(1 − p). In contrast to probabilities, odds can assume any value from 0 to infinity, and odds can be considered to constitute a ratio scale.

The logarithm transforms a multiplicative to an additive scale and transforms the set of positive real numbers to the whole real line. Indeed, one of the most widely used transformations of probabilities is the log-odds, defined by

    logit(p) = ln( p / (1 − p) ) .        (14.8)

The logit function, of which a graph is shown in Figure 14.2, is an increasing function defined for numbers between 0 and 1, and its range is from minus infinity to plus infinity. Figure 14.3 shows in a different way how probability values are transformed to logit values: p = 0.269 is transformed to logit(p) = −1, p = 0.982 to logit(p) = 4, and the logit of p = 0.5 is exactly 0.

[Figure 14.2 The logit function.]

    p         0.05   0.12   0.27   0.50   0.73   0.88   0.95
    logit(p)  −3     −2     −1      0      1      2      3

Figure 14.3 Correspondence between p and logit(p).
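The correspondence in Figure 14.3 is easy to verify numerically; the sketch below (plain Python) codes the logit and its inverse, the logistic function.

```python
import math

def logit(p):
    """Log-odds: logit(p) = ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def logistic(x):
    """Inverse of the logit: maps the whole real line back to (0, 1)."""
    return 1 / (1 + math.exp(-x))

# the probability values shown in Figure 14.3 and their log-odds
pairs = [(p, logit(p)) for p in (0.05, 0.12, 0.27, 0.50, 0.73, 0.88, 0.95)]
```

For instance, logit(0.269) is about −1 and logit(0.982) is about 4, as in the text.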
The logistic regression model is a model where logit(p) is a linear function of the explanatory variables. In spite of the attractive properties of the logit function, it is by no means the only suitable function for transforming probabilities to arbitrary real values. The general term for such a transformation function is the link function, as it links the probabilities (or, more generally, the expected values of the dependent variable) to the explanatory variables. The probit function (which is the inverse cumulative distribution function of the standard normal distribution) is another often used link function.

    (t − a)+ = { t − a   for t > a
               { 0       for t ≤ a .

The age t = 20 is used as the reference value, so that the intercept refers to the age of 20 years. Table 14.2 contains as Model 1 the estimation results for this model.

The coefficients of all these functions of age cannot be interpreted in isolation from each other. Together they represent the following function of age, which here is the model's fixed part:

    −1.200 + 0.539 (t − 20) − 0.029 (t − 20)² + 0.0241 ((t − 30)+)² + 0.0068 ((t − 40)+)² ,

where (t − a)+ is defined by (t − a) for t > a and by 0 otherwise. This function is represented graphically in Figure 14.6.

Table 14.2 Estimates for two models for cohabitation data.

                                          Model 1              Model 2
  Fixed Effect                        Coefficient  S.E.    Coefficient  S.E.
  γ0  Intercept                         −1.200
  γ1  Age − 20                           0.539
  γ2  (Age − 20)²                       −0.029    0.0038
  γ3  (Age − 30)² for Age > 30           0.0241   0.0054     0.0235    0.0055
  γ4  (Age − 40)² for Age > 40           0.0068   0.0025     0.0075    0.0025
  γ5  Religion                                              −1.85      0.24

  Random Effect                       Var. Comp.  S.E.     Var. Comp.  S.E.
  τ0² = var(U0j)                         0.061                0.049

[Figure 14.6 Age effect on cohabitation experience (fixed part): fitted log-odds plotted against age from 20 to 70 years.]

The figure shows that the probability of experience with cohabitation swiftly increases to a maximum attained at about 30 years, after which it decreases again. The fitted logits at 20, 30, and 70 years are −1.2, 1.29, and −2.07, respectively, corresponding (cf.
Figure 14.3) to probabilities of 0.23, 0.78, and 0.11.

When age is being controlled for in this way, there remains a random region effect. The estimated variance, 0.061, corresponding to a standard deviation of the random intercepts of 0.25, is even larger than it was in the empty model. In Section 7.1 we remarked that this is possible for the hierarchical linear model with normal distributions, and it may be a sign of misspecification of the model. In the hierarchical linear model for dichotomous data this is not necessarily the case. In Section 14.3.5 it will be explained that adding level-one variables with strong effects will tend to increase estimated level-two variances, and to make the regression coefficients of already included variables, if these are uncorrelated with the newly included variables, larger in absolute size.

In addition to age, the religious conviction of the individual can also have an effect on cohabitation. Many Christian religions are not in favor of cohabitation without being married. This is represented by including in the model a dummy variable called 'Religion', with value 1 if the respondent reported to attend religious services at least once a month, and value 0 otherwise. The resulting estimates are given in Table 14.2 as Model 2. It appears that religious activity has a strong effect on cohabitation, with an effect of −1.85 on the log-odds and a t-ratio of −1.85/0.24 = −7.7 (highly significant). Including this effect hardly changes the age effect but does lead to a smaller random intercept variance. This is because the regions differ in the proportion of religious inhabitants. Including the regional average of this dummy variable, which is just the regional proportion of religious inhabitants, showed that the within-region regression coefficient does not differ significantly from the between-region regression coefficient (parameter estimates not shown here).
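The piecewise quadratic age function of Model 1 is easy to evaluate directly. In the sketch below (plain Python), the coefficients are as reconstructed here from the partly legible Table 14.2 and the fitted logits quoted in the text, so treat them as approximate and check them against the original table before reuse.

```python
import math

def pos_part(x):
    """(x)+ = x for x > 0 and 0 otherwise (truncated power basis term)."""
    return x if x > 0 else 0.0

def fixed_part(t):
    """Fixed part of Model 1 on the log-odds scale, age t in years;
    coefficients reconstructed from Table 14.2 (approximate)."""
    return (-1.200
            + 0.539 * (t - 20)
            - 0.029 * (t - 20) ** 2
            + 0.0241 * pos_part(t - 30) ** 2
            + 0.0068 * pos_part(t - 40) ** 2)

def prob(t):
    """Fitted probability of cohabitation experience at age t."""
    return 1 / (1 + math.exp(-fixed_part(t)))
```

Evaluating at ages 20, 30, and 70 reproduces the fitted logits −1.2, 1.29, and −2.07 reported in the text.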
14.2.5 Estimation

Parameter estimation in hierarchical generalized linear models is more complicated than in hierarchical linear models. Inevitably some kind of approximation is involved, and various kinds of approximation have been proposed. Reviews were given by Rodriguez and Goldman (1995) and Davidian and Giltinan (1995). We mention some references and some of the terms used without explaining them. The reader who wishes to study these algorithms is referred to the mentioned literature. More information about the computer programs mentioned in this section is given in Chapter 15.

The most frequently used methods are based on a first- or second-order Taylor expansion of the link function. When the approximation is around the estimated fixed part, this is called marginal quasi-likelihood (MQL); when it is around an estimate for the fixed plus the random part, it is called penalized or predictive quasi-likelihood (PQL) (Breslow and Clayton, 1993; Goldstein, 1991; Goldstein and Rasbash, 1996). This procedure is implemented in the programs MLn/MLwiN, HLM, and VARCL. A Laplace approximation, which is supposed to be more precise, was proposed by Raudenbush, Yang, and Yosef (1999). This is implemented in HLM, version 5. Other approximation methods were proposed by Wong and Mason (1985) and by Engel and Keen (1994).

Numerical integration is used in procedures proposed by Stiratelli, Laird, and Ware (1984), Anderson and Aitkin (1985), Gibbons and Bock (1987), and Longford (1994). An example is given in Gibbons and Hedeker (1994). This was extended to three-level models by Gibbons and Hedeker (1997). This numerical integration method is implemented in the program MIXOR and its relatives. An important practical advantage of this method is the production of a deviance statistic (see Section 6.2) that can be used for hypothesis testing. The deviance statistic produced by MQL and PQL methods cannot be used in this way.
It may be noted that MIXOR estimates for the random part parameters the standard deviations τ0, etc., rather than the variances τ0², and that formula (5.15) describes how the standard errors of the standard deviations can be transformed to those of the variances.

Computer-intensive methods related to the bootstrap and the Gibbs sampler were proposed by Zeger and Karim (1991), Kuk (1995), Meijer et al. (1995), and McCulloch (1997). Some of such procedures are implemented in MLwiN and MLA. A method based on the principle of indirect inference was proposed by Mealli and Rampichini (1999). Another computer-intensive method is the method of simulated moments. This method is applied to the models of this chapter by Gouriéroux and Monfort (1996, Section 3.1.4 and Chapter 5), and an overview of some recent work is given by Baltagi (1995, Section 10.4.3).

The estimates produced by these methods differ primarily with respect to the parameters of the random part. The estimates for the fixed parameters often are not strongly different. But especially if the variance components are rather large, these methods may produce quite different estimates for the random part parameters, which in turn will have an effect on the estimated standard errors for the fixed part parameters.

The first-order MQL and PQL estimates of the variance parameters of the random part have an appreciable downward bias (Rodriguez and Goldman, 1995). The second-order MQL and PQL methods produce parameter estimates with less bias but, it seems, a higher mean squared error. The numerical integration approach and the Laplace approximation seem to produce statistically more satisfactory estimates than the MQL and PQL approaches. They have the additional advantage that the deviance produced is reliable, in contrast to the deviance produced by the MQL and PQL methods. But they are implemented less widely in computer software.
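The conversion between the two parametrizations mentioned above can be sketched with the delta method: if a program reports a standard deviation τ̂ with standard error SE(τ̂), the implied standard error of the variance τ̂² is approximately 2 τ̂ SE(τ̂). This is a first-order approximation, not necessarily the exact form of formula (5.15) in this book.

```python
def se_of_variance(sd_hat, se_sd):
    """Delta method: var = sd^2, so SE(var) ~= 2 * sd * SE(sd).
    Useful when a program (e.g. MIXOR) reports standard deviations
    of random effects while variances and their SEs are wanted."""
    return 2.0 * sd_hat * se_sd

def se_of_sd(var_hat, se_var):
    """The reverse direction: SE(sd) ~= SE(var) / (2 * sqrt(var))."""
    return se_var / (2.0 * var_hat ** 0.5)
```

The two functions are inverses of each other, which is a quick sanity check on the approximation.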
The estimation procedures for these models still are in a state of active development. The choice between these procedures should be based on stability of the algorithm (will the algorithm converge to a valid estimate?), statistical efficiency, availability of software, and the possibility to obtain parameter tests. The currently available algorithms are not perfectly stable; whether they will converge depends on the data set, the complexity of the fitted model, and the starting values. Small group sizes can contribute to instability of the algorithm. Of the first- and second-order MQL and PQL methods, the first-order MQL method is the most stable.

14.2.6 Aggregation

If the explanatory variables assume only few values, then it is advisable to aggregate the individual 0-1 data to success counts, depending on the explanatory variables, within the level-two units. This will improve the speed and stability of the algorithm as well as, for some estimation methods, the statistical efficiency. This is carried out as follows.

For a random intercept model with a small number of discrete explanatory variables X1 to Xr, let L be the total number of combinations of values (x1, ..., xr). All individuals with the same combination of values (x1, ..., xr) are treated as one subgroup in the data. They all have a common success probability, given by (14.15). Thus, each level-two unit includes L subgroups, or fewer if some of the combinations do not occur in this level-two unit. Aggregation is advantageous if L is considerably less than the average group size n_j. Denote by

    n_j(x1, ..., xr)

the number of individuals in group j with the values x1, ..., xr on the respective explanatory variables, and denote by

    Y_j(x1, ..., xr)

the number of individuals among these who yielded a success, i.e., for whom Y_ij = 1. Then Y_j(x1, ..., xr) has the binomial distribution with binomial denominator ('number of trials') n_j(x1, ...
, xr) and success probability given by (14.15), which is the same for all individuals i in this subgroup. The multilevel analysis is now applied with these subgroups as the level-one units. Subgroups with n_j(x1, ..., xr) = 0 can be omitted from the data set.

14.2.7 Testing the random intercept

The parameters in the fixed part of a multilevel logistic model can be tested by t-tests or Wald tests in the way indicated in Section 6.1. Testing parameters of the random part, however, is more difficult than in a hierarchical linear model with normal distributions. If one is using a program that implements numerical integration (such as MIXOR, cf. Section 14.2.5) or the Laplace approximation (such as HLM version 5), the deviance can be used to produce chi-squared tests just like in Section 6.2. But the deviance values produced by the MQL and PQL methods are such crude approximations that they cannot be used in reliable deviance tests.

Another test for the random intercept, in the case that there are some explanatory variables, was proposed by Commenges et al. (1994). When there are no explanatory variables (i.e., the empty model is being considered), the test statistic reduces to (14.7). This is a test for the null hypothesis that there is no random intercept, i.e., τ0² = 0, while controlling for the fixed effects of the explanatory variables. This means that the null hypothesis is the usual logistic regression model with explanatory variables X1 to Xr, while the alternative hypothesis adds to this a random intercept as in model (14.15). The test is based on the so-called score principle, which means that it requires only the estimation of the parameters under the null model. (Therefore it can be calculated from the results of fitting a logistic regression model, without random intercept, either by multilevel or by conventional (non-multilevel) software. So one might wait with turning to multilevel software until this test turns out to be significant.)
For the details of this test we refer to the paper cited.

14.3 Further topics about multilevel logistic regression

14.3.1 Random slope model

The random intercept logistic regression model of Section 14.2.4 can be extended to a random slope model just like in Chapter 5. We only give the formula for one random slope; the extension to several random slopes is straightforward. The remarks made in Chapter 5 remain valid, given the appropriate changes, and will not be repeated here.

Like in the random intercept model, assume that there are r explanatory variables X1 to Xr. Assume that the effect of the first one, X1, is variable across groups, and accordingly has a random slope. Expression (14.15) for the logit of the success probability is extended with the random effect U1j x1ij, which leads to

    logit(P_ij) = γ0 + Σ_h γh x_hij + U0j + U1j x1ij .        (14.16)

There now are two random group effects, the random intercept U0j and the random slope U1j. It is assumed that both have a zero mean. Their variances are denoted, respectively, by τ0² and τ1², and their covariance is denoted by τ01.

Testing a random slope in multilevel logistic regression models can be based on the deviance, provided that the deviance statistic is based on numerical integration rather than on an approximation method (see Section 14.2.5). Such a deviance statistic is produced by MIXOR and HLM version 5 (see Chapter 15). The deviance statistic currently provided by MLn/MLwiN cannot be used for this purpose, because it is based on a different type of approximation. Testing procedures for random slopes in hierarchical generalized linear models are still a matter of active research; see, e.g., Lin (1997).

Example 14.4 Friends and foes in the former GDR.
Personal networks are data about the relations of individuals: each respondent is asked to mention other persons to whom she or he is related according to some specific criterion, and to give further information about the relation with this person.
In this context, respondents are referred to as 'egos' and the mentioned relations as 'nominees' or 'alters'. In the resulting data set, alters are nested within egos. The application of multilevel analysis to personal networks is discussed in Snijders, Spreen, and Zwaagstra (1994) and in Van Duijn, Van Busschbach, and Snijders (1999).

Völker (1995) collected personal network data in the former German Democratic Republic (GDR) and investigated changes in personal relations associated with the downfall of communism and the German reunification. The data were collected retrospectively in 1992 and 1993. Based on these data, Völker and Flap (1997) studied neighborhood relations in the former GDR on the basis of data about relationships existing in 1989 (before the fall of the Berlin Wall). This example reanalyses data presented in the latter paper, focusing on trust in relations. The question is, what influences whether an individual distrusts another person to whom he or she is related.

The level-two units are the respondents (also called 'egos'). For each respondent the personal network is delineated in a standardized way, which leads to a list of related persons ('alters', 'network members'). This list is the group of level-one units. The dichotomous dependent variable indicates whether ego distrusts alter. Since family relations and friends were almost never distrusted, these relations were omitted from the data set, which left the following role relations: acquaintance, neighbor, colleague, superior, and subordinate. This resulted in a set of 426 egos with a total of 1,683 alters. The analyses were carried out using MIXOR.

There are two political variables. Membership of the communist party (SED) for ego is denoted by X1 = 1 (yes) or 0 (no).
It was unknown whether alter is a party member himself, but the variable X2 indicates whether his function requires party membership or is under some kind of political supervision, with the values X2 = 1 (yes) and 0 (no).

Model 1 presented in Table 14.3 includes the fixed effects of these variables and a random slope for X2. The fixed effects of the political variables are non-significant. The deviance of the model without the random slope for X2 is 1637.77. Hence the test for the random slope has χ² = 5.48, d.f. = 2, which yields (one-sided, see Section 6.2.1) p < 0.05. The estimated slope variance is quite large (5.75) but its standard error is even larger. It can be concluded that although the random slope is significant, there is a large uncertainty about the slope variance.

Table 14.3 Parameter estimates for two models for distrust relations.

                                                  Model 1        Model 2
  Fixed Effect                                  Par.   S.E.    Par.   S.E.
  γ0  Intercept
  Political variables
  γ1  Party member ego                         −0.30   0.24   −0.25   0.25
  γ2  Political function alter                         0.99    0.15   0.78
  γ3  Party member × pol. function                            −2.10   1.07
  Role relations
  γ4  Colleague                                                1.15   0.21
  γ5  Superior
  γ6  Subordinate                                                     0.90
  γ7  Neighbor                                                 2.31   0.31

  Random Effect                                 Par.   S.E.    Par.   S.E.
  τ0² = var(U0j)  intercept variance            1.85   0.37    1.68
  τ2² = var(U2j)  slope variance of
                  political function            5.75   7.43    5.54   6.27
  τ02 = cov(U0j, U2j)  intercept-slope
                  covariance                   −0.08   1.02   −0.22   1.04
  Deviance                                    1632.29         1519.95

Model 2 in this table also includes the interaction between party membership of ego and political function of alter, and the fixed effect of the role relation between respondent and network member. The interaction variable is defined as X3 = X1 × X2. The role relation is captured by four dummy variables derived from a categorical variable with five values: acquaintance, colleague, subordinate, superior, neighbor. The 'acquaintance' relation is used as the reference category. The four resulting dummy variables, X4 to X7, are 1 if the relation is in the indicated category and 0 if it is not.
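Dummy coding of the five-category role relation can be sketched as follows (plain Python; the function name and error handling are illustrative, not part of the original analysis):

```python
# the reference category 'acquaintance' gets no dummy of its own;
# the four dummies correspond to X4..X7 in the text
CATEGORIES = ["colleague", "superior", "subordinate", "neighbor"]

def role_dummies(role):
    """Map a role relation to four 0/1 dummy variables, with
    'acquaintance' as the reference category (all dummies 0)."""
    if role != "acquaintance" and role not in CATEGORIES:
        raise ValueError("unknown role relation: " + role)
    return [1 if role == cat else 0 for cat in CATEGORIES]
```

For example, role_dummies("neighbor") gives [0, 0, 0, 1], while the reference category role_dummies("acquaintance") gives [0, 0, 0, 0].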
The results for Model 2 show that the interaction between the political variables for ego and for alter is not significant. With respect to the role relations, neighbors are distrusted most, the superiors and colleagues occupy a middle position in this respect, while acquaintances (which as the reference category have an effect 0.0) and subordinates are distrusted least. The random slope for alter's political function is hardly affected by the inclusion of the interaction effect and the role relations.

14.3.2 Representation as a threshold model

The multilevel logistic regression model can also be formulated as a so-called threshold model. The dichotomous outcome Y, 'success' or 'failure', then is conceived as the result of an underlying non-observed continuous variable. When Y denotes passing or failing some test or exam, the underlying continuous variable could be the scholastic aptitude of the subject; when Y denotes whether the subject behaves in a certain way, the underlying variable could be a variable representing benefits minus costs of this behavior; etc. Denote the underlying variable by Ỹ. Then the threshold model states that Y is 1 if Ỹ is larger than some threshold, and 0 if it is less than the threshold. Since the model is about unobserved entities, it is not a restriction to assume that the threshold is 0. This leads to the representation

    Y_ij = { 1   if Ỹ_ij > 0
           { 0   if Ỹ_ij ≤ 0 .        (14.17)

For the unobserved variable Ỹ, a usual random intercept model is assumed:

    Ỹ_ij = γ0 + Σ_h γh x_hij + U0j + R_ij .        (14.18)

To represent a logistic regression model, the level-one residual of the underlying variable Ỹ must have a logistic distribution. This means that, when the level-one residual is denoted by R_ij, the cumulative distribution function of R_ij must be the logistic function,

    P(R_ij ≤ z) = logistic(z)   for all z ,        (14.19)

defined in (14.11). This is a symmetric probability distribution, so that also P(−R_ij ≤ z) = logistic(z) for all z.
Its mean is 0 and its variance is π²/3 = 3.29. When it is assumed that R_ij has this distribution, the logistic random intercept model (14.15) is equivalent to the threshold model defined by (14.17) and (14.18).

To represent the random slope model (14.16) as a threshold model, we define

    Ỹ_ij = γ0 + Σ_h γh x_hij + U0j + U1j x1ij + R_ij ,        (14.20)

where R_ij has a logistic distribution. It then follows that

    P(Y_ij = 1) = P(Ỹ_ij > 0)
                = P( −R_ij ≤ γ0 + Σ_h γh x_hij + U0j + U1j x1ij )
                = logistic( γ0 + Σ_h γh x_hij + U0j + U1j x1ij ) .

Since the logit and the logistic functions are each other's inverses, the last equation is equivalent with (14.16).

If the residual R_ij has a standard normal distribution with unit variance, then the probit link function is obtained. Thus, the threshold model which specifies that the underlying variable Ỹ has a distribution according to the hierarchical linear model of Chapters 4 and 5, with a normally distributed level-one residual, corresponds exactly to the multilevel probit regression model. Since the standard deviation of R_ij is √(π²/3) = 1.81 for the logistic and 1 for the probit model, the fixed effect estimates for the logistic model will tend to be about 1.81 times as big as for the probit model, and the variance parameters of the random part about π²/3 = 3.29 times as big (but see Long (1997, p. 48), who notes that in practice the proportionality constant for the fixed estimates is closer to 1.7).

14.3.3 Residual intraclass correlation coefficient

The intraclass correlation coefficient for the multilevel logistic model can be defined in at least two ways. The first definition is by applying the definition in Section 3.3 straightforwardly to the binary outcome variable Y_ij. This approach was also mentioned in Section 14.2.1. It was followed, e.g., by Commenges and Jacqmin (1994). The second definition is by applying the definition in Section 3.3 to the unobserved underlying variable Ỹ_ij.
Since the logistic distribution for the level-one residual implies a variance of π²/3 = 3.29, this implies that for a two-level logistic random intercept model with an intercept variance of τ0², the intraclass correlation is

    ρ_I = τ0² / ( τ0² + π²/3 ) .

These two definitions are different and will lead to somewhat different outcomes. For example, for the empty model for the Norwegian cohabitation data presented in Table 14.1, the first definition yields 0.00222 / (0.00222 + 0.244) = 0.009, whereas the second definition leads to the value 0.032 / (0.032 + 3.29) = 0.010. In this case, the difference is immaterial and both values of the intraclass correlation are very small.

An advantage of the second definition is that it can be directly extended to define the residual intraclass correlation coefficient, i.e., the intraclass correlation which controls for the effect of explanatory variables. The example can be continued by moving to the two models in Table 14.2. The residual intraclass correlation controlling for age (Model 1) is 0.061 / (0.061 + 3.29) = 0.018; controlling for age as well as religion (Model 2), it is 0.049 / (0.049 + 3.29) = 0.015. Thus, introducing a level-one variable increases the residual intraclass correlation, while controlling for religion brings the intraclass correlation down again. This can be understood from the fact that religion is a level-one variable with, however, an appreciable variation at level two.

For the multilevel probit model, the second definition for the intraclass correlation (and its residual version) leads to

    ρ_I = τ0² / ( τ0² + 1 ) ,

since this model fixes the level-one residual variance of the unobservable variable Ỹ to 1.

14.3.4 Explained variance

There are several definitions of the explained proportion of variance (R²) in single-level logistic and probit regression models. Reviews are given by Hagle and Mitchell (1992), Veall and Zimmermann (1992), and Windmeijer (1995).
Long (1997, Section 4.3) presents an extensive overview. One of these definitions, the R² measure of McKelvey and Zavoina (1975), which is based on the threshold representation treated in Section 14.3.2, is considered very attractive in each of these reviews. In this section we propose a measure for the explained proportion of variance which extends McKelvey and Zavoina's measure to the logistic and probit random intercept model.

It is assumed that it makes sense to conceive of the dichotomous outcomes Y as being generated through a threshold model with underlying variable Ỹ. In addition, it is assumed that the explanatory variables X_h are random variables. (This assumption is always made for defining the explained proportion of variance; see the introduction of Section 7.1.) Therefore the explanatory variables are, like in Chapter 7, indicated by capital letters. For the underlying variable Ỹ, equation (14.18) gives the expression

    Ỹ_ij = γ0 + Σ_h γh X_hij + U0j + R_ij .

Denote the fixed part by

    F_ij = γ0 + Σ_h γh X_hij .

This variable is also called the linear predictor for Ỹ. Its variance is denoted by σ_F². The intercept variance is var(U0j) = τ0², and the level-one residual variance is denoted var(R_ij) = σ_R². Recall that σ_R² is fixed to π²/3 = 3.29 for the logit model and to 1 for the probit model. For a randomly drawn level-one unit i in a randomly drawn level-two unit j, the X-values are randomly drawn from the corresponding population, and hence the total variance of Ỹ_ij is equal to

    var(Ỹ_ij) = σ_F² + τ0² + σ_R² .

The explained part of this variance is σ_F² and the unexplained part is τ0² + σ_R².
Of this unexplained variation, τ0² resides at level two and σ_R² at level one. Hence the explained proportion of variation can be defined by

    R²_dicho = σ_F² / ( σ_F² + τ0² + σ_R² ) .        (14.21)

The corresponding definition of the residual intraclass correlation,

    ρ_I = τ0² / ( τ0² + σ_R² ) ,        (14.22)

was also given in Section 14.3.3.

To estimate (14.21), one can first compute the linear predictor F_ij using the estimated coefficients γ̂0, γ̂1, ..., γ̂r, and then estimate the variance σ_F² by the observed variance of this computed variable. It is then easy to plug this value into (14.21) together with the estimated intercept variance τ̂0² and the fixed value of σ_R². In the interpretation of this R² value it should be kept in mind that such values are known, for single-level logistic regression, to be usually considerably lower than the OLS R² values obtained for predicting continuous outcomes.

Example 14.5 Taking a science subject in high school.
This example is about whether pupils take at least one science subject for the high school exam, using data on pupils in 240 secondary schools. The explanatory variables are the pupil's gender and minority status. The model was estimated with the MIXOR program; the results are presented in Table 14.4 as Model 1.

Table 14.4 Estimates for probability to take a science subject.

                                Model 1
  Fixed Effect              Coefficient   S.E.
  Intercept                    0.487      0.100
  Gender                      −1.615
  Minority                    −0.727      0.085

  Random Effect             Var. Comp.    S.E.
  τ0² = var(U0j)               0.481      0.082

The linear predictor for this model is

    Ŷ_ij = 0.487 − 1.615 Gender_ij − 0.727 Minority_ij ,

and the variance of this variable in the sample is 0.582. Hence the estimated explained proportion of variation is

    R²_dicho = 0.582 / (0.582 + 0.481 + 3.29) = 0.13 .

In other words, gender and minority status explain about 13 percent of the variation in whether the pupil takes at least one science subject for the high school exam.
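The variance decomposition for this R² measure is a one-line computation once the three variance components are available. The sketch below (plain Python) codes the explained proportion, the level-wise shares of the unexplained variation, and the residual intraclass correlation, using the numbers quoted in the science-subject example.

```python
import math

LOGISTIC_VAR = math.pi ** 2 / 3    # level-one residual variance, about 3.29

def variance_shares(var_fixed, tau0_sq, sigma_r_sq=LOGISTIC_VAR):
    """Split the total variance of the underlying variable into the
    shares explained (var of the linear predictor), unexplained at
    level two, and unexplained at level one; pass sigma_r_sq=1.0 for
    the probit version."""
    total = var_fixed + tau0_sq + sigma_r_sq
    return var_fixed / total, tau0_sq / total, sigma_r_sq / total

def residual_icc(tau0_sq, sigma_r_sq=LOGISTIC_VAR):
    """Residual intraclass correlation: tau0^2 / (tau0^2 + sigma_R^2)."""
    return tau0_sq / (tau0_sq + sigma_r_sq)

# science-subject example: var(F) = 0.582, intercept variance 0.481
explained, school, pupil = variance_shares(0.582, 0.481)
```

Here `explained` is about 0.13, `school` about 0.11, and `pupil` about 0.76, reproducing the decomposition reported in the text.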
The unexplained proportion of variation, 1 − 0.13 = 0.87, can be written as

    0.481 / (0.582 + 0.481 + 3.29) + 3.29 / (0.582 + 0.481 + 3.29) = 0.11 + 0.76 ,

which represents the fact that 11 percent of the variation is unexplained variation at the school level and 76 percent is unexplained variation at the pupil level. The residual intraclass correlation is ρ_I = 0.481 / (0.481 + 3.29) = 0.13.

14.3.5 Consequences of adding effects to the model

When a random intercept is added to a given logistic or probit regression model, and also when variables with fixed effects are added to such a model, the effects of earlier included variables may change. The nature of this change, however, can be different from such changes in OLS or multilevel linear regression models for continuous variables.

This phenomenon can be illustrated by continuing the example of the preceding section. Table 14.5 presents three other models for the same data, in all of which some elements were omitted from Model 1 as presented in Table 14.4.

Table 14.5 Three models for taking a science subject.

                          Model 2           Model 3           Model 4
  Fixed Effect          Par.    S.E.      Par.    S.E.      Par.    S.E.
  Intercept                                       0.100
  Gender               −1.397   0.102    −1.507   0.102
  Minority                                                 −0.644

  Random Effect
  τ0² = var(U0j)                          0.514             0.203   0.088
  Deviance             3345.15           3251.86           3476.06

Models 2 and 3 include only the fixed effect of gender; Model 2 does not contain a random intercept and therefore is a single-level logistic regression model; Model 3 does include the random intercept. The deviance difference (χ² = 3345.15 − 3251.86 = 93.29, d.f. = 1) indicates that the random intercept is very significant. But the sizes of the fixed effects increase in absolute value when adding the random intercept to the model, both by about 8 percent. Gender is evenly distributed across the 240 schools, and one may wonder why the absolute size of the effect of gender increases when the random school effect is added to the model.

Model 4 differs from Model 1 in that the effect of gender is excluded.
The fixed effect of minority status in Model 4 is −0.644, whereas in Model 1 it is −0.727. The intercept variance in Model 4 is 0.293 and in Model 1 it is 0.481. Again, gender is evenly distributed across schools and across the majority and minority pupils, and the question is how to interpret the fact that the intercept variance, i.e., the unexplained between-school variation, rises, and that also the effect of minority status becomes larger in absolute value, when the effect of gender is added to the model.
The explanation can be given on the basis of the threshold representation. When all fixed effects γ_h and also the random intercept U₀j and the level-one residual R_ij would be multiplied by the same positive constant c, then the unobserved variable Ỹ_ij would also be multiplied by c. This corresponds to multiplying the variances τ₀² and σ_R² by c². However, it follows from (14.17) that the observed outcome Y_ij would not be affected, because when Ỹ_ij is positive then so is cỸ_ij. This shows that the regression parameters and the random part parameters of the multilevel logistic and probit models are meaningful only because the level-one residual variance σ_R² has been fixed to some value, and that this value is more or less arbitrary.
The meaningful parameters in these models are the ratios between the regression parameters γ_h, the random effect standard deviations τ₀ (and possibly τ₁, etc.), and the level-one residual standard deviation σ_R. Armed with this knowledge, we can understand the consequences of adding a random intercept or a fixed effect to a logistic or probit regression model.
When a single-level logistic or probit regression model has been estimated, the random variation of the unobserved variable Ỹ in the threshold model is σ_R². When subsequently a random intercept is added, this random variation becomes σ_R² + τ₀².
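The invariance used in this argument, that multiplying the whole latent model by a positive constant c never changes which side of the threshold the latent variable falls on, can be checked directly (a small Python illustration with arbitrary numbers of our own choosing):

```python
import random

random.seed(42)
c = 2.5                                  # an arbitrary positive constant
for _ in range(1000):
    y_latent = random.gauss(0.2, 1.8)    # arbitrary latent value for one observation
    observed = y_latent > 0              # threshold model: Y = 1 if the latent value > 0
    observed_scaled = c * y_latent > 0   # the same observation after rescaling by c
    assert observed == observed_scaled   # the observed dichotomous outcome is unchanged
print("rescaling by c > 0 leaves all observed outcomes unchanged")
```

This is why only the ratios of the parameters to σ_R are identified, and why σ_R² must be fixed to an arbitrary value.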
For explanatory variables that are evenly distributed between the level-two units, the ratio of the regression coefficients to the standard deviation of the (unexplained) random variation will remain approximately constant. This means that the regression coefficients will be multiplied by about the factor

    √( (σ_R² + τ₀²) / σ_R² ) = √( 1 + τ₀² / σ_R² ) .

In the comparison between Models 2 and 3 above, this factor is √(1 + 0.514/3.29) = 1.08. This is indeed approximately the number by which the regression coefficients were multiplied when going from Model 2 to Model 3.
It can be concluded that, compared to single-level logistic or probit regression analysis, including random intercepts tends to increase (in absolute value) the regression coefficients. In the biostatistical literature, this is known as the phenomenon that population-averaged effects (i.e., effects in models without random effects) are closer to zero than cluster-specific effects (which are the effects in models with random effects). Further discussions can be found in Neuhaus et al. (1991), Neuhaus (1992), and Diggle et al. (1994, Section 7.4).
Now suppose that a multilevel logistic or probit regression model has been estimated, and the fixed effect of some level-one variable X_{r+1} is added to the model. One would think that this would lead to a decrease in the level-one residual variance σ_R². However, this is impossible, as this residual variance is fixed; so, instead, the estimates of the other regression coefficients will tend to become larger in absolute value, and the intercept variance (and slope variances, if any) will also tend to become larger. If the level-one variable X_{r+1} is uncorrelated with the other included fixed effects and is evenly distributed between the level-two units (i.e., the intraclass correlation of X_{r+1} is about nil), then the regression coefficients γ_h and the standard deviations (the τ's) of the random effects will all increase by about the same factor. Correlation between X_{r+1} and other variables, or a positive intraclass correlation of X_{r+1}, may disturb this pattern to a greater or smaller extent.
This explains that the effect of minority status and the intercept variance increase when going from Model 4 to Model 1.
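The inflation factor introduced above, and its effect on the gender coefficient in the comparison of Models 2 and 3, can be checked numerically (a Python sketch; the function name is ours, and 3.29 is π²/3 rounded as in the text):

```python
import math

def inflation_factor(tau0_sq, sigma_r_sq=3.29):
    # factor sqrt((sigma_R^2 + tau_0^2) / sigma_R^2) by which the regression
    # coefficients are multiplied when a random intercept is added
    return math.sqrt(1 + tau0_sq / sigma_r_sq)

f = inflation_factor(0.514)      # 0.514 = intercept variance of Model 3
print(round(f, 2))               # 1.08
print(round(1.397 * f, 2))       # 1.50, close to the Model 3 gender effect of 1.507
```

Applying the factor to the Model 2 gender coefficient indeed approximately reproduces the Model 3 coefficient.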
The standard deviation of the random intercept increases by a larger factor than the regression coefficient of minority status, however. This might be related to an interaction between gender and minority status, and to an uneven distribution of the sexes across the schools (cf. Section 7.1).
The same can be concluded in the example of the Norwegian cohabitation data, when going from the empty model in Table 14.1 to Model 1 in Table 14.2. The level-one variables added (age and a non-linear transformation of age) led to a small increase in the intercept variance. This is explained by the argument above. That adding religion when going from Model 1 to Model 2 in Table 14.2 does not lead to a further increase in the intercept variance might be due to the fact that the composition of the regions (the level-two units) with respect to religion is uneven, i.e., religion has a positive intraclass correlation.

14.3.6 Bibliographic remarks

Random effects models for binary data have been studied by many authors. Anderson and Aitkin (1985), Stiratelli, Laird, and Ware (1984), Wong and Mason (1985), and Longford (1994, based on his earlier papers) were among those to propose estimation methods for the logistic-normal model treated in Section 14.2. Gibbons and Bock (1987), McCulloch (1994), and Ochi and Prentice (1984) proposed probit-normal models. A concise review of work up to 1990 is given by Searle et al. (1992, Chapter 10). Further references can be found in these papers and in Section 14.2.5.
A review of fixed and random coefficient models for discrete data is given by Hamerle and Ronning (1995). General multilevel models for non-normally distributed variables were considered, e.g., by Goldstein (1991). A textbook that treats various approaches to estimation and contains many further references is Davidian and Giltinan (1995).

14.4 Ordered categorical variables

Variables that have as outcomes a small number of ordered categories are quite common in the social and biomedical sciences. Examples of such variables are outcomes of questionnaire items (with outcomes, e.g., 'completely disagree', 'disagree', 'agree', 'completely agree'), a test scored by a teacher as 'fail', 'satisfactory', or 'good', etc.
This section is about multilevel models where the dependent variable is such an ordinal variable.
When the number of categories is two, the dependent variable is dichotomous and Section 14.2 applies. When the number of categories is rather large (5 or more), it may be possible to approximate the distribution by a normal distribution and apply the hierarchical linear model for continuous outcomes. The main issue in such a case is the homoscedasticity assumption: is it reasonable to assume that the variances of the random terms in the hierarchical linear model are constant? (The random terms in a random intercept model are the level-one residuals and the random intercept, R_ij and U₀j in (4.7).) To check this, it is useful to investigate the skewness of the distribution. If in some groups, or for some values of the explanatory variables, the dependent variable assumes outcomes that are very skewed toward the lower or upper end of the scale, then the homoscedasticity assumption is likely to be violated.
If the number of categories is small (3 or 4), or if it is between 5 and, say, 10, and the distribution cannot well be approximated by a normal distribution, then statistical methods for ordered categorical outcomes can be useful. For single-level data such methods are treated, e.g., in McCullagh and Nelder (1989) and Long (1997).
It is usual to assign numerical values to the ordered categories, taking into account that the values are arbitrary. To have a notation that is compatible with the dichotomous case of Section 14.2, the values for the ordered categories are defined as 0, 1, ..., c − 1, where c is the number of categories. Thus, in the four-point scale mentioned above, 'completely disagree' would get the value 0, 'disagree' would be represented by 1, 'agree' by 2, and 'completely agree' by the value 3.
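The assignment of values on an underlying continuous scale to the coded categories 0, ..., c − 1 by means of thresholds (the threshold representation used for these models) can be sketched as follows (an illustrative Python fragment; the threshold values are invented for the sketch):

```python
import bisect

# invented thresholds for c = 4 categories: theta_0 = 0 < theta_1 < theta_2
thresholds = [0.0, 1.2, 2.5]

def observed_category(y_latent):
    # number of thresholds strictly below the latent value; a latent value equal
    # to a threshold falls in the lower of the two adjacent categories
    return bisect.bisect_left(thresholds, y_latent)

print([observed_category(y) for y in (-0.3, 0.4, 2.0, 3.1)])  # [0, 1, 2, 3]
```

A latent value at or below the first threshold yields category 0, and a latent value above the last threshold yields category c − 1.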
The dependent variable for level-one unit i in level-two unit j is again denoted Y_ij, so that Y_ij now assumes values in the set {0, 1, ..., c − 1}.
A very useful model for this type of data is the multilevel ordered logistic regression model, also called the multilevel ordered logit model or the multilevel proportional odds model, and the closely related multilevel ordered probit model. These models are discussed, e.g., by Agresti and Lang (1993), Gibbons and Hedeker (1994), Hedeker and Gibbons (1994), and Goldstein (1995, Section 7.7). A three-level model was discussed by Gibbons and Hedeker (1997). They can be formulated as threshold models like in Section 14.3.2, now with c − 1 thresholds rather than one. The real line is divided by the thresholds into c intervals (of which the first and the last have infinite length), corresponding to the c ordered categories. The first threshold is θ₀ = 0; the higher thresholds are denoted θ₁, ..., θ_{c−2}. Threshold θ_k defines the boundary between the intervals corresponding to observed outcomes k and k + 1 (for k = 0, ..., c − 2). The assumed unobserved underlying continuous variable is again denoted by Ỹ, and the observed categorical variable Y is related to Ỹ by the 'measurement model' defined as

    Y = 0        if Ỹ ≤ θ₀ ,
    Y = k        if θ_{k−1} < Ỹ ≤ θ_k    (k = 1, ..., c − 2) ,    (14.23)
    Y = c − 1    if Ỹ > θ_{c−2} .

[... gap in the scanned source; the text resumes near the end of Chapter 14 ...]

... (p > 0.10). Thus, the weekdays and months as used in Model 2 are sufficient to explain the variability between the daily number of calls outside office hours.

15 Software

Almost all procedures treated in this book can be carried out by standard software for multilevel statistical models. This of course is intentional, since this book covers those parts of the theory of the multilevel model that can be readily applied in everyday research practice. This chapter briefly reviews the available software in three sections. The first one is devoted to special purpose programs, such as HLM and MLwiN, specifically designed for multilevel modeling.
The second section treats modules in general purpose software packages, such as SAS and SPSS, that allow for (some) multilevel modeling. SAS has full-fledged possibilities for multilevel modeling. SPSS and BMDP have very limited possibilities for multilevel analysis, but can help absolute beginners in this field to learn to apply the basics of multilevel modeling, mainly the techniques treated in Chapter 4: random intercept models, but no random slopes. BMDP includes, in addition, special modules for longitudinal models of the kind treated in Chapter 12. Stata has some possibilities, especially for tests of fixed effects that take into account the two-level structure. The third section, finally, mentions some specialized software programs, built for specific research purposes.
Details on the specialized multilevel software packages can be found via the links provided on the Multilevel Models Project homepage, with internet address http://www.ioe.ac.uk/multilevel/ and via its mirror sites (with identical contents) with addresses http://www.medent.umontreal.ca/multilevel/ and http://www.edfac.unimelb.edu.au/multilevel/. A review of many multilevel programs is at http://www.stat.ucla.edu/~deleeuw/software.pdf.
To keep things simple, we only use two-level examples in this chapter. Generalizations to more levels are straightforward, and the interested reader can find enough support in the manuals which accompany the packages. Commands to reproduce some of the examples in this book using HLM, MLn/MLwiN, MIXOR, and MIXPREG can be found at the web site http://stat.gamma.rug.nl/snijders/multilevel.htm.

15.1 Special software for multilevel modeling

Multilevel software packages aim at researchers who specifically seek to apply multilevel modeling techniques. Other statistical techniques are not available in these specific multilevel programs (leaving exceptions aside). Of these programs, only MLn/MLwiN currently contains facilities for data manipulation.
Each of the programs described in this section was designed by pioneers in the field of multilevel modeling.

15.1.1 HLM

HLM was written by Bryk, Raudenbush and Congdon (1996), and the theoretical background behind most applications can be found in Bryk and Raudenbush (1992). The main exception is the case of discrete dependent variables, which is covered in the program but not in the textbook. The main features of HLM are its interactive operation (although one can run the program in batch mode as well) and the fact that it is rather easy to learn. Therefore it is well suited for undergraduate courses and for postgraduate courses for beginners. The many options available make it a good tool for professional researchers as well. Information is obtainable from the web site http://www.ssicentral.com/hlm/mainhlm.htm, which also features a demo version.
Input consists of separate files for each level in the design, linked by common identifiers. In a simple two-level case, e.g., with data about students in schools, one file contains all the school data with a school identification code, while another file contains all the student data with the school identification code for each student. The input can come from system files of SPSS, SAS, SYSTAT or STATA, or may be in the form of ASCII text files.
Once data have been read and stored into a sufficient statistics file (a kind of system file), there are three ways to work with the program. One way is to run the program interactively (answering questions posed by the program). Another is to run the program in batch mode. Batch and interactive modes can also be combined. And finally, since there is a Windows version, one can make full use of the graphical interface. In each case the two-step logic of Section 6.4.1 is followed. We present the main commands for analysing a simple two-level example with two explanatory variables, X at level one and Z at level two.
The model is

    Y_ij = β₀j + β₁j x_ij + R_ij
    β₀j = γ₀₀ + γ₀₁ z_j + U₀j
    β₁j = γ₁₀ + γ₁₁ z_j + U₁j ,

which of course can also be written in one equation as

    Y_ij = γ₀₀ + γ₁₀ x_ij + γ₀₁ z_j + γ₁₁ z_j x_ij + U₀j + U₁j x_ij + R_ij .    (15.1)

This example will be used throughout this chapter, sometimes with some other explanatory variables added to the data set. In the example of the HLM commands, we present the commands at the left and our comments at the right. Of course the first thing is to invoke the program.

HLM2L data.ssm commandfile
        data.ssm contains the sufficient statistics, and commandfile (a name which should be replaced by your own file name) contains the commands.

The basic part of the command file then could be as follows.

Level1:y=intrcpt1+x+random
        the level-one model: Y is dependent, regressed on an intercept and on X, and there is a random residual;
Level2:intrcpt1=intrcpt2+z+random
        one part of the level-two model: the intercept of the first equation, β₀j, is regressed on a general intercept and on a level-two variable Z, while it also has a random component; so this model contains a random intercept;
Level2:x=intrcpt2+z+random
        the second part of the level-two model: the regression coefficient β₁j of X on Y is itself regressed on an intercept (which is the fixed part of the effect of X on Y) and on a level-two variable Z, and it has a random component, so this model also contains a random slope.

In interactive mode, however, one does not need to know the syntax, and one would simply answer questions like the following (for didactical purposes we only show the most relevant questions):

The choices are:
   For Y enter 1     For X enter 2     For GENDER enter 3
What is the outcome variable?

Which level-1 predictors do you wish to use?
The choices are:
   For X enter 2     For GENDER enter 3
Level-1 predictor? (Enter 0 to end)

Which level-2 predictors do you wish to use?
The choices are:
   For SIZE enter 1     For SECTOR enter 2
   For BUDGET enter 3   For Z enter 4
Level-2 predictor? (Enter 0 to end)

Which level-2 predictors to model INTERCPT1 ?
Level-2 predictor? (Enter 0 to end)

Which level-2 predictors to model X ?
Level-2 predictor? (Enter 0 to end)

Do you want to constrain the variances in any of the level-2 random effects to zero?

The program assumes that you are potentially interested in any cross-level interaction effects: it asks for level-two variables that may be predictors of the X-effect on the dependent variable. Therefore you do not need to create product terms before running the analysis. If you are interested, however, in an interaction effect of variables defined at the same level, then you need to calculate the product beforehand and put this new variable in the dataset. Moreover, unless you answer 'No' to the last question ('do you want to constrain the variances'), all level-one predictor variables are assumed to have random slopes. Once one has estimated the model, the HLM session is closed.
The Windows version of HLM gives the program a graphical interface. It only helps to construct the command file, but it does this in a very convenient way indeed. In the Windows interface, one does not leave the HLM session after having fitted a model.
HLM does not allow for data manipulation, but both the input and output can come from, and can be fed into, SPSS, SAS, SYSTAT, or STATA. HLM does not go beyond three levels. It can be used for practically all the analyses presented in this book. Almost all examples in this book can be reproduced using HLM version 5. Some interesting features of the program are the possibility to test model assumptions directly, e.g., by test (9.4) for level-one heteroscedasticity, and the help provided to construct contrast tests.
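To make the roles of X, Z, and the random coefficients concrete, data following the example model (15.1) can be simulated in a few lines (a Python sketch; all parameter values are invented, and the long format with a level-two identifier mirrors the data layout the multilevel packages expect):

```python
import random

random.seed(1)
g00, g10, g01, g11 = 0.5, 1.0, 0.3, -0.2    # invented fixed effects
tau0, tau1, sigma = 0.7, 0.4, 1.0           # invented sd's of U0j, U1j, and R_ij

rows = []                                   # long format: (school id, x, z, y)
for j in range(50):                         # 50 level-two units (schools)
    z = random.gauss(0, 1)                  # level-two variable: constant within a school
    u0, u1 = random.gauss(0, tau0), random.gauss(0, tau1)
    for i in range(20):                     # 20 level-one units (pupils) per school
        x = random.gauss(0, 1)
        r = random.gauss(0, sigma)
        y = g00 + g10 * x + g01 * z + g11 * z * x + u0 + u1 * x + r
        rows.append((j, x, z, y))

print(len(rows))  # 1000
```

Note that Z and the random effects U₀j and U₁j are drawn once per school and shared by all its pupils, which is exactly the dependence structure the software has to model.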
Furthermore the program routinely asks for centering of predictor variables, but the flip side of the coin is, in case one opts for group mean centering, that the group means themselves must have been calculated outside HLM if one wishes to use these as level-two predictor variables.
A special feature of the program is that it allows for statistical meta-analysis of research studies that are summarized by only an effect size estimate and its associated standard error (called in Bryk and Raudenbush, 1992, a 'V-known problem'). Other special features added to version 5 are the analysis of data where explanatory variables are measured with error (explained in Raudenbush and Sampson, 1999) and the analysis of multiply imputed data (see, e.g., Rubin, 1996).

15.1.2 MLn / MLwiN

MLn and MLwiN are the most extensive multilevel packages, written by researchers working at the Multilevel Models Project at the London Institute of Education (Rasbash and Woodhouse, 1995; Goldstein et al., 1998). Current information can be obtained from the web site http://www.ioe.ac.uk/mlwin/. MLwiN is based on the older DOS program MLn; it has a Windows 95 interface, and current methodological developments are added to MLwiN, not to MLn. Almost all examples in this book can be reproduced using MLn/MLwiN. For heteroscedastic models, the term used in the MLn/MLwiN documentation is 'complex variation'. For example, level-one heteroscedasticity is complex level-one variation.
A nice feature of both MLn and MLwiN is that the programs make use of a statistical environment (NANOSTAT) which allows for data manipulation, graphing, simple statistical computations, file manipulation (like sorting), etc. Data manipulation procedures include some handy procedures relating to the multilevel data structure. Input for MLn is one ASCII text file that contains all the data, including the level-one and level-two identifiers.
This dataset is read and put in a 'worksheet', a kind of system file. This worksheet, which can include model specifications, variable labels, and results, can be saved and used in later sessions. There are again three ways to work with the program: in an interactive command mode, in batch mode, or interactively using the graphical interface of Windows 95. The interactive and batch modes are operated from the 'command interface' within the Windows interface.
To start with the first mode: the program does not ask any questions; the user should know the commands he or she wants to feed into the program. What is interactive, however, is the answer to the command, which prompts up immediately on the screen. To show how this works we construct the same model as we did with HLM, represented in (15.1). We start in this case from the point that the data have been read, and labels have been assigned to variables.

Iden 2 'school'
        the identification for the level-two units is found in the variable labeled school;
Iden 1 'pupil'
        identifier for the level-one units;
Resp 'y'
        declare Y as the dependent variable;
Expl 'cons'
        declare the variable cons as an explanatory variable; this variable cons is equal to 1 for every level-one unit, and its use is a trick to declare the intercept;
Expl 'x'
        declare the variable X as an explanatory variable;
Expl 'z'
        declare the variable Z as an explanatory variable; whether Z is a variable defined at level one or level two does not matter for the commands or computations;
Calc c11='x'*'z'
        calculate a new variable, the product of X and Z (a cross-level interaction);
Name c11 'x*z'
        give this cross product the name 'x*z';
Expl 'x*z'
        declare this cross product as an explanatory variable;
Setv 1 'cons'
        the intercept is random at level one, which is another way to express that the model contains a random residual at level one;
Setv 2 'cons'
        the intercept is random at level two as well;
Setv 2 'x'
        the effect of X on Y is random at level two, so this is a random slope model;
Sett
        show the model settings;
Som
        show the nesting structure;
Star
        start the estimation procedure;
Fixe
        report the estimates for the fixed part;
Rand
        report the estimates for the random part;
Like
        report the deviance.

A drawback of running the program in this command mode is that the user should know beforehand what he or she is going to do and which of the (more than 200) commands will do the tricks. The manuals that go with the software, however, are written in such a form (leading users through worked examples) that one easily learns to handle the most important commands. Moreover, one can run parts of the program (like setting up the model) in batch form to save oneself the tedium of typing the same commands over and over again. The batch file then has to contain the same sequence of commands as shown above. Since one does not leave the session after having fitted a model, one can freely change between working in command mode, through the interactive Windows interface, and in batch mode.
The graphical interface in the Windows 95 version MLwiN makes it possible to build models interactively, without knowing the commands. In this case the commands are available on the screen, as are the variables to be included in the model. It is still possible to give commands or batch commands in a dialog box. For some purposes, like transformation of variables (including the calculation of group means, within-group standard deviations, etc.), it is necessary to use the command mode. MLwiN makes full use of the possibilities that Windows offers, like immediate graphical display, setting of coefficients by double-clicking them, drag-and-drop, etc.
MLn and MLwiN are very flexible software programs, also because the user can create his or her own macros (batch files), and a series of macros (batch files) is delivered with the software, e.g., for fitting models with ordered categorical outcomes and models with cross-classification. Since data manipulation can be carried out within the program itself, one can quite easily create data sets for multivariate multilevel models. The worked examples, moreover, provide the user with clear guidance on how to do specific things like creating residuals, setting up contrast tests, etc. Of the available multilevel packages, MLwiN is the most flexible one, but it may take some time to get acquainted with its features. It is an excellent tool for professional researchers and statisticians.

15.1.3 VARCL

VARCL contains two programs, VARL3 and VARL9, written by Longford (1993b). VARL9 allows one to model hierarchically structured data with up to nine levels, but is restricted to models in which all slopes are fixed, whereas VARL3 goes up to three levels and does allow for random slopes. The software is very easy to handle, being completely interactive, and one can fit multilevel models quite easily by just answering a limited series of questions.
The input for VARCL contains several sets of data: one dataset for each level (well sorted) and a 'basic information file'. The latter contains file format information, variable labels, an indication of the measurement level of variables, and an awkward part (a count of case numbers) that defines which series of cases from the level-one data file belong to one unit in the level-two data file.
The logic of VARCL is that one starts with defining a 'maximal model', which means that one defines a dependent variable, and in addition one takes from the dataset the potentially interesting predictors. If one wishes to do so, one can save this file (as a sort of system file).
After this, the model fitting can start. At this stage one should correctly know the numbering of the variables (the intercept is defined by default as the first variable by the program itself), and either by picking variables, including them all, or excluding some, the model is specified. The crucial part of the program is shown below.

THE MODEL TO BE FITTED
        VARIABLE                            NESTING
No. &   NAME      CATEGORY  IN/OUT/CONSTR   LEVEL
  4     G.MEAN                   IN           1
  5     X                        IN           1
 14     GENDER                   IN           1
 17     X.Z                      IN           1
 18     Z                        IN           2

RANDOM PART - LEVEL 2
        VARIABLE
No. &   NAME
  4     G.MEAN    1
  5     X         1  1

The last lines give the representation of the specification of the random part at level two in the form of a lower triangular matrix, where the entry '1' indicates that the corresponding element of the covariance matrix is included in the model, while the entry '0' indicates that it is not included.

ENTER:  I ... INCLUDE A VARIABLE
        E ... EXCLUDE A VARIABLE
        X ... EXIT AND LIST THE VARIABLES

In this stage one defines the fixed part of the model. The random part can be specified following the question:

ALTERATIONS ON THE LEVEL 2 : SCHOOL
ENTER:  I ... INCLUDE A VARIABLE
        E ... EXCLUDE A VARIABLE
        X ... NO (MORE) ALTERATIONS ON THIS LEVEL

Now one can define variables that should have random slopes. Although VARCL is very easy to learn, it has drawbacks in the awkwardness of the basic information file and the impossibility of data manipulation. The dataset should be well constructed (e.g., product terms reflecting interaction effects should be present) before one starts. Moreover, the only output is a text file which contains the regression equation, the standard errors, the variance components and their standard errors, and the deviance. Included in this file may also be the posterior means (empirical Bayes level-two residuals). Facilities for model checks are very restricted: level-one residuals are not available, and to use the level-two residuals one must edit the text file that contains them.
Many of the examples presented in this book could have been analysed in VARL3/VARL9, including models for binary outcomes and count data (except for the deviances for the discrete outcome models). The program is designed in a very friendly way, so that it is ideal for didactical purposes. During the last ten years, however, the program has not gone through any evolution, which implies that especially the estimation of models for discrete outcomes does not meet the latest developments.

15.1.4 MIXREG, MIXOR, MIXNO, MIXPREG

D. Hedeker has constructed several modules for multilevel modeling, focusing on special models that go beyond the basic hierarchical linear model. Windows versions of these programs are freely available, with manuals and examples, at http://www.uic.edu/~hedeker/mix.html. PowerMac and Sun/Solaris versions are available at http://www.stat.ucla.edu/~deleeuw/mixfoo.
These programs have no facilities for data manipulation and use ASCII text files for data input. The algorithms are based on numerical integration. This contrasts with the Taylor expansions which are the basis for the algorithms for discrete outcome variables used in other multilevel computer programs. The difference is discussed in Section 14.2.5. Currently, the only publicly available programs for multilevel analysis of discrete outcomes that provide genuine likelihood statistics for deviance tests are these programs, and, for dichotomous variables only, HLM version 5.
In the interpretation of the output of these programs, it should be noted that the parametrization used for the random part is not the covariance matrix but its Cholesky decomposition. This is the lower triangular matrix C with the property that CC′ = Σ, where Σ is the covariance matrix. The output does also give the estimated variance and covariance parameters, but (in the present versions) not their standard errors.
If you wish to know these standard errors, some additional calculations are necessary. If the random part contains only the random intercept, the parameter in the Cholesky decomposition is the standard deviation of the intercept. The standard error of the intercept variance can be calculated with formula (5.15). For other parameters, the relation between the standard errors is more complicated. We treat only the case of one random slope. The level-two covariance matrix then is

    T = | τ₀₀  τ₀₁ |
        | τ₀₁  τ₁₁ |

with the Cholesky decomposition denoted by

    C = | c₀₀   0  |
        | c₀₁  c₁₁ | .

The correspondence between these matrices is

    τ₀₀ = c₀₀² ,    τ₀₁ = c₀₀ c₀₁ ,    τ₁₁ = c₀₁² + c₁₁² .

Denote the standard errors of the estimated elements of C by s₀₀, s₀₁, and s₁₁, respectively, and the correlations between these estimates by r₀₀,₀₁, r₀₀,₁₁, and r₀₁,₁₁. These standard errors and correlations are given in the output of the programs. Approximate standard errors for the elements of the covariance matrix of the level-two random part then are given by

    S.E.(τ̂₀₀) ≈ 2 c₀₀ s₀₀
    S.E.(τ̂₀₁) ≈ √( c₀₁² s₀₀² + c₀₀² s₀₁² + 2 c₀₀ c₀₁ s₀₀ s₀₁ r₀₀,₀₁ )
    S.E.(τ̂₁₁) ≈ 2 √( c₀₁² s₀₁² + c₁₁² s₁₁² + 2 c₀₁ c₁₁ s₀₁ s₁₁ r₀₁,₁₁ ) .

For the second and further random slopes, the formulae for the standard errors are even more complicated. These formulae can be derived with the multivariate delta method, explained, e.g., in Bishop, Fienberg, and Holland (1975, Section 14.6.3).
MIXREG (Hedeker and Gibbons, 1996b) is a computer program for mixed-effects regression analysis with autocorrelated errors. This is suitable especially for longitudinal models (see Chapter 12). Various correlation patterns for the level-one residuals R_ij are allowed, such as autocorrelation or moving average dependence.
MIXOR (Hedeker and Gibbons, 1996a) provides estimates for multilevel models for dichotomous and ordinal outcome variables (cf. Chapter 14), allowing probit, logistic, and complementary log-log link functions. These models can include multiple random slopes.
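These delta-method formulae are easy to implement when reading Cholesky output of this kind (a Python sketch; the function and variable names are ours):

```python
import math

def cov_from_cholesky(c00, c01, c11):
    # tau_00 = c00^2, tau_01 = c00*c01, tau_11 = c01^2 + c11^2
    return c00 ** 2, c00 * c01, c01 ** 2 + c11 ** 2

def se_cov_from_cholesky(c00, c01, c11, s00, s01, s11,
                         r00_01=0.0, r01_11=0.0):
    # approximate standard errors of tau_00, tau_01, tau_11
    # (multivariate delta method applied to the Cholesky parametrization)
    se_t00 = 2 * c00 * s00
    se_t01 = math.sqrt(c01 ** 2 * s00 ** 2 + c00 ** 2 * s01 ** 2
                       + 2 * c00 * c01 * s00 * s01 * r00_01)
    se_t11 = 2 * math.sqrt(c01 ** 2 * s01 ** 2 + c11 ** 2 * s11 ** 2
                           + 2 * c01 * c11 * s01 * s11 * r01_11)
    return se_t00, se_t01, se_t11

print(cov_from_cholesky(2.0, 1.0, 1.0))                    # (4.0, 2.0, 2.0)
print(se_cov_from_cholesky(1.0, 0.0, 1.0, 0.1, 0.1, 0.1))
```

For instance, with c₀₁ = 0 the slope variance is c₁₁² and its approximate standard error reduces to 2 c₁₁ s₁₁, as the formulas above imply.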
The present version (number 2) also permits right-censoring of the ordinal outcome (useful for the analysis of multilevel grouped-time survival data), non-proportional odds (or hazards) for selected covariates, and the possibility for the random effect variance terms to vary by groups of level-one or level-two units. This allows estimation of many types of Item Response Theory models (where the variance parameters vary by the level-one items) as well as models where the random effects vary by groups of subjects (e.g., males versus females).
MIXNO implements a multilevel multinomial logistic regression model. As such, it can be used to analyse two-level categorical outcomes without an ordering. As in MIXOR (version 2), the random effect variance terms can also vary by groups of level-one or level-two units. This program has an extensive manual with various examples.
Finally, MIXPREG is a program to estimate the parameters of the multilevel Poisson regression model (treated in Section 14.5). This program also has an extensive manual with examples.

15.2 Modules in general purpose software packages

Several of the main general purpose statistical packages have incorporated modules for multilevel modeling. These usually are presented as modules for mixed models, random effects, random coefficients, or variance components. As it is, most researchers may be used to one of these packages and may feel reluctant to learn to handle the specialized software discussed above. Especially for these people the threshold to multilevel modeling may be lowered when they can stick to their own software. One should bear in mind, however, that many (maybe even most) of the things treated in this book cannot be done in these general purpose packages, with the exception of SAS.
Moreover, the algorithms used in these general purpose programs are not necessarily very efficient for the specific case of multilevel models.

15.2.1 SAS, procedure MIXED

The SAS procedure MIXED has been up and running since 1996 (Littell et al., 1996). For experienced SAS users a quick introduction to multilevel modeling using SAS can be found in Singer (1998a, 1998b). Verbeke and Molenberghs (1997) present a SAS-oriented practical introduction to mixed linear models, i.e., the hierarchical linear model. Unlike the other general purpose packages, SAS allows for fitting very complex multilevel models and calculating all corresponding statistics (like deviance tests). PROC MIXED is a procedure oriented toward general mixed linear models, and allows one to analyse practically all hierarchical linear model examples for continuous outcome variables presented in this book. The general mixed model orientation has the advantage that crossed random coefficients can be easily included, but the disadvantage that this procedure does not provide the specific efficiency for the nested random coefficients of the hierarchical linear model that is provided by dedicated multilevel programs. Also available are the macros GLIMMIX and NLINMIX, which allow for fitting the models for discrete outcome variables described in Chapter 14.

To get a basic idea of how the procedure works, we take a bit of a syntax example from Singer (1998a, p. 8) (slightly altered for didactical purposes), used to fit model (15.1).
proc mixed;
        invoke the procedure;
class school;
        the random factor, or level-two identifier, is school;
model y = x z x*z / solution;
        fit the model with y as dependent and x, z, and x × z as predictors;
        the intercept is implied by default;
random intercept x / sub=school type=un;
        both the intercept and the predictor x have random effects, so this
        is a model with random intercept and random slope; level-one units
        are nested within ('sub') level-two units identified by school;
        estimate an unrestricted ('un') level-two covariance matrix;
run;

15.2.2 SPSS, command VARCOMP

The most simple introduction to multilevel modeling for SPSS users is the module (in SPSS language: 'command') VARCOMP, which is available in the latest versions (e.g., SPSS 7 for Windows). One can get no further, however, than the random intercept model, as described in Chapter 4 of this book, with the possibility also to include crossed random effects (Chapter 11).

In analysis of variance terms, the module allows one to select from the list of variables the dependent variable and the factors (i.e., categorical explanatory variables), and to specify for each of the factors whether it has fixed or random effects. To specify a random intercept model, the random factor will be the identification code for the level-two units, e.g., schools or neighborhoods. Continuous covariates can be added to the model, but only with fixed effects.

A possible use for researchers might be to find out whether an employed two-stage sampling strategy has caused design effects (i.e., whether there is substantial variation between the level-two units, so that ignoring this leads to standard errors in one-level models being highly underestimated), even when controlling for relevant covariates. Bear in mind that VARCOMP, as is implied by the name of the command, only estimates variance components and not random slopes.

15.2.3 BMDP-V modules

BMDP provides a series of modules (BMDP-3V and following) that allow one to do a bit of multilevel modeling.
Since BMDP also has a Windows version with a graphical interface, the accessibility of the multilevel modeling procedures is quite high. Documentation is given by Dixon (1992) and Dixon and Merdian (1992). The relevant modules of BMDP allow the following kinds of analysis.

BMDP-3V allows one, like the SPSS command VARCOMP, to fit models with nested and crossed random effects and with fixed effects of covariates. This implies that the basic models of Chapters 4 and 11 can be estimated, but random slopes cannot be included.

BMDP-4V and BMDP-5V are intended for longitudinal models, as described in Chapter 12. BMDP-5V includes the possibility to fit random slope models and more complicated residual covariance matrices, like those with autocorrelated level-one residuals. Although BMDP-5V is intended for longitudinal data, its facility for random slope models can also be used more generally to estimate models with one random slope: use the variable with the random slope as the time variable.

15.2.4 Stata

Stata (StataCorp, 1997) contains some modules that permit the estimation of certain multilevel models. Module loneway ('long oneway') gives estimates for the empty model. The xt series of modules is designed for the analysis of longitudinal data (cf. Chapter 12), but can be used (like BMDP-5V) for analysing any two-level random intercept model. Command xtreg estimates the random intercept model, while xtpred calculates posterior means. Commands xtpois and xtprobit, respectively, provide estimates of the multilevel Poisson regression and multilevel probit regression models. These estimates are based on the so-called generalized estimating equations method.

A special feature of Stata is the so-called sandwich variance estimator, also called robust or Huber estimator. This estimator can be applied in many Stata modules that are not specifically intended for multilevel analysis. For statistics calculated in a single-level framework (e.g.,
estimated OLS regression coefficients), the sandwich estimator when using the keyword 'cluster' computes standard errors that are asymptotically correct under two-stage sampling. In terms of our Chapter 2, this solves many instances of 'dependence as a nuisance', although it does not help to get a grip on 'interesting dependence'.

15.3 Other multilevel software

There are other programs available for special purposes, which can be useful to supplement the software mentioned above.

15.3.1 PinT

PinT is a specialized program for calculations of Power in Two-level designs, implementing the results of Snijders and Bosker (1993). This program can be used for a priori estimation of standard errors of fixed coefficients. This is useful in the design phase of a multilevel study, as discussed in Chapter 10 of this book. Being shareware, it can be downloaded with the manual from http://stat.gamma.rug.nl/snijders/multilevel.htm.

15.3.2 Mplus

A program with very general facilities for covariance structure analysis is Mplus (Muthén and Muthén, 1998). Information about this program is available at http://www.StatModel.com. This program allows the analysis of univariate and multivariate two-level data not only with the hierarchical linear model but also with path analysis, factor analysis, and other structural equation models. Introductions to this type of model are given by Muthén (1994) and Kaplan and Elliott (1997).

15.3.3 MLA

The MLA program (see Meijer et al., 1995) can calculate estimates for two-level models by resampling methods. Various bootstrap implementations are included (based on random drawing from the data set or on random draws from an estimated population distribution) and also implementations of the jackknife (based on deletion of cases). It can be downloaded from http://www.fsw.leidenuniv.nl/www/w3_ment/medewerkers/busing/mla.htm.

15.3.4 BUGS

A special program which uses the Gibbs sampler is BUGS (Gilks et al., 1996).
Gibbs sampling is a simulation-based procedure for calculating Bayesian estimates. This program can be used to estimate a large variety of models, including hierarchical linear models, possibly in combination with models for structural equations and measurement error. However, it requires balanced data (i.e., groups of equal sizes). It is mainly used as a research tool for statisticians, and available with manuals from http://www.mrc-bsu.cam.ac.uk/bugs/.

References

Agresti, A. (1990) Categorical Data Analysis. New York: Wiley.
Agresti, A. and Lang, J. (1993) 'A proportional odds model with subject-specific effects for repeated ordered categorical responses'. Biometrika, 80, 527-534.
Aitkin, M., Anderson, D., and Hinde, J. (1981) 'Statistical modelling of data on teaching styles'. Journal of the Royal Statistical Society, Ser. A, 144, 419-461.
Aitkin, M. and Longford, N. (1986) 'Statistical modelling issues in school effectiveness studies' (with discussion). Journal of the Royal Statistical Society, Ser. A, 149, 1-43.
Alker, H.R. (1969) 'A typology of ecological fallacies'. In: Dogan and Rokkan (eds), pp. 69-86.
Anderson, D. and Aitkin, M. (1985) 'Variance components models with binary response: interviewer variability'. Journal of the Royal Statistical Society, Ser. B, 47, 203-210.
Arminger, G., Clogg, C.C., and Sobel, M.E. (1995) Handbook of Statistical Modeling for the Social and Behavioral Sciences. New York: Plenum Press.
Atkinson, A.C. (1985) Plots, Transformations, and Regression. Oxford: Clarendon Press.
Baltagi, B.H. (1995) Econometric Analysis of Panel Data. Chichester: Wiley.
Berkhof, J. and Snijders, T.A.B. (1998) 'Tests for a random coefficient in multilevel models'. Submitted for publication.
Bishop, Y.M.M., Fienberg, S., and Holland, P.W. (1975) Discrete Multivariate Analysis: Theory and Practice. Cambridge, Mass.: The MIT Press.
Box, G.E.P., Hunter, W.G., and Hunter, J.S. (1978) Statistics for Experimenters. New York: Wiley.
Brekelmans, M. and Créton, H.
(1993) 'Interpersonal teacher behavior throughout the career'. In: Wubbels, Th. and Levy, J. (eds), Do You Know What You Look Like? Interpersonal Relationships in Education. London: The Falmer Press, pp. 46-55.
Breslow, N.E. and Clayton, D.G. (1993) 'Approximate inference in generalized linear mixed models'. Journal of the American Statistical Association, 88, 9-25.
Bryk, A.S. and Raudenbush, S.W. (1992) Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage Publications.
Bryk, A.S., Raudenbush, S.W., and Congdon, R.T. (1996) HLM: Hierarchical Linear and Nonlinear Modeling with the HLM/2L and HLM/3L Programs. Chicago: Scientific Software International.
Burstein, L. (1980) 'The analysis of multilevel data in educational research and evaluation'. Review of Research in Education, 8, 158-233.
Burstein, L., Linn, R.L., and Capell, F.J. (1978) 'Analyzing multilevel data in the presence of heterogeneous within-class regressions'. Journal of Educational Statistics, 3, 347-383.
Cameron, A.C. and Trivedi, P.K. (1998) Regression Analysis of Count Data. Cambridge: Cambridge University Press.
Chow, G.C. (1984) 'Random and changing coefficient models'. In: Z. Griliches and M.D. Intriligator (eds), Handbook of Econometrics, Volume 2. Amsterdam: North-Holland.
Cochran, W.G. (1977) Sampling Techniques. 3rd edn. New York: Wiley.
Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences. 2nd edn. Hillsdale, NJ: Erlbaum.
Cohen, J. (1992) 'A power primer'. Psychological Bulletin, 112, 155-159.
Cohen, M. (1998) 'Determining sample sizes for surveys with data analyzed by hierarchical linear models'. Journal of Official Statistics, 14, 267-275.
Coleman, J.S. (1990) Foundations of Social Theory. Cambridge: Harvard University Press.
Commenges, D. and Jacqmin, H. (1994) 'The intraclass correlation coefficient: distribution-free definition and test'. Biometrics, 50, 517-526.
Commenges, D., Letenneur, L., Jacqmin, H., Moreau, Th., and Dartigues, J.-F. (1994) 'Test of homogeneity of binary data with explanatory variables'. Biometrics, 50, 613-620.
Cook, R.D. and Weisberg, S. (1982) Residuals and Influence in Regression. New York and London: Chapman & Hall.
Cook, R.D. and Weisberg, S. (1994) An Introduction to Regression Graphics. New York: Wiley.
Cook, T.D. and Campbell, D.T. (1979) Quasi-experimentation: Design & Analysis Issues for Field Settings. Boston: Houghton Mifflin Company.
Crowder, M.J. and Hand, D.J. (1990) Analysis of Repeated Measures. Monographs on Statistics and Applied Probability 41. London: Chapman & Hall.
Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated Measurement Data. London: Chapman & Hall.
Davis, J.A., Spaeth, J.L., and Huson, C. (1961) 'A technique for analyzing the effects of group composition'. American Sociological Review, 26, 215-225.
De Leeuw, J. and Kreft, I. (1986) 'Random coefficient models for multilevel analysis'. Journal of Educational Statistics, 11 (1), 57-85.
De Weerth, C. (1998) Emotion-related Behavior in Infants. Ph.D. thesis, University of Groningen, The Netherlands.
Dekkers, H.P.J.M., Bosker, R.J., and Driessen, G.W.J.M. (1998) 'Complex inequalities of educational opportunities'. Submitted for publication.
Diggle, P.J., Liang, K.-Y., and Zeger, S.L. (1994) Analysis of Longitudinal Data. Oxford: Clarendon Press.
Dixon, W.J. (1992) BMDP Statistical Software Manual. Volume 2. Berkeley, Los Angeles: University of California Press.
Dixon, W.J. and Merdian, K. (1992) ANOVA and Regression with BMDP. Los Angeles: Dixon Statistical Associates.
Dogan, M. and Rokkan, S. (eds) (1969) Quantitative Ecological Analysis in the Social Sciences. Cambridge, Mass.: The M.I.T. Press.
Donner, A. (1986) 'A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model'. International Statistical Review, 54, 67-82.
Duncan, O.D., Cuzzort, R.P., and Duncan, B. (1961) Statistical
Geography: Problems in Analyzing Areal Data. Glencoe, IL: Free Press.
Efron, B. and Morris, C.N. (1975) 'Data analysis using Stein's estimator and its generalizations'. Journal of the American Statistical Association, 70, 311-319.
Eisenhart, C. (1947) 'The assumptions underlying the analysis of variance'. Biometrics, 3, 1-21.
Engel, J. and Keen, A. (1994) 'A simple approach for the analysis of generalized linear mixed models'. Statistica Neerlandica, 48, 1-22.
Fisher, R.A. (1932) Statistical Methods for Research Workers. 4th edn. Edinburgh: Oliver & Boyd.
Gibbons, R.D. and Bock, R.D. (1987) 'Trends in correlated proportions'. Psychometrika, 52, 113-124.
Gibbons, R.D. and Hedeker, D. (1994) 'Application of random-effects probit regression models'. Journal of Consulting and Clinical Psychology, 62, 285-296.
Gibbons, R.D. and Hedeker, D. (1997) 'Random effects probit and logistic regression models for three-level data'. Biometrics, 53, 1527-1537.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (1996) Markov Chain Monte Carlo in Practice. London: Chapman & Hall.
Glass, G.V. and Stanley, J.C. (1970) Statistical Methods in Education and Psychology. Englewood Cliffs, NJ: Prentice Hall.
Goldstein, H. (1986) 'Multilevel mixed linear model analysis using iterative generalized least squares'. Biometrika, 73, 43-56.
Goldstein, H. (1987) 'Multilevel covariance component models'. Biometrika, 74, 430-431.
Goldstein, H. (1991) 'Nonlinear multilevel models, with an application to discrete response data'. Biometrika, 78, 45-51.
Goldstein, H. (1995) Multilevel Statistical Models. 2nd edn. London: Edward Arnold. In electronic form at http://www.arnoldpublishers.com/support/goldstein.htm.
Goldstein, H. and Healy, M.J.R. (1995) 'The graphical presentation of a collection of means'. Journal of the Royal Statistical Society, Ser. A, 158, 175-177.
Goldstein, H., Healy, M.J.R., and Rasbash, J. (1994) 'Multilevel time series models with applications to repeated measures data'. Statistics in Medicine, 13, 1643-1655.
Goldstein, H. and Rasbash, J.
(1996) 'Improved approximations for multilevel models with binary responses'. Journal of the Royal Statistical Society, Ser. A, 159, 505-513.
Goldstein, H., Rasbash, J., Plewis, I., Draper, D., Browne, W., Yang, M., Woodhouse, G., and Healy, M. (1998) A User's Guide to MLwiN. London: Multilevel Models Project, Institute of Education, University of London.
Gouriéroux, C. and Monfort, A. (1996) Simulation-based Econometric Methods. Oxford: Oxford University Press.
Guldemond, H. (1994) Van de Kikker en de Vijver ('About the Frog and the Pond'). Ph.D. thesis, University of Amsterdam.
Hagle, T.M. and Mitchell II, G.E. (1992) 'Goodness-of-fit measures for probit and logit'. American Journal of Political Science, 36, 762-784.
Haldane, J.B.S. (1940) 'The mean and variance of χ², when used as a test of homogeneity, when expectations are small'. Biometrika, 31, 346-355.
Hamerle, A. and Ronning, G. (1995) 'Panel analysis for qualitative variables'. In: Arminger, Clogg, and Sobel (1995), pp. 401-451.
Hand, D. and Crowder, M. (1996) Practical Longitudinal Data Analysis. London: Chapman & Hall.
Hausman, J.A. and Taylor, W.E. (1981) 'Panel data and unobservable individual effects'. Econometrica, 49, 1377-1398.
Hays, W.L. (1988) Statistics. 4th edn. New York: Holt, Rinehart and Winston.
Hedeker, D. and Gibbons, R.D. (1994) 'A random-effects ordinal regression model for multilevel analysis'. Biometrics, 50, 933-944.
Hedeker, D. and Gibbons, R.D. (1996a) 'MIXOR: a computer program for mixed-effects ordinal regression analysis'. Computer Methods and Programs in Biomedicine, 49, 157-176.
Hedeker, D. and Gibbons, R.D. (1996b) 'MIXREG: a computer program for mixed-effects regression analysis with autocorrelated errors'. Computer Methods and Programs in Biomedicine, 49, 229-252.
Hedeker, D., Gibbons, R.D., and Waternaux, C.
(1999) 'Sample size estimation for longitudinal designs with attrition: comparing time-related contrasts between two groups'. Journal of Educational and Behavioral Statistics, 24, 70-93.
Hedges, L.V. (1992) 'Meta-analysis'. Journal of Educational Statistics, 17, 279-296.
Hedges, L.V. and Olkin, I. (1985) Statistical Methods for Meta-analysis. New York: Academic Press.
Heyl, E. (1996) Het Docentennetwerk. Structuur en Invloed van Collegiale Contacten Binnen Scholen. Ph.D. thesis, University of Twente, The Netherlands.
Hilden-Minton, J.A. (1995) Multilevel Diagnostics for Mixed and Hierarchical Linear Models. Ph.D. dissertation, Department of Mathematics, University of California, Los Angeles.
Hill, P.W. and Goldstein, H. (1998) 'Multilevel modeling of educational data with cross-classification and missing identification for units'. Journal of Educational and Behavioral Statistics, 23, 117-128.
Hodges, J.S. (1998) 'Some algebra and geometry for hierarchical linear models, applied to diagnostics'. Journal of the Royal Statistical Society, Ser. B, 60, 497-536.
Holland, P.W. and Leinhardt, S. (1981) 'An exponential family of probability distributions for directed graphs' (with discussion). Journal of the American Statistical Association, 76, 33-65.
Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: Wiley.
Hox, J.J. (1994) Applied Multilevel Analysis. Amsterdam: TT-Publikaties. Available in electronic form at http://www.ioe.ac.uk/multilevel/amaboek.pdf.
Hsiao, C. (1995) 'Panel analysis for metric data'. In: Arminger, Clogg, and Sobel (1995), pp. 361-400.
Hüttner, H.J.M. (1981) 'Contextuele analyse' (Contextual analysis). In: Albinski, M. (ed.), Onderzoekstypen in de Sociologie, pp. 262-288. Assen: Van Gorcum.
Hüttner, H.J.M. and van den Eeden, P. (1995) The Multilevel Design: A Guide with an Annotated Bibliography, 1980-1993. Westport, Conn.: Greenwood Press.
International Social Survey Programme (1994) Family and Changing Gender Roles II (computer file).
Cologne: Zentralarchiv (ZA number 2620).
Kaplan, D. and Elliott, P.R. (1997) 'A didactic example of multilevel structural equation modeling applicable to the study of organizations'. Structural Equation Modeling, 4, 1-24.
Kasim, R.M. and Raudenbush, S.W. (1998) 'Application of Gibbs sampling to nested variance components models with heterogeneous within-group variance'. Journal of Educational and Behavioral Statistics, 23, 93-116.
Kelley, T.L. (1927) The Interpretation of Educational Measurements. New York: World Books.
Knapp, T.R. (1977) 'The unit-of-analysis problem in applications of simple correlation analysis to educational research'. Journal of Educational Statistics, 2, 171-186.
Kreft, I.G.G., De Leeuw, J., and Aiken, L.S. (1995) 'The effect of different forms of centering in hierarchical linear models'. Multivariate Behavioral Research, 30, 1-21.
Kreft, I.G.G. and De Leeuw, J. (1998) Introducing Multilevel Modeling. London: Sage Publications.
Kuk, A.Y.C. (1995) 'Asymptotically unbiased estimation in generalized linear models with random effects'. Journal of the Royal Statistical Society, Ser. B, 57, 395-407.
Laird, N.M. and Ware, J.H. (1982) 'Random-effects models for longitudinal data'. Biometrics, 38, 963-974.
Langford, I.H. and Lewis, T. (1998) 'Outliers in multilevel data'. Journal of the Royal Statistical Society, Ser. A, 161, 121-160.
Lesaffre, E. and Verbeke, G. (1998) 'Local influence in linear mixed models'. Biometrics, 54, 570-582.
Lin, X. (1997) 'Variance component testing in generalised linear models with random effects'. Biometrika, 84, 309-326.
Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996) SAS System for Mixed Models. Cary, NC: SAS Institute.
Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. New York: Wiley.
Long, J.S. (1997) Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Longford, N.T.
(1987) 'A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects'. Biometrika, 74 (4), 817-827.
Longford, N.T. (1993a) Random Coefficient Models. New York: Oxford University Press.
Longford, N.T. (1993b) VARCL: Software for Variance Component Analysis of Data with Nested Random Effects (Maximum Likelihood). Manual. Groningen: ProGAMMA.
Longford, N.T. (1994) 'Logistic regression with random coefficients'. Computational Statistics and Data Analysis, 17, 1-15.
Longford, N.T. (1995) 'Random coefficient models'. In: Arminger, Clogg, and Sobel (1995), pp. 519-577.
Lord, F.M. and Novick, M.R. (1968) Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley.
Maas, C.J.M. and Snijders, T.A.B. (1999) 'The multilevel approach to repeated measures with missing data'. Submitted for publication.
Maddala, G.S. (1971) 'The use of variance component models in pooling cross section and time series data'. Econometrica, 39, 341-358.
Mason, W.M. and Fienberg, S.E. (eds) (1985) Cohort Analysis in Social Research: Beyond the Identification Problem. New York: Springer.
Mason, W.M., Wong, G.M., and Entwisle, B. (1983) 'Contextual analysis through the multilevel linear model'. In: Leinhardt, S. (ed.), Sociological Methodology 1983-1984, pp. 72-103. San Francisco: Jossey-Bass.
Maxwell, S.E. and Delaney, H.D. (1990) Designing Experiments and Analyzing Data. Belmont: Wadsworth, Inc.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd edn. London: Chapman & Hall.
McCulloch, C.E. (1994) 'Maximum likelihood variance components estimation for binary data'. Journal of the American Statistical Association, 89, 330-335.
McCulloch, C.E. (1997) 'Maximum likelihood algorithms for generalized linear mixed models'. Journal of the American Statistical Association, 92, 162-170.
McKelvey, R.D. and Zavoina, W. (1975) 'A statistical model for the analysis of ordinal level dependent variables'.
Journal of Mathematical Sociology, 4, 103-120.
Mealli, F. and Rampichini, C. (1999) 'Estimating binary multilevel models through indirect inference'. Computational Statistics and Data Analysis, 29, 313-324.
Meijer, E., Van Der Leeden, R., and Busing, F.M.T.A. (1995) 'Implementing the bootstrap for multilevel models'. Multilevel Modelling Newsletter, 7 (2), 7-11.
Miller, J.J. (1977) 'Asymptotic properties of maximum likelihood estimates in the mixed model of the analysis of variance'. The Annals of Statistics, 5, 746-762.
Moerbeek, M., van Breukelen, G.J.P., and Berger, M.P.F. (1997) 'A comparison of the mixed effect, the fixed effect, and the data aggregation model for multilevel data'. Submitted for publication.
Mok, M. (1995) 'Sample size requirements for 2-level designs in educational research'. Multilevel Modelling Newsletter, 7 (2), 11-15.
Mosteller, F. and Tukey, J.W. (1977) Data Analysis and Regression. Reading, Mass.: Addison-Wesley.
Muthén, B.O. (1994) 'Multilevel covariance structure analysis'. Sociological Methods & Research, 22, 376-398.
Muthén, L.K. and Muthén, B.O. (1998) Mplus: The Comprehensive Modeling Program for Applied Researchers. User's Guide. Los Angeles: Muthén and Muthén.
Neuhaus, J.M., Kalbfleisch, J.D., and Hauck, W.W. (1991) 'A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data'. International Statistical Review, 59, 25-35.
Neuhaus, J.M. (1992) 'Statistical methods for longitudinal and clustered designs with binary responses'. Statistical Methods in Medical Research, 1, 249-273.
Ochi, Y. and Prentice, R.L. (1984) 'Likelihood inference in a correlated probit regression model'. Biometrika, 71, 531-543.
Opdenakker, M.C. and Van Damme, J. (1997) 'Centreren in multi-level analyse: implicaties van twee centreringsmethoden voor het bestuderen van schooleffectiviteit'. Tijdschrift voor Onderwijsresearch, 22, 264-290.
Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research. 2nd edn.
New York: Holt, Rinehart and Winston.
Pregibon, D. (1981) 'Logistic regression diagnostics'. The Annals of Statistics, 9, 705-724.
Press, S.J. (1989) Bayesian Statistics. New York: Wiley.
Rasbash, J. and Goldstein, H. (1994) 'Efficient analysis of mixed hierarchical and cross-classified random structures using a multilevel model'. Journal of Educational and Behavioral Statistics, 19, 337-350.
Rasbash, J. and Woodhouse, G. (1995) MLn: Command Reference. London: Multilevel Models Project, Institute of Education, University of London.
Raudenbush, S.W. (1993) 'A crossed random effects model for unbalanced data with applications in cross-sectional and longitudinal research'. Journal of Educational Statistics, 18, 321-349.
Raudenbush, S.W. (1995) 'Hierarchical linear models to study the effects of social context on development'. In: Gottman, J.M. (ed.), The Analysis of Change, pp. 165-201. Mahwah, NJ: Lawrence Erlbaum Associates.
Raudenbush, S.W. (1997) 'Statistical analysis and optimal design for cluster randomized trials'. Psychological Methods, 2, 173-185.
Raudenbush, S.W. and Bryk, A.S. (1986) 'A hierarchical model for studying school effects'. Sociology of Education, 59, 1-17.
Raudenbush, S.W. and Bryk, A.S. (1987) 'Examining correlates of diversity'. Journal of Educational Statistics, 12, 241-269.
Raudenbush, S.W. and Chan, W.-S. (1993) 'Application of a hierarchical linear model to the study of adolescent deviance in an overlapping cohort design'. Journal of Consulting and Clinical Psychology, 61, 941-951.
Raudenbush, S.W. and Sampson, R. (1999) 'Assessing direct and indirect effects in multilevel designs with latent variables'. Sociological Methods and Research, to appear.
Raudenbush, S.W., Yang, M.-L., and Yosef, M. (1999) 'Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation'. Manuscript under review.
Rekers-Mombarg, L.T.M., Cole, T.J., Massa, G.G., and Wit, J.M.
(1997) 'Longitudinal analysis of growth in children with idiopathic short stature'. Annals of Human Biology, 24, 569-583.
Richardson, A.M. (1997) 'Bounded influence estimation in the mixed linear model'. Journal of the American Statistical Association, 92, 154-161.
Robinson, W.S. (1950) 'Ecological correlations and the behavior of individuals'. American Sociological Review, 15, 351-357.
Rodríguez, G. and Goldman, N. (1995) 'An assessment of estimation procedures for multilevel models with binary responses'. Journal of the Royal Statistical Society, Ser. A, 158, 73-89.
Rosenthal, R. (1991) Meta-analytic Procedures for Social Research (rev. edn). Newbury Park, CA: Sage Publications.
Rubin, D.B. (1996) 'Multiple imputation after 18+ years'. Journal of the American Statistical Association, 91, 473-489.
Ryan, T.P. (1997) Modern Regression Methods. New York: Wiley.
Särndal, C.-E., Swensson, B., and Wretman, J. (1992) Model Assisted Survey Sampling. New York: Springer.
Searle, S.R. (1956) 'Matrix methods in components of variance and covariance analysis'. Annals of Mathematical Statistics, 27, 737-748.
Searle, S.R., Casella, G., and McCulloch, C.E. (1992) Variance Components. New York: Wiley.
Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression. New York: Wiley.
Self, S.G. and Liang, K.-Y. (1987) 'Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions'. Journal of the American Statistical Association, 82, 605-610.
Seltzer, M.H. (1993) 'Sensitivity analysis for fixed effects in the hierarchical model: a Gibbs sampling approach'. Journal of Educational Statistics, 18, 207-235.
Seltzer, M.H., Wong, W.H., and Bryk, A.S. (1996) 'Bayesian analysis in applications of hierarchical models: issues and methods'. Journal of Educational and Behavioral Statistics, 21, 131-167.
Shavelson, R.J. and Webb, N.M. (1991) Generalizability Theory: A Primer. Newbury Park, CA: Sage Publications.
Singer, J.D.
(1998a) 'Fitting multilevel models using SAS PROC MIXED'. Multilevel Modelling Newsletter, 10 (2), 5-8.
Singer, J.D. (1998b) 'Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models'. Journal of Educational and Behavioral Statistics, 23, 323-355.
Snijders, T.A.B. (1996) 'Analysis of longitudinal data using the hierarchical linear model'. Quality & Quantity, 30, 405-426.
Snijders, T.A.B. and Bosker, R.J. (1993) 'Standard errors and sample sizes for two-level research'. Journal of Educational Statistics, 18, 237-259.
Snijders, T.A.B. and Bosker, R.J. (1994) 'Modeled variance in two-level models'. Sociological Methods & Research, 22, 342-363.
Snijders, T.A.B. and Kenny, D.A. (1999) 'Multilevel models for relational data'. Personal Relationships (in press).
Snijders, T.A.B. and Maas, C.J.M. (1996) 'Using MLn for repeated measures with missing data'. Multilevel Modelling Newsletter, 8 (2), 7-10.
Snijders, T.A.B., Spreen, M., and Zwaagstra, R. (1994) 'The use of multilevel modeling for analysing personal networks: networks of cocaine users in an urban area'. Journal of Quantitative Anthropology, 5, 85-105.
SPSS Inc. (1997) SPSS Advanced Statistics 7.5. Chicago, IL: SPSS Inc.
StataCorp (1997) Stata Statistical Software: Release 5.0. College Station, TX: Stata Corporation.
Stevens, J. (1996) Applied Multivariate Statistics for the Social Sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Stiratelli, R., Laird, N.M., and Ware, J.H. (1984) 'Random-effects models for serial observations with binary response'. Biometrics, 40, 961-971.
Swamy, P.A.V.B. (1971) Statistical Inference in Random Coefficient Regression Models. New York: Springer.
Swamy, P.A.V.B. (1973) 'Criteria, constraints, and multicollinearity in random coefficient regression models'. Annals of Economic and Social Measurement, 2, 429-450.
Tacq, J. (1986) Van Multiniveau Probleem naar Multiniveau Analyse. Rotterdam: Department of Research Methods and Techniques, Erasmus University.
Van Der Werf, G.Th., Smith, R.J.A., Stewart, R.E., and Meyboom-de Jong, B. (1998) Spiegel op de Huisarts. Over Registratie van Ziekte, Medicatie en Verwijzingen in de Geautomatiseerde Huisartspraktijk.
Van Duijn, M.A.J., Van Busschbach, J.T., and Snijders, T.A.B. (1999) 'Multilevel analysis of personal networks as dependent variables'. Social Networks (in press).
Veall, M.R. and Zimmermann, K.F. (1992) 'Pseudo-R²'s in the ordinal probit model'. Journal of Mathematical Sociology, 16, 333-342.
Verbeke, G. and Lesaffre, E. (1997) 'The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data'. Computational Statistics and Data Analysis, 23, 541-556.
Verbeke, G. and Molenberghs, G. (1997) Linear Mixed Models in Practice: A SAS-oriented Approach. Lecture Notes in Statistics, 126. New York: Springer.
Vermeulen, C.J. and Bosker, R.J. (1992) De Omvang en Gevolgen van Deeltijdarbeid en Volledige Inzetbaarheid in het Basisonderwijs. Enschede: University of Twente.
Völker, B.G.M. (1996) Should Auld Acquaintance Be Forgot? Institutions of Communism, the Transition to Capitalism and Personal Networks: The Case of East Germany. Amsterdam: Thesis Publishers.
Völker, B. and Flap, H. (1997) 'The comrades' belief: intended and unintended consequences of communism for neighbourhood relations in the former GDR'. European Sociological Review, 13, 241-265.
Wasserman, S. and Faust, K. (1994) Social Network Analysis: Methods and Applications. New York and Cambridge: Cambridge University Press.
Waternaux, C., Laird, N.M., and Ware, J.H. (1989) 'Methods for analysis of longitudinal data: blood-lead concentrations and cognitive development'. Journal of the American Statistical Association, 84, 33-41.
Windmeijer, F.A.G. (1995) 'Goodness-of-fit measures in binary choice models'. Econometric Reviews, 14, 101-116.
Wittek, R. and Wielers, R. (1998) 'Gossip in organizations'. Computational and Mathematical Organization Theory, 4, 189-204.
Wong, G.Y. and Mason, W.M.
(1985) 'The hierarchical logistic regression model for multilevel analysis'. Journal of the American Statistical Association, 80, 513-524.
Yang, M., Rasbash, J., and Goldstein, H. (1998) MLwiN Macros for Advanced Multilevel Modelling. London: Multilevel Models Project, Institute of Education, University of London.
Zeger, S.L. and Karim, M.R. (1991) 'Generalized linear models with random effects: a Gibbs sampling approach'. Journal of the American Statistical Association, 86, 79-86.

Index

aggregation, 11, 13-15, 26, 52, 124
  of dichotomous data, 219
algorithms, 57, 82, 218-219
analysis of covariance, 22, 29, 42, 44, 91, 122
analysis of variance, 17, 21, 45, 91
ANCOVA, see analysis of covariance
ANOVA, see analysis of variance
assumptions, 98, 120, 121
  of hierarchical linear model, 68, 120-121
  of random intercept model, 47, 51
autocorrelated residuals, 199
Bayesian statistics, 59, 251
between-group correlation, 31-35
between-group covariance matrix, 203
between-group regression, 1, 14, 27-31, 52-56, 80-81, 86, 87, 122
  in three-level model, 65
between-group variance, 17, 19, 20, 39
  for dichotomous data, 210, 214
between-subjects variables, 169
BMDP, 199, 250
Bonferroni correction, 137, 197
bootstrap, 218, 251
budget constraint, 146, 150-152, 154
BUGS, 251
categorical variable, 88, 89, 115, 248
  ordered, see ordered categorical variable
centering, 74, 80, 105, 123, 242
changing covariates, 195
chi-squared test, 91, 209, 210
Cholesky decomposition, 247
combination of independent tests, 36
complete data vector, 170, 200
components of variance, 105-109
compound symmetry model, 168, 173, 174, 179
confidence interval, 61
contextual analysis, 1-2
contextual effects, 53, 122
contextual variable, 122
contingency table, 209
Cook's distance, 134
correlates of diversity, 117
correlation ratio, 30
count data, 207, 234-238, 248
covariance matrix, 69, 170
cross-level inference, 16
cross-level interaction, 11, 15, 40, 52, 73-76, 96, 97, 122, 148, 169, 193
cross-validation, 117, 121, 130
crossed random effects, 155-165
  correlated, 160-165
  in three-level model, 159-160
crossed random slopes, 160-161
degrees of freedom, 56, 86, 89
dependent variable, 30
design effect, 16, 22-24, 142
design of multilevel studies, 140-154
design-based inference, 2
deviance, 88, 214, 218, 220, 247
deviance test, 56, 82, 88-91, 220, 247
deviation score, 54, 66, 87, 100, 122
dichotomous variables, 207-229, 248
disaggregation, 15-16
dummy variables, 88, 89, 111, 115
  for cross-classified models, 156, 159, 161
  for longitudinal models, 168, 174
  for multivariate models, 202
ecological fallacy, 1, 14, 16, 56
effect size, 50, 99-105, 141-142
effective sample size, see design effect
EM algorithm, 57, 82
emergent proposition, 11
empirical Bayes estimation, 58-59, 61, 82
empirical Bayes residuals, 132
empty model, 17, 45-47, 58
  for dichotomous data, 209, 213-215
  for longitudinal data, 168, 182
  multivariate, 176, 203
estimation methods, 56-58, 82-89, 218-219
exchangeability, 42, 68
expected value, 5
explained variance, 99-105, 123
  at level one, 102
  at level two, 103
  for discrete dependent variable, 225, 231
  in longitudinal models, 179-180
  in random slope models, 104
  in three-level models, 104
explanatory variables, 39, 47, 51, 67, 75, 79, 120, 169, 182
F-test, 21, 22, 91
Fisher scoring, 57, 82, 123, 135
Fisher's combination of p-values, 36
fixed coefficients, 43
fixed occasion designs, 167-180
fixed or random effects, 41-44
fixed part, 51, 54, 68, 79, 121, 124-125
frequency data, 234
fully multivariate model, 174, 179, 200
generalizability theory, 24, 155
generalized linear models, 208
Gibbs sampling, 139, 218, 251
Greek letters, 5
group size, 52, 57, 151, 152
groups, 17, 38, 67
Hausman specification test, 54, 87
heteroscedasticity, 68-69, 110-119, 172, 248
  at level one, 110-118, 126-128, 198
  at level two, 119
  test of level-one, 126-128
hierarchical generalized linear models, 207-238
hierarchical linear model, 2, 38, 45, 67-85, 108
  assumptions, 68, 120-121
hierarchical model, 92
history of multilevel analysis, 1-2
HLM, 94, 105, 110, 199, 218, 220, 232, 235, 240
homoscedasticity, 10
IGLS, see iterated generalized least squares
incomplete data, 175, 177
independent variable, 39
influence, 128
influence diagnostics, 134-139
intercept, 27, 40, 41
  random, 42, 45
intercept variance, 46-48, 154
intercept-slope covariance, 69, 71, 93
intercepts as outcomes, 1, 73
intraclass correlation, 16-35, 46, 48, 91, 94, 145, 151
  for binary data, 209, 224-226
  for ordered categorical data, 231
  for three-level model, 65
  residual, see residual intraclass correlation
item response theory, 248
iterated generalized least squares, 57, 82, 123, 135
jackknife, 251
Kelley estimator, 59
latent variables, 58
level, 6, 8
  dummy, 157
  pseudo, 157
level-one residuals, see residuals, level-one
level-one unit, 8, 9, 16, 25, 38, 63, 67, 145
level-one variance, 154
level-two random coefficients, 68, 121
level-two residuals, see residuals, level-two
level-two unit, 8, 9, 16, 25, 38, 63, 67, 145
leverage, 135-139
likelihood ratio test, see deviance test
link function, 212
log odds, 212
logistic distribution, 223
logistic function, 213
logistic regression, 128, 208, 212, 215
logit function, 212
longitudinal data, 2, 9, 166-198, 200, 248
macro-micro relations, 9, 11, 13
macro-level propositions, 10
macro-level relations, 26
macro-unit, see level-two unit
marginal quasi-likelihood, 218
maximum likelihood, 21, 34, 47, 56, 82, 88, 123
meta-analysis, 35, 242
micro-macro relation, 11
micro-level propositions, 10, 13
micro-level relations, 14, 26
micro-unit, see level-one unit
missing data, 52-53, 175
misspecification, 120, 123, 125, 129
mixed model, 1, 38-85
MIXNO, 248
MIXOR, 218, 220-222, 232, 248
MIXPREG, 235, 248
MIXREG, 199, 248
ML, see maximum likelihood
MLA, 154, 218, 251
MLn, see MLwiN
MLwiN, 110, 154, 157, 199, 218, 221, 232, 235, 243
model specification, 52-56, 69-70, 71, 80-82, 91-98, 120-139
  of three-level model, 83
model-based inference, 2, 3, 45
Mplus, 251
multilevel logistic regression, 208-229
Multilevel Models Project, 2, 239
multilevel ordered logistic regression model, 128, 230-234
multilevel ordered probit model, 230, 231
multilevel Poisson regression, 234-238, 248, 250
multilevel probit regression, 224, 225, 250
multilevel proportional odds model, 230
multilevel structural equation models, 251
multinomial logistic regression, 248
multiple correlation coefficient, 99
multiple regression, 39
multivariate analysis, 201
multivariate empty model, 203
multivariate multilevel model, 200-206
multivariate regression analysis, 178
multivariate residual, 136
nested data, 3, 6-9
non-linear models, 186, 207-238
normal probability plot, 130, 132, 133
notational conventions, 5
null hypothesis, 36, 86, 88-90
numerical integration, 218, 247
observed between-group correlation, 31
observed variance, 17-19
odds, 211
offset, 236
OLS, 22, 41, 43, 48, 50, 58, 91, 94, 101, 126, 129
one-sided test, 86, 90
one-step estimator, 123, 135
ordered categorical variable, 207, 229, 248
ordinary least squares, see OLS
outliers, 132
panel data, 2, 166
parallel test items, 143
parameter estimation, see estimation methods
permutation test, 210
piecewise linear function, 186
PinT, 144, 251
Poisson distribution, 234
Poisson regression, 208, 234
polynomial function, 183-186
polynomial random part, 183
polynomial trend analysis, 172
population, 16, 45
population between-group variance, 17
population of curves, 181
population within-group variance, 17
posterior confidence intervals, 60-63
posterior intercept, 60
posterior mean, 58-63, 132, 143
posterior slope, 82
power, 141-142, 251
predictor variable, 39, 65
probit regression, 224
pseudo level, 158
psychological test theory, 24, 69
quasi-likelihood, 218
R², see explained variance
random coefficient model, 38, 43, 68
random coefficients, 2, 42, 68
random effects, 1, 2, 17, 43, 45
  comparison of intercepts, 62
random effects ANOVA, 45
random interaction, 68
random intercept, 42, 45, 51, 67
  test, 89-90
random intercept model, 38-66, 106, 168
  comparison with OLS model, 50
  for dichotomous data, 215-218
  multivariate, 201-205
  three-level, 63
random part, 51, 68, 79, 121, 125-128
  test, 88-91
random slope, 68, 70, 74, 105, 122
  explanation, 72-79
  test, 89-91, 96, 123
random slope model, 67-85, 108
  for dichotomous outcome variable, 220-223
  multivariate, 206
random slope variance, 68, 71
  interpretation, 70
reliability, 24-26, 59, 144
REML, see residual maximum likelihood
repeated measures, 143, 166
residual intraclass correlation, 48, 57, 68, 171
residual iterated generalized least squares, 57, 82
residual maximum likelihood, 21, 34, 47, 56, 82, 89, 175
residual variance, 48, 69, 73, 121, 128
  for dichotomous data, 209
  non-constant, 110, 114, 117
residuals, 27, 40, 42, 45, 47, 68, 73, 76
  level-one, 121, 128-132
  level-two, 132-139
  multivariate, see multivariate residual
  non-normal distributions, 139
restricted maximum likelihood, see residual maximum likelihood
RIGLS, see residual iterated generalized least squares
sample
  multi-stage, 6, 7, 41
  two-stage, 6, 16, 28, 142, 144, 146
sample size, 22, 140-154
  for estimating fixed effects, 144-150
  for estimating group mean, 143-144
  for estimating intraclass correlation, 151-153
  for estimating population mean, 142-143
  for estimating variance parameter, 154
sampling designs, 6
sandwich estimator, 139, 250
SAS, 110, 199, 248
shift of meaning, 13, 50
shrinkage, 59-61, 63, 82
significance level, 141
slopes as outcomes, 1, 73
social networks, 162-165
software, 239-251
spline function, 113, 125, 172, 189, 216
  cubic, 189, 190
  quadratic, 190
SPSS, 249
standard error, 23, 36, 141-142
  of empirical Bayes estimate, 58, 61
  of fixed coefficients, 144, 251
  of intercept variance, 154
  of intraclass correlation, 21, 151
  of level-one variance, 154
  of population mean, 23, 143
  of posterior mean, 59, 61
  of random part parameters, 83, 247
  of standard deviation, 83, 247
  of variance, 83, 247
standardized coefficients, 50
standardized multivariate residual, see multivariate residual
Stata, 250
success probability, 208, 211, 213, 215, 234
t-ratio, 55, 86
t-test, 55, 86
  for random slope, 123
  paired samples, 175
test, 86-91, 130, 141
  for dichotomous dependent variable, 220
  for random intercept, see random intercept
  for random slope, see random slope
textbooks, 4
three-level model, 63-66, 83-85, 104, 159, 200, 202
  for discrete dependent variable, 218
  for multivariate multilevel model, 200-206
threshold model, 223, 228, 231
total variance, 17
total-group correlation, 31-35
total-group regression, 27-31
transformation
  for count data, 234
  of dependent variable, 128
  of explanatory variables, 124
true score, 24, 59, 73, 143
unexplained variation, 38, 42, 43, 45, 47, 68, 73, 101, 225
VARCL, 218, 235, 245
variable occasion designs, 181-198
variance components, 99, 105-109, 181
Wald test, 86, 88
within-group centering, 54, 80, 123
within-group correlation, 31-35
within-group covariance matrix, 203
within-group deviation score, see deviation score
within-group regression, 1, 14, 27-31, 52-56, 80-81, 86, 87, 122
  in three-level model, 65
within-group variance, 17, 20
  for dichotomous data, 210
within-subjects design, 168
within-subjects variable, 169
zero variance estimate, 20, 57, 82
