
International Journal of Research in Marketing
ELSEVIER
Intern. J. of Research in Marketing 13 (1996) 139-161

Applications of structural equation modeling in marketing and consumer research: A review


Hans Baumgartner a,*, Christian Homburg b
a The Pennsylvania State University, 707-K BAB, Department of Marketing, Smeal College of Business, University Park, PA 16802, USA
b WHU Koblenz, D-56179 Vallendar, Germany

Received 15 September 1995; accepted 2 November 1995

Abstract

This paper reviews prior applications of structural equation modeling in four major marketing journals (the Journal of Marketing, Journal of Marketing Research, International Journal of Research in Marketing, and the Journal of Consumer Research) between 1977 and 1994. After documenting and characterizing the number of applications over time, we discuss important methodological issues related to structural equation modeling and assess the quality of previous applications in terms of three aspects: issues related to the initial specification of theoretical models of interest; issues related to data screening prior to model estimation and testing; and issues related to the estimation and testing of theoretical models on empirical data. On the basis of our findings, we identify problem areas and suggest avenues for improvement.
Keywords: Structural equation modeling; Confirmatory factor analysis

* Corresponding author. Tel: (814) 863-3559; fax: (814) 865-3015; e-mail: JXB14@psuvm.psu.edu.

0167-8116/96/$15.00 Copyright 1996 Elsevier Science B.V. All rights reserved. SSDI 0167-8116(95)00038-0

1. Introduction

Since the development of a general framework for specifying structural equation models with latent variables - referred to as the Jöreskog-Keesling-Wiley model by Bentler (1980) - and the implementation of the statistical approach in the LISREL computer program, latent variable modeling has become a popular research tool in the social and behavioral sciences. The monograph by Bagozzi (1980) on Causal Modeling is generally credited with bringing the technique to the attention of a wide audience of marketing and consumer behavior researchers, and articles in which structural equation modeling is used for data analysis now appear routinely in most leading marketing and consumer behavior journals. The popularity of the methodology is apparent from the recent introduction of the eighth version of LISREL (Jöreskog and Sörbom, 1993a - also available as a module in SPSSX) and the emergence of a host of computer programs that can be used as alternatives to LISREL, such as COSAN (Fraser, 1980), EQS (Bentler, 1989 - also implemented in BMDP), EZPATH (Steiger, 1989), LINCS (Schoenberg, 1989), the PROC CALIS procedure in SAS, and RAMONA (Browne and Mels, 1992). Although the potential of structural equation modeling (henceforth referred to as SEM) for comprehensive investigations of both measurement and theoretical issues is generally acknowledged (e.g., Anderson and Gerbing, 1988; Bagozzi, 1984; Bagozzi and Yi, 1988; Dillon, 1986; Steenkamp and van Trijp, 1991) and even though the methodology has developed a loyal following in some quarters and continues to attract new users, some authors have commented critically on the technique's value for empirical research. These criticisms range from outright denial of the method's usefulness because of the presumed implausibility of underlying assumptions (e.g., Freedman, 1987) to concerns about the way in which SEM has been applied in practice (e.g., Breckler, 1990; Biddle and Marlin, 1987; Cliff, 1983; Fornell, 1983; Martin, 1987). In our opinion, the methodology has much to offer to the empirical researcher, but there are many pitfalls that can make SEM a dangerous tool in the hands of inexperienced users.

The purpose of this paper is to critically evaluate previous empirical applications of SEM in four leading marketing and consumer behavior journals (the Journal of Marketing, Journal of Marketing Research, International Journal of Research in Marketing, and the Journal of Consumer Research) and to provide guidance to future users on how to employ the methodology more appropriately. Specifically, our review has the following objectives. First, we want to document the number of applications of SEM over the years and classify these applications in terms of relevant criteria such as the purpose for which the methodology is used (e.g., investigations of measurement issues, tests of theoretical relationships). To our knowledge, no comprehensive survey of marketing applications has been reported in the literature, and little is known about the use of LISREL and related programs in actual research.
Second, we seek to evaluate the quality of applications of SEM by assessing their conformance with formal statistical assumptions required for the valid use of these techniques, their adherence to guidelines derived empirically from simulation studies, and their reliance on rules of thumb proposed by expert practitioners. Initial applications of methodologies, particularly if they are as complex as SEM, are prone to misuse, and an in-depth analysis of a critical mass of empirical studies should point to problem areas and suggest avenues for improvement.

2. Previous applications of SEM


We selected the Journal of Marketing, Journal of Marketing Research, International Journal of Research in Marketing, and the Journal of Consumer Research as the journals most representative of research in the fields of marketing and consumer behavior. All issues between 1975 and 1994 were searched for empirical applications of SEM. Theoretical papers dealing with issues related to SEM and papers in which only simulated data were analyzed or actual data were analyzed for illustrative purposes only were not considered. Similarly, conventional exploratory factor analysis models, path analysis and other structural models estimated by regression methods (e.g., models estimated by two-stage least squares), nonlinear structural models, partial least squares (PLS) models (cf. Fornell and Bookstein, 1982), and ordinal and limited observed variable models (e.g., those that can be estimated with the LISCOMP program of Muthén (1987)) were excluded from the sample. Essentially, our data base of applications of SEM includes confirmatory measurement models, single-indicator structural models provided they were estimated by a program normally used for latent variable modeling, and integrated measurement/latent variable models. In total, we found 149 applications that satisfied our selection criteria. 1

1 A listing of the 149 applications included in the meta-analysis may be obtained by contacting the first author.

Fig. 1 graphs the number of applications between 1977 (the first year an application was found) and 1994, both overall and by journal. It is apparent that the use of SEM in the four journals has increased fairly steadily over the years. When the number of applications (overall and separately for each journal) is regressed on the linear and quadratic effects of time (plus a dummy variable for 1982 in the analysis for the Journal of Marketing Research and in the overall analysis to correct for the large number of applications found in the November issue of this journal due to a special issue devoted to SEM), only the linear trend of time and the dummy variable are significant (the only exception being a significantly positive quadratic effect for the Journal of Marketing), indicating that in general the use of SEM has neither accelerated nor decelerated over the years. Overall, the Journal of Marketing, Journal of Marketing Research, International Journal of Research in Marketing, and the Journal of Consumer Research accounted for 28, 42, 3, and 28 percent of the total number of applications, respectively.

Fig. 1. Applications of structural equation modeling over time.

SEM can be regarded as a methodological innovation, and the question arises how this innovation has developed over time. To investigate this issue, the Bass (1969) diffusion model was fit to the data. The Bass model is applicable to non-replacement sales only, which in the present case means that it models only initial applications of the technique by a given author. When there are multiple authors and at least one of the authors has not used the method previously, it is not obvious whether the paper should be regarded as an initial application since knowledge of who conducted the analysis is unavailable. We used the conservative criterion that if at least one of the authors was a first-time user, the paper was included in the sample. A total of 128 papers were thus classified as initial applications. In fitting the Bass model to our data, we regressed the number of first-time applications per year on the number of cumulative applications up to the previous year, the square of the latter term, and a dummy variable for the special issue in 1982. The resulting regression model fit the data well, accounting for 78 percent of the variance in the dependent variable, and all three predictors were statistically significant (absolute t-values > 1.89, with the coefficient of the quadratic term being negative, as expected). The coefficient of innovation (p) was 0.005 and the coefficient of imitation (q) was 0.207, and since q is substantially greater than p the introduction of SEM may be considered a successful innovation (Bass, 1969).

Structural equation models can be specified to investigate measurement issues, to examine structural relationships among sets of variables, or to accomplish both purposes simultaneously. Most published applications of SEM are factor-analytic measurement studies (39 percent) and integrated investigations of both the measurement structure underlying a set of observed variables and the structural relations among the latent variables (42 percent). In some cases SEM is also used for examining the relationships among variables which are all measured by single indicators (15 percent), and in five instances multiple uses of SEM were reported (e.g., separate analyses of measurement and structural models were performed). 2

The vast majority of published studies have been conducted with cross-sectional data (93 percent). In one instance the analysis was performed on true longitudinal data covering many time periods, and in nine cases the authors used panel data, where observations at two or more points in time were available for each member of the sample. The almost exclusive reliance on cross-sectional data to investigate structural relationships among constructs (89 percent when measurement studies are excluded from the sample) and the well-known problems of inferring causation from cross-sectional data (e.g., Cliff, 1983; Biddle and Marlin, 1987) suggest that special care be exercised in causally interpreting results derived from (cross-sectional) structural equation models.
In fact, it might be advantageous to avoid the term causal modeling altogether and instead talk about SEM, as is done in this paper.

2 Separate analysis of measurement and structural models does not refer to the two-step approach of Anderson and Gerbing (1988). Rather, authors sometimes perform a factor analysis on the set of items available and then combine variables into composites, which are analyzed as single-indicator constructs.
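The estimation logic described above (regress yearly first-time adoptions on lagged cumulative adoptions and their square, then solve the quadratic coefficients for the market potential m and the Bass coefficients p and q) can be sketched as follows. The input series below is synthetic, generated from an assumed diffusion curve; it is not the authors' application counts.

```python
import numpy as np

# Sketch of the Bass (1969) estimation described above:
#   n(t) = a + b*C(t-1) + c*C(t-1)^2, with a = p*m, b = q - p, c = -q/m,
# where n(t) is the number of first-time adoptions in year t and C(t-1) is
# the cumulative count through the previous year. The yearly counts are
# SYNTHETIC, generated from an assumed curve (m = 150, p = 0.01, q = 0.30).
true_m, true_p, true_q = 150.0, 0.01, 0.30
n_t, C = [], 0.0
for _ in range(15):
    adopt = true_p * true_m + (true_q - true_p) * C - (true_q / true_m) * C ** 2
    n_t.append(adopt)
    C += adopt
n_t = np.array(n_t)
C_prev = np.concatenate(([0.0], np.cumsum(n_t)[:-1]))  # cumulative through t-1

# Quadratic regression of adoptions on lagged cumulative adoptions
X = np.column_stack([np.ones_like(C_prev), C_prev, C_prev ** 2])
a, b, c = np.linalg.lstsq(X, n_t, rcond=None)[0]

# Solve c*m^2 + b*m + a = 0 for m, then back out p and q
m = (-b - np.sqrt(b ** 2 - 4 * c * a)) / (2 * c)
p = a / m   # coefficient of innovation
q = b + p   # coefficient of imitation
print(f"m = {m:.1f}, p = {p:.4f}, q = {q:.4f}")
```

On the authors' actual series of first-time applications, the same regression yields p = 0.005 and q = 0.207.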


As a final characterization of previous applications of SEM, our analysis shows that 85 percent of all authors used LISREL to perform the analysis. In six instances the data were analyzed with EQS, and in 17 cases another program was used or the author(s) did not specifically mention which program was used. The data indicate that LISREL has enjoyed a considerable first-mover advantage, and it will be interesting to see whether any of the newer programs will be able to challenge the hegemony of LISREL and become a serious competitor in the future.

3. Methodological issues in the application of SEM

In discussing methodological aspects relating to SEM and in assessing the quality of published applications, we will consider the following three broad sets of issues (cf. Bagozzi and Baumgartner, 1994): (1) issues related to the initial specification of theoretical models of interest; (2) issues related to data screening prior to model estimation and testing; and (3) issues related to the estimation and testing of theoretical models on empirical data. In contrast to the previous section, where the focus was on tracking the number of articles employing SEM over time, the unit of analysis in this section is a given model. In many papers only a single model is considered. When multiple (mostly nested) models are compared using a single sample but the author(s) present(s) one model that best represents the data, only this final model is analyzed. Thus, in most cases using the article or the model as the unit of analysis leads to the same result. In some papers, however, the same model is estimated on multiple samples (e.g., the model is cross-validated with different respondents), different models are estimated on the same sample (e.g., separate measurement models are specified for different constructs or separate measurement and structural models are investigated), or different models are estimated on different samples (e.g., different model specifications are examined using different respondents). In the first case, the data were averaged across replications before including the application in the analysis. In the other two cases, each distinct model was used separately in the analysis, so that a single paper might contribute several data points. The sample size for most analyses in this section is 184, although in some cases it might be slightly smaller because of missing values.

3.1. Issues related to the initial specification of theoretical models of interest

Model specification. To establish a common terminology for the discussion that follows, we will briefly review some specification issues. Using the LISREL formulation, a full structural equation model can be stated as follows (cf. Bollen, 1989):

η = Bη + Γξ + ζ,  (1)
y = Λ_y η + ε,  (2)
x = Λ_x ξ + δ.  (3)
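As a concrete illustration of this notation, the covariance matrix that the model implies for the observed variables can be computed directly from the parameter matrices (the algebra is given in Bollen, 1989). The numpy sketch below uses entirely hypothetical parameter values: one exogenous construct ξ and one endogenous construct η, with three indicators each. It illustrates the computation, not any model from the reviewed studies.

```python
import numpy as np

# Model-implied covariance matrix for the LISREL model of Eqs. (1)-(3).
# All parameter values are HYPOTHETICAL, chosen only for illustration.
B   = np.array([[0.0]])                  # effects among endogenous constructs
G   = np.array([[0.6]])                  # Gamma: effect of xi on eta
Phi = np.array([[1.0]])                  # Cov(xi)
Psi = np.array([[0.64]])                 # Cov(zeta), errors in equations
Ly  = np.array([[1.0], [0.8], [0.7]])    # Lambda_y
Lx  = np.array([[1.0], [0.9], [0.75]])   # Lambda_x
Te  = np.diag([0.3, 0.4, 0.5])           # Theta_epsilon
Td  = np.diag([0.2, 0.3, 0.4])           # Theta_delta

I = np.eye(B.shape[0])
A = np.linalg.inv(I - B)                 # (I - B)^{-1}
cov_eta = A @ (G @ Phi @ G.T + Psi) @ A.T
Syy = Ly @ cov_eta @ Ly.T + Te           # Cov(y)
Sxx = Lx @ Phi @ Lx.T + Td               # Cov(x)
Syx = Ly @ A @ G @ Phi @ Lx.T            # Cov(y, x)
Sigma = np.block([[Syy, Syx], [Syx.T, Sxx]])
print(np.round(Sigma, 3))
```

Estimation then amounts to choosing the free parameters so that Sigma reproduces the sample covariance matrix as closely as possible under the chosen fit function.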

Eq. (1) is called the latent variable (or structural) model and expresses the hypothesized relationships among the constructs in one's theory. The m × 1 vector η contains the latent endogenous constructs and the n × 1 vector ξ consists of the latent exogenous constructs. The coefficient matrix B shows the effects of endogenous constructs on each other, and the coefficient matrix Γ signifies the effects of exogenous on endogenous constructs. The vector of disturbances ζ represents errors in equations. If latent variables are specified to have simultaneous effects on each other (so that the B matrix has nonzero elements both above and below the diagonal) and/or if errors in equations are allowed to be correlated, the model is called nonrecursive. If B is subdiagonal and the ζ_i are uncorrelated, the model is said to be recursive. Special care is required with nonrecursive models because of such issues as model identification, the stability of reciprocal effects, and the interpretation of measures of variation accounted for in endogenous constructs (cf. Schaubroeck, 1990; Teel et al., 1986). Eqs. (2) and (3) are factor-analytic measurement models which tie the constructs to observable indicators. The p × 1 vector y contains the measures of the endogenous constructs, and the q × 1 vector x consists of the measures of the exogenous constructs. The coefficient matrices Λ_y and Λ_x show how y relates to η and x relates to ξ, respectively. The vectors of disturbances ε and δ represent errors in variables (or measurement error). Generally (but not always) the measurement model possesses simple structure such that each observed variable is related to a single latent variable. Models with simple structure and no correlated measurement errors represent unidimensional construct measurement, which is frequently considered to be a highly desirable characteristic of measurement (Anderson and Gerbing, 1988; Gerbing and Anderson, 1988; Hattie, 1985).

A total of 73 models in our sample were full structural equation models consisting of Eqs. (1) through (3). 3 In 81 cases, SEM was used solely for investigating the measurement structure underlying a set of observed variables. This submodel, which corresponds to a confirmatory measurement model, is given by either Eq. (2) or Eq. (3). Sometimes, SEM is also applied to examinations of the structural relations among constructs that are all measured by single indicators. The necessary specification is obtained by ignoring unreliability of measurement and setting Λ_y and Λ_x to be equal to identity matrices, or by assuming reliability to be known and fixing the factor loadings or error variances accordingly. This was done in 30 cases. In the sequel we will refer to confirmatory measurement models, single-indicator structural models, and integrated measurement/latent variable models as models of type I, II, and III, respectively. 4

3 Included in this category are several second-order factor models, which are measurement models from a substantive perspective but which are specified as integrated measurement/structural models for estimation purposes.

4 All three types of models can be specified as single-sample or multi-sample models. If the results of both the single-sample and multi-sample analyses were reported in the paper, the single-sample data were used and averaged before including the application in our sample. In seven cases (three type II models and four type III models) only the results of the multi-sample analysis were reported so that these data had to be used in the analysis.

Measurement model specification. One important consideration in planning a study is how many observed variables should be used to measure each latent variable and how the various indicators should be related to each construct. It is generally accepted that each construct should be measured by multiple items, but how many items should be used is less clear. On the one hand, a sufficient number of indicators per factor has to be available for a model to be identified, and for estimation problems such as nonconvergence and improper solutions to be minimized it is advantageous to have many indicators (e.g., Anderson and Gerbing, 1984). Besides these statistical considerations, a greater number of observed measures is more likely to tap all facets of the construct of interest. On the other hand, the greater the number of indicators per factor, the more difficult it will probably be to parsimoniously represent the measurement structure underlying a set of observed variables and to find a model that fits the data well.

Bagozzi and Heatherton (1994) have recently suggested an approach for representing personality constructs that seems applicable to the modeling of measurement structures in general. They distinguish four different levels of abstraction in modeling personality constructs, but in the present context a differentiation into three levels seems sufficient. In the total aggregation model, a single composite is formed by combining all the measures of a given construct. This approach results in a model that is formally identical to one in which only a single indicator is available, but in general a composite single indicator should be more reliable than a true single-item measure. In fact, it is possible to compute a measure of reliability when a composite of items is available (e.g., coefficient α), and this estimated reliability can be incorporated into the analysis by fixing the error variance of the indicator to (1 - reliability) times the variance of the indicator. This method has the advantage that the specification of the model is quite simple and that, compared to the true single-indicator case, unreliability of measurement can be taken into account in a limited way. However, a major disadvantage is that the quality of construct measurement is not investigated explicitly (e.g., no assessment of unidimensionality is provided). In the partial aggregation and partial disaggregation models, subsets of items are combined into several composites and these composites are treated as multiple indicators of a given factor.
This method takes into account unreliability more explicitly and allows some assessment of unidimensionality while minimizing model complexity. However, combining subsets of items into composites is usually somewhat arbitrary. Finally, in the total disaggregation model true single-item measures are used as multiple measures of an underlying latent variable. This method allows the most explicit tests of the quality of construct measurement, but unfortunately the analysis becomes rather unwieldy if more than, say, five indicators per factor are available and the model contains even a moderately large number of constructs.

Table 1 presents relevant statistics regarding the issue of construct measurement, both overall and by type of model. Overall, the median number of observed variables (p + q) across all applications of SEM has been 11 and the median number of constructs (m + n) has been five, resulting in a median ratio of observed variables to constructs of about two. For type II models, this ratio is identically equal to one by definition. In pure measurement studies, the median number of items per factor is about four, whereas in integrated measurement/latent variable models the median is around two. An unexpectedly large number of models contained at least one single-indicator 'latent' construct. By definition, all constructs are indicated by a single measure in type II models. However, even in type III models at least one single-indicator construct was used in 71 percent of all cases. For type I models this figure was much lower (7 percent), but this should not be too surprising given that these studies deal exclusively with construct measurement. Single-indicator constructs are unattractive because they ignore unreliability of measurement, which is one of the problems SEM was specifically designed to circumvent. As noted by Bentler and Chou (1987), even having two measures per factor might be problematic since three indicators per construct are needed for a model to be identified unless covariances among factors help to identify the system of equations. Furthermore, as discussed by Bentler and Bonett (1980) and Anderson and Gerbing (1988), a model of independence at the structural level serves a very useful function in model comparison tests, and such a model is in general not identified when fewer than three indicators are available.
These arguments suggest that authors should not use single-indicator constructs or the minimum number of indicators required for multi-item measurement of constructs. Instead, we recommend that each latent variable be assessed with a minimum of three or four indicators (cf. Bollen, 1989).

As mentioned previously, not all single-item constructs are true single-item measures. Authors sometimes aggregate items into composites before entering them into the analysis. We also assessed the number of actual items that went into each measure, and the median of this variable (18) was substantially higher than the median number of formal indicators on which the degrees of freedom are based (11). In fact, across all three models items were combined into composites prior to entering them into a structural equation model in 38 percent of the cases. This practice was particularly prevalent in the case of type II models, where 77 percent of applications followed this procedure. Sometimes, it will be practically unavoidable to combine items into composites if the number of indicators is even moderately large (e.g., if one of the constructs is a personality trait which is measured by a battery of, say, ten items). In this case we recommend that a (confirmatory) factor analysis be conducted on the items to be aggregated and that evidence on the dimensionality of the construct be presented. If unidimensionality holds, the items may be aggregated into a single composite and measure unreliability can be taken into account by fixing the error variance appropriately, as described previously. If unidimensionality does not hold, only indicators of a given subdimension of the construct should be combined and the resulting composites can be treated as multiple indicators of a higher-order construct, provided the intercorrelations are high enough. Otherwise, separate constructs will have to be specified. Unreliability can be taken into account as in the single-composite case.

In terms of further characteristics of construct measurement practices, Table 1 shows that about 20 percent of all measurement models contain double-loading items.
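The reliability-based fix recommended above (setting an aggregated indicator's error variance to (1 - reliability) times the variance of the composite) can be sketched as follows. The item scores are simulated for illustration, and coefficient α stands in for "reliability"; any real item battery could be substituted.

```python
import numpy as np

# Total-aggregation sketch: combine items into one composite, estimate its
# reliability with coefficient alpha, and compute the error variance that
# would be FIXED (not estimated) for the composite's single indicator in
# the structural equation model. Item scores are SIMULATED.
rng = np.random.default_rng(0)
true_score = rng.normal(size=200)                                    # latent construct
items = true_score[:, None] + rng.normal(scale=0.8, size=(200, 4))   # 4 noisy items

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)  # Cronbach's alpha

composite = items.mean(axis=1)          # the single indicator entered in the model
comp_var = composite.var(ddof=1)
fixed_error_var = (1 - alpha) * comp_var  # error variance fixed in the specification
print(f"alpha = {alpha:.3f}, fixed error variance = {fixed_error_var:.3f}")
```

With the loading of the composite on its construct fixed at one and its error variance fixed at this value, measure unreliability is taken into account in the limited way described in the text.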
Although not advisable in general (Gerbing and Anderson, 1988), in about half of these cases the procedure is justified because of a priori specifications of method factors in multitrait-multimethod analyses (cf. Bagozzi and Yi, 1991; Bagozzi et al., 1991).

Table 1. (Table entries are medians, with the 25th and 75th percentiles in parentheses, unless a percentage value is indicated.)

About 5 percent of type I models and 19 percent of type III models contain correlated measurement errors. We specifically coded whether the authors provided a substantive justification for this practice or whether they introduced correlated errors simply because of goodness of fit considerations. In about 50 percent of the cases no justification was provided. We concur with Anderson and Gerbing (1988) and Hattie (1985) that unidimensional construct measurement is a desirable characteristic of measurement models, and we recommend that researchers use double-loading items and correlated errors of measurement with caution and not introduce them simply to boost the fit of the model.

Latent variable model specification. We coded how many of the models were nonrecursive, either because of correlated errors in equations or because of entries both above and below the diagonal in the B matrix. Approximately 23 percent of type II models and 26 percent of type III models contained correlated errors in equations, of which 57 and 16 percent were substantively justified, respectively (see Table 1). Correlated errors in equations are sometimes useful to model correlations among the endogenous constructs that are due to unmeasured or omitted variables, but as in the case of correlated errors of measurement they should be used with caution. About 7 percent of type II models and 8 percent of type III models allowed simultaneous effects among the endogenous constructs, and overall 30 percent of type II models and 32 percent of type III models were nonrecursive. Given the complexity of nonrecursive models, these figures are fairly high and, as shown below, the lack of proper regard for issues such as model identification indicates a potential cause for concern.
We recommend that if nonrecursive models are specified, relevant issues such as model identification and the stability of reciprocal effects be addressed explicitly in the paper.

Sample size. Another important issue that should be considered prior to actually conducting the study is whether the sample size is likely to be sufficient given the number of parameters to be estimated. All methods for the estimation and testing of structural equation models are based on asymptotic theory and the sample size has to be 'large' for the parameter estimates and test statistics to be valid. Little theoretical guidance as to what constitutes an adequate sample size is available and the evidence from simulation studies is sparse, but Bentler and Chou (1987) provide the rule of thumb that under normal distribution theory the ratio of sample size to number of free parameters should be at least 5:1 to get trustworthy parameter estimates, and they further suggest that these ratios should be higher (at least 10:1, say) to obtain appropriate significance tests. Table 1 shows that, overall, the median number of parameters estimated was about 29, with the median somewhat smaller for type II models. The median sample size was 178, resulting in a median ratio of sample size to number of free parameters of about 6:1. The median ratio is smallest for type III models at about 5:1. A total of 41 (73) percent of all models had ratios smaller than 5:1 (10:1). In particular, the figures are lowest for type III models. In fact, for 86 percent of these models the ratio of sample size to number of parameters estimated was smaller than 10:1. These figures show that sample sizes are often toward the lower end of, or even below, levels that are considered acceptable to obtain trustworthy parameter estimates and valid tests of significance. It is fairly easy to calculate beforehand what the likely number of parameters to be estimated will be, and necessary sample sizes should be determined accordingly. As pointed out by Martin (1987), there may be a trade-off between collecting data of high quality and gathering data from a large sample of respondents. A researcher's primary objective should be to obtain high-quality data so that SEM may not be an appropriate methodology in certain cases. In particular, even though SEM can in principle be used in experimental designs (Bagozzi and Yi, 1989), small sample sizes will often preclude application of the technique in that context.

Model identification. Another important a priori consideration is whether the model to be estimated is identified.
A model is said to be identified if it is impossible for two distinct sets of parameter values to yield the same population variance-covariance matrix. A necessary condition for identification is that the number of parameters to be estimated should not exceed the number of distinct elements in the variance-covariance matrix of the observed variables. This rule, which implies that the number of degrees of freedom be nonnegative, is easy to check, but unfortunately it is not a sufficient condition for identification. General, easy-to-follow procedures for proving identification are unavailable except in specialized cases, and showing that a model is identified may be nontrivial for certain kinds of models. In particular, special care is required for models that are not unidimensional and/or nonrecursive. We coded whether the issue of identification was addressed in a given article. Table 1 shows that very few authors mention that they checked whether a model to be estimated was identified. It is possible that identification was considered without explicitly mentioning this fact, and in many cases computer programs will give a warning message if a model appears to be underidentified, so that identification problems might be detected even if identification is not proved theoretically. However, particularly when it is not immediately clear that a model is identified, it would be advisable to deal with identification explicitly and to mention this fact in the paper. Our own assessment of model identification in previous applications of SEM suggests that the vast majority of models are indeed identified theoretically. However, in at least two instances the target model was probably not identified and in another paper one of the comparison models was underidentified. It is thus strongly recommended that in the future more attention be paid to the issue of identification. 5

Degrees of freedom. As a final characterization of previous applications of SEM, Table 1 shows that the median number of degrees of freedom and thus overidentifying restrictions is about 32. The median is lowest for type II models at 11 and highest for type III models at 49. In a surprisingly large proportion of all cases degrees of freedom are reported incorrectly (8 percent).
The reason for this error is that a correlation matrix is used as input to estimation, at least one exogenous construct is measured by a single indicator, the loading of the single indicator on its 'construct' is constrained to one, and at the same time the variance of the single-indicator construct is assumed to be known and also fixed at one.⁶ This results in inflated degrees of freedom and exaggerated p-values for the overall goodness-of-fit χ² statistic. Fortunately, the error is generally small in magnitude.

In most cases, it is possible to separate the measurement model from the latent variable model and to partition the total number of degrees of freedom into degrees of freedom due to the particular specification of the measurement model (essentially the degrees of freedom of a saturated structural model) and degrees of freedom derived from a particular structural model formulation (the deviation of a model from a saturated structural specification). Table 1 indicates that, overall, most degrees of freedom derive from the measurement model. In type I models deviations from the saturated structural model are rare (i.e., cases where factor covariances are specified to be fixed), and in type II models overidentifying restrictions can only be imposed on the structural model. In type III models, where degrees of freedom may come from both the measurement and latent variable models, the median contribution of the measurement model to the total number of degrees of freedom is about 93 percent. Duncan (1975) and Fornell (1983) have pointed out the dangers of interpreting a good overall fit of a model as support for the validity of one's theory. The figures on the percentage of overidentifying restrictions that are generally derived from the measurement model provide further evidence on the folly of such arguments.

3.2. Issues related to data screening prior to model estimation and testing

Probably one of the most common mistakes in applying SEM is to pay little or no attention to the raw data and instead to immediately compute a correlation matrix and rush to model estimation and testing. The danger inherent in this practice is that the correlation matrix masks the multitude of factors that may call into doubt the applicability of the chosen statistical procedure.

⁵ Even if a model is identified theoretically, there might be empirical identification problems. This happens when the expression of a parameter in terms of observed variances and covariances involves a denominator that is zero or close to zero (cf. Kenny, 1979).

⁶ In other words, the scale of a latent variable should be fixed by setting either the loading of one reference indicator or the factor variance equal to unity (but not both).
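The bookkeeping behind this degrees-of-freedom error is easy to make concrete. The following sketch is our own illustration (the model and parameter counts are hypothetical, not taken from any reviewed study): it applies the necessary t-rule condition stated above and shows how fixing both the loading and the variance of a single-indicator construct adds one spurious degree of freedom.

```python
# Hypothetical illustration of the t-rule and of the single-indicator
# degrees-of-freedom error described in the text.

def distinct_moments(p):
    """Number of distinct elements in a p x p covariance matrix."""
    return p * (p + 1) // 2

def degrees_of_freedom(p, n_free_params):
    """Necessary (t-rule) condition: this must be nonnegative."""
    return distinct_moments(p) - n_free_params

# Example: 7 indicators, two correlated constructs.
# Construct A (6 indicators): 5 free loadings (one fixed at 1 to set the
# scale) + 6 error variances + 1 factor variance.
# Construct B (single indicator): loading fixed at 1, error variance fixed
# at 0, factor variance FREE; plus 1 factor covariance.
p = 7
correct_free = 5 + 6 + 1 + 1 + 1            # = 14 free parameters
print(degrees_of_freedom(p, correct_free))    # 28 - 14 = 14

# The error described in the text: the single-indicator factor variance is
# ALSO fixed at one, removing a parameter that should have been free.
erroneous_free = correct_free - 1
print(degrees_of_freedom(p, erroneous_free))  # 15: one spurious df
```

The spurious degree of freedom makes the model look more falsifiable than it is and inflates the p-value of the overall χ² test, exactly as described above.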

First, it is important to check that there are no coding errors, that variables have been recoded appropriately if necessary, and that missing values have been dealt with properly (Kaplan, 1990). Second, it is helpful to investigate possible distorting influences introduced by the presence of a few influential outliers. Third, it is crucial to examine the approximate normality of the data and to take corrective action if this assumption is violated, since most estimation methods assume that the data come from a multivariate normal population. Finally, it is essential that an appropriate measure of association be used as input to model estimation and testing.

Recent discussions of SEM have placed increased emphasis on such important issues as outlier detection and assessment of normality (see Bollen, 1989, for an excellent discussion). The necessary analyses can now be conducted fairly easily with conventional computer programs for SEM (e.g., EQS, PROC CALIS in SAS) or specialized programs such as PRELIS (Jöreskog and Sörbom, 1993b) and the LISRES macro for SAS (Davis, 1992; based on the work of Bollen, 1989, and Bollen and Arminger, 1991). PRELIS is helpful for exploratory data screening and for testing the univariate and multivariate normality of the observed variables. The LISRES macro can be used to flag 'atypical' cases and to perform outlier analysis for groups of observations, and it also reports univariate and multivariate tests of normality based on the skewness and kurtosis of the observed variables (see Bollen, 1989, for details). An assessment of the approximate normality of the data is important because model estimation and testing are usually based on the validity of this assumption, and lack of normality adversely affects goodness-of-fit indices and standard errors.

There was little evidence in most of the papers that the authors had shown particular concern for data screening prior to model estimation and testing.
Admittedly, data screening may often go unreported because of space constraints, and outlier detection in SEM is a relatively recent idea. However, it is widely known that SEM generally requires the assumption of multivariate normality, and we coded whether assessment of normality was discussed either qualitatively or quantitatively. Only in 8 percent of all cases did authors indicate that they had checked whether the data were at least approximately normal.
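As an illustration of the recommended check, the following sketch (the function name is ours) computes Mardia's multivariate kurtosis and the multivariate coefficient of relative kurtosis (cf. Browne, 1982). Under multivariate normality the expected kurtosis is approximately p(p + 2), so the relative coefficient should be close to 1; marked departures signal nonnormality.

```python
import numpy as np

# Illustrative sketch: Mardia's multivariate kurtosis b2p and the relative
# multivariate kurtosis b2p / (p * (p + 2)) as a normality summary.

def relative_multivariate_kurtosis(X):
    """X is an (n, p) data matrix; returns (b2p, b2p / (p * (p + 2)))."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n                # ML covariance estimate
    S_inv = np.linalg.inv(S)
    # Squared Mahalanobis distance of every observation from the centroid:
    d2 = np.einsum('ij,jk,ik->i', centered, S_inv, centered)
    b2p = np.mean(d2 ** 2)
    return b2p, b2p / (p * (p + 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                    # well-behaved normal data
b2p, rel = relative_multivariate_kurtosis(X)
print(round(rel, 2))                             # should be near 1
```

Reporting a single number such as this in a methods section, as recommended in the text, costs almost nothing and lets readers judge whether normal-theory estimation is defensible.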

Alternative estimation techniques for which multivariate normality is not as crucial as for maximum likelihood estimation (see the discussion below) were used very infrequently (5 percent), and in no instance was their use motivated by the fact that the data were not sufficiently normal. These results indicate that not enough attention is being paid to satisfying the assumptions that are necessary for the valid use of the most commonly used estimation technique. We recommend that in future research the requirement of multivariate normality be taken more seriously and that a summary measure describing the extent to which the data are normally distributed be reported as a matter of routine in the methods or results section of the paper (e.g., the multivariate coefficient of relative kurtosis; cf. Browne, 1982).

In addition to the issue of data screening, there is also the question of which measure of association to use in the analysis. Researchers often use correlations rather than covariances as input to estimation, and Cudeck (1989) has recently discussed this issue in some detail. Since in most cases maximum likelihood (as well as generalized least squares) fitting functions are scale invariant and the resulting estimates scale free (Bollen, 1989), this choice has no effect on overall goodness-of-fit indices and parameter estimates. However, standard errors may be inaccurate, and Cudeck (1989) cautions against the use of correlation matrices.

Our survey of previous applications of SEM shows that in many cases researchers do not mention which measure of association they used as input to estimation. If the conservative criterion is adopted that a correlation matrix is assumed to have been used by default, about 78 percent of applications have based estimation on correlation matrices. Covariances were specifically used in 21 percent of all cases, and in a few applications another measure of association was employed.
It is difficult to assess how frequently the use of correlations has had detrimental effects on the analysis. One case where correlations should not be used is in multiple-group analyses. In seven instances only the results of multiple-group analyses were reported in the paper. In four of these cases no mention was made of the fact that the analyses were based on covariances, and it is possible that correlations were used as input. We recommend that in future research all analyses be conducted on covariance matrices and that tests of significance for individual parameters be reported from this analysis. Standardized parameter estimates can easily be obtained from the standardized solution, in which either the latent variables or both the latent and observed variables have been standardized.
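Where only a correlation matrix and standard deviations are available (as is often the case with published results), the covariance matrix needed for such analyses can be recovered as S = DRD, with D the diagonal matrix of standard deviations. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Minimal sketch: rebuilding the covariance matrix S = D R D from a
# correlation matrix R and standard deviations, so that estimation can
# proceed on covariances as recommended. The numbers are hypothetical.

def corr_to_cov(R, sd):
    """R: (p, p) correlation matrix; sd: length-p standard deviations."""
    R = np.asarray(R, dtype=float)
    sd = np.asarray(sd, dtype=float)
    return R * np.outer(sd, sd)

R = np.array([[1.0, 0.3, 0.5],
              [0.3, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
sd = np.array([2.0, 1.0, 0.5])
S = corr_to_cov(R, sd)
print(np.diag(S))  # diagonal recovers the variances 4.0, 1.0, 0.25
```

The reverse transformation (dividing by the outer product of standard deviations) standardizes a covariance matrix, which is all that is lost by analyzing covariances: the standardized solution can always be recovered afterwards.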

3.3. Issues related to the estimation and testing of theoretical models on empirical data

Model estimation. A variety of estimation procedures are available to obtain parameter estimates and test statistics, but the majority of models (about 95 percent) are estimated using maximum likelihood techniques (see Table 2). The reason for this preference for maximum likelihood seems to be that it is the default method in most computer packages. Furthermore, given the lack of concern for the validity of the normality assumption, authors probably do not see a need to consider alternative estimation procedures. In principle, asymptotically distribution-free (ADF) methods (Browne, 1984) can be used regardless of which distribution underlies the observed variables, but in practice very large sample sizes are required, and simulations have shown that ADF techniques do not necessarily perform better even when they might be expected to be more appropriate theoretically (cf. Hu et al., 1992; Sharma et al., 1989). Given the small sample sizes on which correlations or covariances are generally based, ADF techniques are probably not a practical alternative in most situations.

Estimation problems. Sometimes estimation techniques encounter difficulties in converging on a solution or converge on a locally optimal solution, and even when a solution has been found it might be improper in the sense that the estimated parameters are impossible in the population (e.g., negative error variances) or of little use in testing (e.g., very large standard errors). Frequent causes of such problems are poorly specified models, overfitting, outliers, bad starting values, insufficiently operationalized constructs, and small sample sizes (Bentler and Chou, 1987). Few instances of estimation problems were encountered in our review (see Table 2). With respect to convergence problems, this is not too surprising since nonconvergence would probably preclude publication of an article.
One area where convergence problems were sometimes mentioned was in the context of multitrait-multimethod analysis. Improper solutions in the form of negative error variances were found in 5 percent of all models (although they were not always significant). Sometimes the offending estimate was set to zero, and sometimes it was retained. It should be noted that estimated model parameters are not always reported in enough detail, so that it is sometimes impossible to tell whether improper solutions occurred. We would recommend that authors at least mention that there were no improper solutions if complete results are not reported because of space constraints or other reasons.

Assessment of overall model fit. The most popular index for assessing the overall goodness of fit of a model has been the χ² statistic, which tests the null hypothesis that the estimated variance-covariance matrix deviates from the sample variance-covariance matrix only because of sampling error. In practice, the χ² test is sometimes of limited usefulness because it is not robust to violations of underlying assumptions (particularly normality) and because it is heavily influenced by sample size (Bentler, 1990). The latter problem is particularly serious because, on the one hand, large sample sizes are needed to obtain valid tests, but on the other hand specified models are probably never literally true and thus subject to rejection in sufficiently large samples (cf. Cudeck and Browne, 1983). Because of these problems, many alternative fit indices have been developed (for recent overviews see Gerbing and Anderson, 1993; Jöreskog, 1993; Marsh et al., 1988; Mulaik et al., 1989; Tanaka, 1993). Some of these are stand-alone indices assessing model fit in an absolute sense (e.g., the goodness-of-fit index (GFI), adjusted goodness-of-fit index (AGFI), and root mean square residual (RMR) reported by earlier versions of LISREL), and others are incremental fit indices comparing the target model to the fit of a baseline model, among them the Bentler and Bonett (1980) normed fit index (BBI) and the Tucker and Lewis (1973) nonnormed fit index (TLI). Most of the time the baseline model is one in which all observed variables are assumed to be uncorrelated, although other baseline models are possible. Recent work on goodness-of-fit assessment has emphasized the idea of expressing model fit in terms
of noncentrality (e.g., Bentler, 1990; McDonald and Marsh, 1990; Steiger, 1990; Browne and Cudeck, 1993). This approach explicitly recognizes the fact that hypothesized models are generally only approximately true, provides a basis for population-based (rather than sample-based) fit measures and associated confidence intervals, and appears to mitigate the problem that the means of the sampling distributions of many alternative fit measures are a function of sample size (with larger samples yielding larger fit indices on average). Among the stand-alone fit indices based on noncentrality are the McDonald (1989) measure of centrality (MC) and the root mean squared error of approximation (RMSEA) of Steiger (1989, 1990), which estimates how well the fitted model approximates the population covariance matrix per degree of freedom. Browne and Cudeck (1993) suggest that a value of RMSEA below 0.05 indicates close fit and that values up to 0.08 are reasonable, and they propose a test of close fit for testing the hypothesis that RMSEA is smaller than 0.05. (In contrast, the conventional χ² statistic tests the hypothesis that RMSEA = 0.) Among the incremental fit indices based on noncentrality are the Bentler (1990) normed comparative fit index (CFI), which in most cases equals the McDonald and Marsh (1990) nonnormed relative noncentrality index, and the Tucker-Lewis index (TLI). The most important difference between CFI and TLI is that TLI (like RMSEA) expresses fit per degree of freedom, thus imposing a penalty for estimating less parsimonious models. This may be important in comparing models of different complexity.

Table 2 shows that most published articles using SEM report at least one stand-alone fit index, and slightly more than a third also rely on incremental fit indices to assess the overall fit of the model. As expected, by far the most commonly used fit index is the χ² test. Other common stand-alone indices are GFI, AGFI, and RMR.
Until recently these were the indices automatically reported by LISREL, which seems to explain their popularity. Incremental fit indices are used less frequently, probably because, until recently, one had to estimate a separate baseline model and calculate the index by hand. BBI has been the most popular incremental fit index, but CFI has recently gained in popularity. The latest versions of the most common computer programs for SEM (e.g., LISREL 8, the PROC CALIS procedure in SAS) report a large number of different fit indices, so that authors will have to decide which ones to use in model evaluation. Bollen and Long (1993) recommend that researchers not rely solely on the χ² statistic but report multiple fit indices representing different types of measures (i.e., other stand-alone indices besides the χ² statistic, such as RMSEA, and incremental fit indices such as CFI and TLI; see Tanaka, 1993, for a discussion of different dimensions along which fit indices can be classified). Table 2 indicates that researchers base model evaluation too much on the χ² test, and we suggest that alternative fit indices (particularly those based on noncentrality) be used more widely in future applications.

Table 2 also reports summary statistics on the distribution of goodness-of-fit measures across previous applications of SEM. The χ² statistic by itself is not meaningful without taking into account the degrees of freedom of a model, so the ratio of χ² to degrees of freedom (χ²/df) is reported. If either GFI or AGFI was provided in the paper, it is generally possible to calculate the other index, and this was done to increase the sample size. In the case of RMR, only models in which the RMR was based on a correlation matrix are included in the sample. Summary statistics are also reported for MC and RMSEA since they might become more widespread in the future. Besides the three incremental fit indices that were encountered with some degree of frequency (BBI, TLI, and CFI), Table 2 also reports the relevant figures for the relative fit index (RFI), or ρ₁, and the incremental fit index (IFI), or Δ₂, both of which are due to Bollen (1989). Provided that at least one incremental fit index was reported in the paper, it is possible to calculate the χ² of the baseline model and thus any other incremental fit index.
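The back-calculation just described rests on a handful of closed-form definitions. The sketch below collects them (formulas are the standard ones, cf. Bentler, 1990; Bollen, 1989; Steiger, 1990); the numerical inputs are purely hypothetical, not taken from any reviewed study.

```python
import math

# Illustrative sketch: recovering fit indices from a model's chi-square and
# df, the independence (baseline) model's chi-square and df, and the sample
# size n. The numbers below are hypothetical.

def fit_indices(chi2, df, chi2_base, df_base, n):
    d = max(chi2 - df, 0.0)                # noncentrality of target model
    d_base = max(chi2_base - df_base, d)   # noncentrality of baseline model
    return {
        'chi2/df': chi2 / df,
        'RMSEA': math.sqrt(d / (df * (n - 1))),               # Steiger
        'BBI': 1 - chi2 / chi2_base,                          # Bentler-Bonett
        'TLI': ((chi2_base / df_base) - (chi2 / df))
               / ((chi2_base / df_base) - 1),                 # nonnormed
        'CFI': 1 - d / d_base if d_base > 0 else 1.0,         # Bentler (1990)
        'IFI': (chi2_base - chi2) / (chi2_base - df),         # Bollen's Delta2
        'RFI': ((chi2_base / df_base) - (chi2 / df))
               / (chi2_base / df_base),                       # Bollen's rho1
    }

indices = fit_indices(chi2=85.0, df=50, chi2_base=600.0, df_base=66, n=200)
for name, value in indices.items():
    print(f'{name}: {value:.3f}')
```

Note how TLI and RMSEA divide by degrees of freedom while BBI and CFI do not; this is the parsimony penalty discussed above.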
Whenever possible, this was done to increase the sample size on which the summary statistics are based. Only models for which the baseline model was the one of complete independence among all measures were used. The effective sample sizes for the norms on χ²/df, GFI, AGFI, RMR, MC, RMSEA, BBI, TLI, CFI, RFI, and IFI are provided in brackets in Table 2.

The medians for χ²/df, GFI, AGFI, RMR, MC, and RMSEA were 1.62, 0.95, 0.91, 0.05, 0.95, and 0.06, respectively. In 54 percent of all cases the hypothesized model was inconsistent with the data based on a χ² goodness-of-fit test at a critical value of 0.05. The corresponding figures for type I, II, and III models were 61, 21, and 60 percent. Thus, type II models tend to achieve better fits than the other two types of models. The reason for this seems to be that these models are generally less complex than type I and type III models. The medians for BBI, TLI, CFI, RFI, and IFI were 0.91, 0.93, 0.95, 0.85, and 0.95. The relative magnitude of the various incremental fit indices is consistent with theoretical expectations (cf. Bollen, 1989). The percentage of applications in which the GFI, AGFI, BBI, TLI, and CFI were smaller than 0.9 (the value presumably indicating acceptable fit for these indices) was 24, 48, 44, 32, and 21, respectively. Although the figures for the incremental fit indices should be interpreted with caution because of the relatively small number of applications on which they are based, it is apparent that the 0.9 value is a reference point that many models do not achieve. For RMSEA, 58 percent of all models had values above 0.05, and 23 percent had values exceeding 0.08. Thus, a sizable proportion of published models falls short of what Browne and Cudeck (1993) call a reasonable fit.

In an effort to ascertain determinants of the overall goodness of fit of a model, χ²/df, GFI, AGFI, MC, RMSEA, BBI, TLI, CFI, RFI, and IFI were correlated with the variables listed in Table 1 and at the bottom of Table 2.⁷ Two important sets of results emerged from this analysis. First, χ²/df was fairly strongly correlated with sample size (r = 0.47). This confirms the problematic dependence of the χ² test on sample size. In addition, AGFI (r = 0.22), BBI (r = 0.31), and RFI (r = 0.35) had a significant and nonnegligible relationship with sample size, corroborating earlier theoretical discussions and simulation evidence (with the exception of GFI) that the means of the sampling distributions of these indices are a positive function of sample size (cf. Bollen, 1989; Marsh et al., 1988; McDonald and Marsh, 1990). Second, there were sizable (negative) effects of model complexity (in terms of number of observed variables, number of observed variables per factor, number of parameters estimated, degrees of freedom, and contribution of the measurement model to the overall number of degrees of freedom) on GFI and MC (|r|'s > 0.5), BBI (|r|'s > 0.4), AGFI (|r|'s > 0.3), and CFI, RFI, and IFI (|r|'s > 0.2). In contrast, χ²/df, RMSEA, and TLI were unaffected by model complexity. These results indicate that model complexity is an important factor contributing to the contingent nature of goodness-of-fit assessments, and they suggest that general rules of thumb (e.g., that GFI or BBI be greater than 0.9) may be misleading because they ignore such contingencies. On the positive side, χ²/df, RMSEA, and TLI seem to be effective in controlling for model complexity by assessing fit per degree of freedom.

⁷ The possible determinants of the overall goodness of fit of a model were submitted to a principal components analysis to identify clusters of variables that behave similarly across samples and to aid in the interpretation of the effect of a given variable. This analysis showed, for example, that the number of observed variables, number of latent variables, number of parameters estimated, degrees of freedom, and contribution of the measurement model to the overall number of degrees of freedom loaded on a single factor, which reflects the complexity of a model.

Assessment of the measurement model. The quality of construct measurement is ascertained by looking at the sign, size, and significance of estimated factor loadings and the magnitude of measurement error. Various indices of reliability can be computed to summarize how well the constructs are measured by their indicators, either at the individual item level (individual-item reliability) or for all measures of a given construct jointly (composite reliability, average variance extracted; cf. Alwin and Jackson, 1979; Bagozzi and Yi, 1988; Fornell and Larcker, 1981; Steenkamp and van Trijp, 1991).
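The two model-based summaries just mentioned can be computed directly from a standardized solution. A sketch with hypothetical loadings (the formulas follow Fornell and Larcker, 1981):

```python
import numpy as np

# Illustrative sketch: composite reliability and average variance extracted
# (AVE) from standardized loadings and error variances. Loadings are
# hypothetical, not taken from any reviewed study.

def composite_reliability(loadings, error_vars):
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(error_vars, dtype=float)
    return lam.sum() ** 2 / (lam.sum() ** 2 + theta.sum())

def average_variance_extracted(loadings, error_vars):
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(error_vars, dtype=float)
    return (lam ** 2).sum() / ((lam ** 2).sum() + theta.sum())

# Three indicators of one construct, standardized solution:
lam = [0.8, 0.7, 0.9]
theta = [1 - l ** 2 for l in lam]   # standardized error = 1 - loading^2
print(round(composite_reliability(lam, theta), 3))       # 0.845
print(round(average_variance_extracted(lam, theta), 3))  # 0.647
```

Because both quantities are functions of the estimated model parameters, they reflect the measurement model actually fitted, which is the advantage over coefficient α noted in the discussion that follows.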
We initially attempted to compute summary measures of measurement reliability for a given model, but this proved too difficult because of the diversity of approaches used by different authors to assess the quality of construct measurement, which makes comparisons across studies almost impossible. For example, in some cases reliability is computed based on estimated model parameters. In other cases, coefficient α is reported for composites of items that are used either as single-indicator 'constructs' or as multi-item individual indicators. Eventually, we only coded whether authors showed any concern for measure reliability by reporting at least one of the various possible reliability indices. Overall, 78 percent of all applications mentioned some form of reliability assessment (see Table 2). The figures for type I, type II, and type III models were 78, 70, and 82 percent, respectively. The practice of examining measure reliability is least common in type II models, which is not surprising since construct measurement is not modeled explicitly in this case and since reliability cannot be assessed if true single-item measures are used in the analysis. Although reliability assessment is more common in type I and type III models, there is room for improvement even in these cases.

We recommend that in the future authors report at least one measure of construct reliability which is based on estimated model parameters (e.g., composite reliability, average variance extracted). Coefficient α is generally an inferior measure of reliability since in most practical cases it is only a lower bound on reliability. In particular, if coefficient α is entered into the analysis as an external estimate of reliability, it will usually exaggerate the unreliability of measurement. Furthermore, in cases where composites of measures are used as individual items, authors should report supplementary evidence on unidimensionality based on factor analyses. Coefficient α is insufficient for this purpose since a scale may not be unidimensional even if it has high reliability (Gerbing and Anderson, 1988).

Assessment of the latent variable model. In models of type II and III, the latent variable model represents the hypotheses of interest. The hypotheses are tested by examining the sign, size, and statistical significance of the structural coefficients. In addition, it is useful to report the percentage of variation in the endogenous constructs accounted for by the exogenous constructs (i.e., the R² for each structural equation).
If the model is nonrecursive, these figures have to be interpreted with caution (cf. Teel et al., 1986). Several authors (e.g., Fornell, 1983) have stressed the importance of distinguishing between variance fit (explained variance in the endogenous variables) and covariance fit (overall goodness-of-fit statistics testing the applicability of the overidentifying restrictions imposed on the model, such as the χ² test). It seems that authors are sometimes not completely clear on the difference between the two and that the emphasis on covariance fit detracts from a proper concern for variance fit. We specifically coded whether papers reported evidence on variation accounted for, at least for some of the endogenous variables. For type II models this was the case in only 30 percent of all cases, and for type III models the figure was 45 percent. Although we do not have any hard evidence on why these figures are so low, we suspect that one reason might be that, since the goodness of fit of the overall model was assessed by means of χ² and related statistics, authors see no need to report what in regression terminology would be called the goodness of fit of each structural equation (i.e., R²). Since covariance fit says nothing about variance fit - a model might fit well but not explain significant amounts of variation in endogenous variables, or conversely fit poorly and explain a large portion of the variance in endogenous variables (Fornell, 1983) - it is recommended that authors report the R² for each structural equation. We hasten to add, however, that the amount of variance explained by a structural equation is only one consideration in evaluating a model and that the meaningfulness of individual structural parameters is probably of greater importance in most cases.

Model modification. It is quite unlikely that the model that is initially specified as a plausible representation of the data (with all items that were collected to measure a given construct included in the analysis and all model parameters estimated based purely on a priori considerations) will be the one that is eventually presented as the most parsimonious summary of the data. There is, however, considerable variation across studies in how readily this fact is acknowledged. Some authors describe in great detail how the measurement model was purified and what modifications were made to the structural model to obtain acceptable goodness-of-fit statistics.
Other authors present only the final model and provide no evidence on the process that led to it. It is thus difficult to evaluate how much the initial model was modified. Various tools are available to locate model misspecifications, including modification indices and residual analysis. Using modification indices, which are reported routinely in the output of LISREL and other programs, researchers can conduct specification searches almost automatically and possibly improve the fit of a model to acceptable levels. Unfortunately, simulation work by MacCallum (1986) and Homburg and Dobratz (1992) indicates that specification searches (particularly modifications of the latent variable model) can go astray and fail to uncover the correct underlying model, particularly when the original model has many specification errors, when the sample size is small, and when the search is guided solely by a desire to improve the overall fit of the model. Since models arrived at through specification searches are rarely cross-validated, it is quite possible that authors are capitalizing on chance when respecifying models that have been found to be lacking in fit.

We recommend that model modifications be strongly guided by substantive considerations and that constraints having large modification indices be relaxed only if the resulting parameter change is theoretically and practically meaningful. The expected parameter change statistic available in some computer programs should prove helpful in this regard (Kaplan, 1990). As argued by Cudeck and Browne (1983), SEM is best conducted in the form of comparisons among different plausible models that are nested in each other and can be justified theoretically. At the structural level, the decision-tree framework suggested by Anderson and Gerbing (1988) may be quite helpful in this regard. Besides avoiding the dangers of specification searches which are not guided by theory, the comparison of different models should also guard against the frequently encountered but mistaken notion that a theoretical model which achieves an acceptable (covariance) fit has somehow been shown to be the most plausible representation of the data.
Such thinking ignores the very real possibility of model equivalence (i.e., two different parametric structures with possibly very different theoretical implications summarize the data equally well, as shown by identical χ² values; cf. Stelzl, 1986; Luijben, 1991) or near model equivalence (i.e., two different models are not formally equivalent but are more or less equally consistent with the data; cf. Breckler, 1990).

As shown in Table 2, in 54 percent of all applications authors acknowledged some form of specification search. These included deletion of items because of low item-total correlations or bad performance in exploratory or confirmatory factor analyses, specifications on the basis of preliminary explorations of the data (e.g., through exploratory factor analysis), respecifications of the measurement model on the basis of modification indices or residual analysis (e.g., introduction of correlated measurement errors), addition of structural paths or correlated errors in equations because some overidentifying restrictions were not met, and pruning of the model of nonsignificant parameters on the basis of t-values or χ² difference tests. Model comparisons were performed in 31 percent of all cases, and cross-validation - defined broadly as the estimation of the same model on at least two sets of data, with no requirement that the two models be compared explicitly using multisample analysis procedures or the cross-validation approaches suggested by Browne and Cudeck (1989), Cudeck and Browne (1983), and Homburg (1991) - was conducted in only 21 percent of the cases. The low incidence of cross-validation, coupled with the presumably frequent practice of searching for an acceptable model specification (we would venture to guess that most models reported in the literature have been modified at least to some extent), implies that the replicability of findings obtained through SEM may often be doubtful (cf. Steiger, 1990). Furthermore, the figure on how often a target model is compared to alternative specifications suggests that the benefits of model comparisons have not been realized by many authors. We recommend that in the future authors be frank in their reporting of how they arrived at their final model and that a strategy of model comparison be adopted whenever possible. The expected cross-validation index of Browne and Cudeck (1989, 1993) and related model selection criteria may prove helpful in this regard.

Residual analysis. Traditionally, residual analysis in the context of SEM has referred to an examination of the difference between the observed (sample) variance-covariance matrix and the estimated variance-covariance matrix implied by a particular model specification. This is in contrast to regression analysis, where residual analysis refers to an examination of the difference between observed and predicted values of a dependent variable. Recent work by Bollen and Arminger (1991) shows that model-based residual analysis (of both estimated errors in
156

H. Baumgartner, C. Homburg/Intern. J. of Research in Marketing 13 (1996) 139-161

~
"-'= ~

,~
6

~
~.

~,=
.-=.-=

~,~._

~,.~-

~ -~,..~

._
~'~.

~I.~

." ~

.-._- ~ ~

~.~

-~ :
~-:~.= ,,

~,'. ,~....

.~ ~o ,,

.~

~~~I

~.~N

.~-

~
;< ,e

ea

-~~
o ~a

.~
~

~.~
~

~ ~
,-

- - ~ ~ _~~

=, ~: ~
'

.~

~ ~.

~ ~=.~- ~

o.~ , . ~ ~ ~

~
o

'~
"~-~ '~ ~ ~ 'a ,~ ~.,--

$.
.=_

~.~
.=_
E

H. Baumgartner, C. Homburg~Intern. J. of Research in Marketing 13 (1996) 139-161

157

=~=

.-.-_

,~

.~. ~.

,~ .,-.

f~

0 ~'~

'~

:" =

"-'

''

"~

,~

"-

~'~-

o=

~.~

,I

.~ ~ ~ . ~
,-

~ ~ ~ ~
,,, ,~ ~ ~,

,~ ~ . ~
~ ,,,

,,,,~

. ~ ~

_.,,

,~ ~

"~

.~

_~ ~.

-~

158

H. Baumgartner, C. Homburg/Intern. J. of Research in Marketing 13 (1996) 139-161

variables and errors in equations) may be useful in the detection of outliers and influential cases and in the assessment of assumptions such as normality. The LISRES macro discussed previously (Davis, 1992) implements the Bollen and Arminger (1991) ideas and allows researchers to conduct sensitivity analysis on their model. This work is too recent to be reflected in previous applications of SEM, but we recommend that researchers give serious consideration to using these tools in their work.
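The traditional covariance-residual logic described above is easy to make concrete. The sketch below (all loadings, error variances, and sample covariances are invented for illustration) computes the covariance matrix implied by a simple one-factor model and subtracts it from a sample covariance matrix; large residuals would point to local misfit.

```python
# Sketch: residuals between a sample covariance matrix S and the
# covariance matrix implied by a one-factor model (invented numbers).
# Implied covariances: sigma_ij = l_i * l_j * phi   (i != j)
#                      sigma_ii = l_i^2 * phi + theta_i

loadings = [0.8, 0.7, 0.6]    # hypothetical factor loadings
phi = 1.0                     # factor variance (fixed to 1 for scaling)
thetas = [0.36, 0.51, 0.64]   # hypothetical measurement error variances

S = [[1.00, 0.58, 0.46],      # invented sample covariance matrix
     [0.58, 1.00, 0.40],
     [0.46, 0.40, 1.00]]

def implied(i, j):
    """Model-implied covariance between indicators i and j."""
    sigma = loadings[i] * loadings[j] * phi
    return sigma + (thetas[i] if i == j else 0.0)

residuals = [[S[i][j] - implied(i, j) for j in range(3)] for i in range(3)]
for row in residuals:
    print(["%.3f" % r for r in row])
```

Note that these are the classical covariance residuals; the Bollen and Arminger (1991) observational residuals mentioned above are computed per case, not per covariance element.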

4. Discussion

Structural equation modeling has become an established component of the methodological repertoire of marketing and consumer behavior researchers. There are at least two features that make SEM an attractive candidate for purposes of data analysis. First, SEM allows the researcher to take into account explicitly the inherent fallibility of behavioral science data and to assess and correct for measure unreliability, provided multiple indicators of each construct are available. Second, SEM makes it possible to investigate in a straightforward fashion comprehensive theoretical frameworks in which the effects of constructs are propagated across multiple layers of variables via direct, indirect, or bi-directional paths of influence. These advantages, coupled with the development of ever more sophisticated, yet surprisingly user-friendly computer programs to estimate and test such models, make it rather likely that SEM will enjoy widespread use in future research. As with any other research tool that offers powerful data analysis capabilities, however, SEM has to be used prudently if researchers want to take full advantage of its potential. Based on prior methodological discussions, our own work in the area, and particularly the present review of previous applications of SEM in the Journal of Marketing, Journal of Marketing Research, International Journal of Research in Marketing, and the Journal of Consumer Research, we would offer the following general guidelines to future users of these techniques.

First, careful thought should be given to model specification issues before empirical data are ever collected. Experience indicates that structural equation models are most profitably specified for relatively well-defined theoretical frameworks of moderate complexity in which each construct is measured by a fairly compact set of indicators. SEM is usually not the most useful technique in the early, exploratory stages of research, when the measurement structure underlying a set of items is not well established and theoretical guidance concerning possible patterns of relationships among constructs is lacking. Furthermore, although fairly complex models are specified quite easily, it is generally advisable to refrain from formulating models that are too grandiose, because the analysis easily degenerates into an exercise in data mining, resulting in models with suspect statistical properties and questionable substantive implications (cf. Bentler and Chou, 1987).

Second, once the data are available, they should be screened carefully before a moment matrix is computed and particular models of interest are investigated. We specifically recommend that researchers make greater use of the diagnostic tools that are beginning to appear in the literature (e.g., outlier analysis as implemented in the LISRES macro) and that some evidence of approximate normality based on skewness and kurtosis be presented in the paper (as reported in PRELIS, EQS, etc.). As pointed out by Martin (1987), among others, the powerful capabilities of SEM derive partly from highly restrictive simplifying assumptions, and researchers have to make sure that these assumptions are not too grossly violated.

Third, with regard to model estimation and testing, the consistency of a given model specification with empirical data should be assessed in terms of a variety of global and local fit measures, alternative theoretical models should be considered whenever possible, and research results should be cross-validated, particularly when the original model formulation was revised considerably through specification searches.
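The normality screen recommended in the second guideline above can be sketched in plain Python (the data are simulated here; the cutoffs in the final comment are a commonly cited rule of thumb, not a prescription from this article):

```python
# Sketch: univariate skewness and excess kurtosis as a rough normality
# screen before normal-theory SEM estimation (in the spirit of the
# summaries reported by PRELIS or EQS). The data below are simulated.
import math
import random

def skewness(x):
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return sum((v - m) ** 3 for v in x) / (n * s ** 3)

def excess_kurtosis(x):
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / n
    return sum((v - m) ** 4 for v in x) / (n * s2 ** 2) - 3.0

random.seed(1)
item = [random.gauss(0, 1) for _ in range(500)]   # roughly normal indicator
print(round(skewness(item), 2), round(excess_kurtosis(item), 2))
# Large absolute values (a common rule of thumb flags |skewness| > 2 or
# |excess kurtosis| > 7) would argue against normal-theory ML estimation.
```

In practice one would of course also examine multivariate non-normality, which the dedicated programs report as well.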
Steiger (1990) argues this point most forcefully by stating, "Perhaps a moratorium should be declared on publication of causal modeling articles using any PMM [post hoc model modification] procedure ... unless such articles provide evidence of cross-validation" (p. 176). Although the replicability of findings is an important concern of scientific research in general, the ease with which data-driven model modifications are conducted using modification indices and related statistics suggests that special attention be paid to this issue in future applications of SEM. Table 3 summarizes the major problem areas identified in our review of previous applications of SEM and offers specific recommendations about how to improve the practice of latent variable modeling.

As this review of previous applications of SEM in four major marketing and consumer behavior journals has shown, LISREL and related techniques have already had a substantial impact on empirical research in our discipline. The question arises whether the widespread use of SEM has had a positive influence on research from a substantive perspective.⁸ Several authors have voiced critical comments in this regard. For example, Martin (1987) suggests that the complexities of SEM may place undue emphasis on methodological aspects and divert attention from sound theorizing. Particularly in earlier applications of SEM, authors often felt compelled to explain the basics of the new technique in great detail (usually with the help of most letters of the Greek alphabet), and given the space constraints in journals, the result probably was a decreased concern with theory development. It is hoped that as familiarity with the methodology increases, researchers will be able to present technical matters more succinctly and accord greater attention to substantive issues.

Another problem has been the careless use of causal terminology (Biddle and Marlin, 1987; Breckler, 1990; Cliff, 1983; Martin, 1987). Since SEM is almost always based on correlational data and since the vast majority of studies use data collected in a single wave, causal conclusions are usually unwarranted, and it is probably best not to use the term causal modeling. A related problem occurs in connection with the interpretation of a model that has been found to fit the data.
Because of such issues as model equivalence or near model equivalence, a good fit should not be interpreted to imply that the proposed model is the 'true' representation of the structure underlying the data (Breckler, 1990). The only legitimate conclusion is that the proposed model is one possible plausible account of the data. By grounding hypothesized patterns of effects strongly in extant theoretical frameworks, by using panel designs in which data from the same subjects are collected at multiple points in time so that certain patterns of influence can be ruled out, and by comparing the proposed model to various competing specifications, the researcher can take steps to safeguard against likely alternative explanations.

A final issue concerns the notion of SEM being a confirmatory method of analysis. If hypothesized models are truly specified a priori and no data-based model modifications are introduced, SEM is indeed used in a confirmatory manner. However, in practice, respecifications of either the measurement model or the latent variable model or both are quite common, and SEM is then used in a more exploratory fashion. The conclusion that a particular model has been 'confirmed' by the data becomes suspect in such cases, and cross-validation is necessary to ascertain how well the model will hold up in an actual confirmatory analysis (cf. Biddle and Marlin, 1987; Breckler, 1990).

Although some authors have taken a rather pessimistic view on the value of regression models and related techniques for empirical research (Freedman (1991), for example, argues that they "make it all too easy to substitute technique for work" (p. 300)), we believe that on balance researchers have put the powerful capabilities of SEM to good use and that empirical work has benefited from the application of this methodology. In particular, we think that the explicit emphasis on multi-item measurement of constructs and the resultant ability to assess the validity and reliability of construct measurement has done much to bring the importance of these issues to the attention of marketing and consumer behavior researchers and to establish what Ray (1979) calls a marketing measurement tradition. It is clear that valid and reliable measurement is a prerequisite to theory testing, and SEM has certainly contributed to theory development in this sense (see also Bagozzi, 1984). We hope that our review of prior applications of SEM will further improve the quality of empirical research in marketing and consumer behavior and ultimately advance our understanding of substantive phenomena.

⁸ We thank an anonymous reviewer for suggesting a discussion of these issues.


Acknowledgements

The authors thank Nirmalya Kumar, Jan-Benedict Steenkamp, two anonymous reviewers, and the editor for helpful comments on previous versions of this paper.

References

Alwin, D.F. and D.J. Jackson, 1979. Measurement models for response errors in surveys: Issues and applications. In: K.F. Schuessler (ed.), Sociological Methodology 1980, 68-119. San Francisco: Jossey-Bass.
Anderson, J.C. and D.W. Gerbing, 1984. The effects of sampling error on convergence, improper solutions and goodness-of-fit indices for maximum likelihood confirmatory factor analysis. Psychometrika 49, 155-173.
Anderson, J.C. and D.W. Gerbing, 1988. Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin 103, 411-423.
Bagozzi, R.P., 1980. Causal models in marketing. New York: Wiley.
Bagozzi, R.P., 1984. A prospectus for theory construction in marketing. Journal of Marketing 48, 11-29.
Bagozzi, R.P. and H. Baumgartner, 1994. The evaluation of structural equation models and hypothesis testing. In: R.P. Bagozzi (ed.), Principles of Marketing Research, 386-422. Cambridge, MA: Blackwell.
Bagozzi, R.P. and T.F. Heatherton, 1994. A general approach to representing multifaceted personality constructs: Application to state self-esteem. Structural Equation Modeling 1, 35-67.
Bagozzi, R.P. and Y. Yi, 1988. On the evaluation of structural equation models. Journal of the Academy of Marketing Science 16, 74-94.
Bagozzi, R.P. and Y. Yi, 1989. On the use of structural equation models in experimental designs. Journal of Marketing Research 26, 271-284.
Bagozzi, R.P. and Y. Yi, 1991. Multitrait-multimethod matrices in consumer research. Journal of Consumer Research 17, 426-439.
Bagozzi, R.P., Y. Yi, and L.W. Phillips, 1991. Assessing construct validity in organizational research. Administrative Science Quarterly 36, 421-458.
Bass, F.M., 1969. A new product growth model for consumer durables. Management Science 15, 215-227.
Bentler, P.M., 1980. Multivariate analysis with latent variables: Causal modeling. Annual Review of Psychology 31, 419-456.
Bentler, P.M., 1989. EQS: Structural equations program manual. Los Angeles, CA: BMDP Statistical Software.
Bentler, P.M., 1990. Comparative fit indexes in structural models. Psychological Bulletin 107, 238-246.
Bentler, P.M. and D.G. Bonett, 1980. Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin 88, 588-606.
Bentler, P.M. and C.P. Chou, 1987. Practical issues in structural modeling. Sociological Methods and Research 16, 78-117.
Biddle, B.J. and M.M. Marlin, 1987. Causality, confirmation, credulity, and structural equation modeling. Child Development 58, 4-17.
Bollen, K.A., 1989. Structural equations with latent variables. New York: Wiley.
Bollen, K.A. and G. Arminger, 1991. Observational residuals in factor analysis and structural equation models. In: P.V. Marsden (ed.), Sociological Methodology 1991, 235-262. Washington: American Sociological Association.
Bollen, K.A. and J.S. Long, 1993. Introduction. In: K.A. Bollen and J.S. Long (eds.), Testing structural equation models, 1-9. Newbury Park, CA: Sage.
Breckler, S.J., 1990. Applications of covariance structure modeling in psychology: Cause for concern? Psychological Bulletin 107, 260-273.
Browne, M.W., 1982. Covariance structures. In: D.M. Hawkins (ed.), Topics in applied multivariate analysis, 72-141. Cambridge, England: Cambridge University Press.
Browne, M.W., 1984. Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology 37, 62-83.
Browne, M.W. and R. Cudeck, 1989. Single sample cross-validation indices for covariance structures. Multivariate Behavioral Research 24, 445-455.
Browne, M.W. and R. Cudeck, 1993. Alternative ways of assessing model fit. In: K.A. Bollen and J.S. Long (eds.), Testing structural equation models, 136-162. Newbury Park, CA: Sage.
Browne, M.W. and G. Mels, 1992. RAMONA user's guide. Department of Psychology, Ohio State University, Columbus, OH.
Cliff, N., 1983. Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research 18, 115-126.
Cudeck, R., 1989. Analysis of correlation matrices using covariance structure models. Psychological Bulletin 105, 317-327.
Cudeck, R. and M.W. Browne, 1983. Cross-validation of covariance structures. Multivariate Behavioral Research 18, 147-167.
Davis, W.R., 1992. The LISRES macro. Unpublished manuscript, University of North Carolina.
Dillon, W.R., 1986. Building consumer behavior models with LISREL: Issues in applications. In: D. Brinberg and R.J. Lutz (eds.), Perspectives on Methodology in Consumer Research, 107-154. New York: Springer.
Duncan, O.D., 1975. Introduction to structural equation models. New York: Academic Press.
Fornell, C., 1983. Issues in the application of covariance structure analysis. Journal of Consumer Research 9, 443-448.
Fornell, C. and F.L. Bookstein, 1982. Two structural equation models: LISREL and PLS applied to consumer exit-voice theory. Journal of Marketing Research 19, 440-452.
Fornell, C. and D.F. Larcker, 1981. Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research 18, 39-50.
Fraser, C., 1980. COSAN user's guide. Centre for Behavioral Studies, University of New England, Armidale, New South Wales, Australia.
Freedman, D.A., 1987. As others see us: A case study in path analysis. Journal of Educational Statistics 12, 101-128.
Freedman, D.A., 1991. Statistical models and shoe leather. In: P.V. Marsden (ed.), Sociological Methodology 1991, 291-313. Oxford, England: Basil Blackwell.
Gerbing, D.W. and J.C. Anderson, 1988. An updated paradigm for scale development incorporating unidimensionality and its assessment. Journal of Marketing Research 25, 186-192.
Gerbing, D.W. and J.C. Anderson, 1993. Monte Carlo evaluations of goodness-of-fit indices for structural equation models. In: K.A. Bollen and J.S. Long (eds.), Testing structural equation models, 40-65. Newbury Park, CA: Sage.
Hattie, J.A., 1985. Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement 9, 139-164.
Homburg, C., 1991. Cross-validation and information criteria in causal modeling. Journal of Marketing Research 28, 137-144.
Homburg, C. and A. Dobratz, 1992. Covariance structure analysis via specification searches. Statistical Papers 33, 119-142.
Hu, L., P.M. Bentler, and Y. Kano, 1992. Can test statistics in covariance structure analysis be trusted? Psychological Bulletin 112, 351-362.
Jöreskog, K.G., 1993. Testing structural equation models. In: K.A. Bollen and J.S. Long (eds.), Testing structural equation models, 294-316. Newbury Park, CA: Sage.
Jöreskog, K.G. and D. Sörbom, 1993a. LISREL 8: User's reference guide. Mooresville, IN: Scientific Software.
Jöreskog, K.G. and D. Sörbom, 1993b. PRELIS: A program for multivariate data screening and data summarization. Mooresville, IN: Scientific Software.
Kaplan, D., 1990. Evaluating and modifying covariance structure models: A review and recommendation. Multivariate Behavioral Research 25, 137-155.
Kenny, D.A., 1979. Correlation and causality. New York: Wiley.
Luijben, T.C.W., 1991. Equivalent models in covariance structure analysis. Psychometrika 56, 653-665.
MacCallum, R., 1986. Specification searches in covariance structure modeling. Psychological Bulletin 100, 107-120.
Marsh, H.W., J.W. Balla, and R.P. McDonald, 1988. Goodness-of-fit indices in confirmatory factor analysis: Effects of sample size. Psychological Bulletin 103, 391-411.
Martin, J.A., 1987. Structural equation modeling: A guide for the perplexed. Child Development 58, 33-37.
McDonald, R.P., 1989. An index of goodness-of-fit based on noncentrality. Journal of Classification 6, 97-103.
McDonald, R.P. and H.W. Marsh, 1990. Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin 107, 247-255.
Mulaik, S.A., L.R. James, J. Van Alstine, N. Bennett, S. Lind, and C.D. Stilwell, 1989. Evaluation of goodness-of-fit indices for structural equation models. Psychological Bulletin 105, 430-445.
Muthén, B.O., 1987. LISCOMP: Analysis of linear structural relations with a comprehensive measurement model. Mooresville, IN: Scientific Software.
Ray, M.L., 1979. The critical need for a marketing measurement tradition: A proposal. In: O.C. Ferrel, S.W. Brown, and C.W. Lamb (eds.), Conceptual and theoretical developments in marketing, 34-48. Chicago, IL: American Marketing Association.
Schaubroeck, J., 1990. Investigating reciprocal causation in organizational behavior research. Journal of Organizational Behavior 11, 17-28.
Schoenberg, R.J., 1989. LINCS: Linear covariance structure analysis. User's guide. Kent, WA: RJS Software.
Sharma, S., S. Durvasula, and W.R. Dillon, 1989. Some results on the behavior of alternate covariance structure estimation procedures in the presence of non-normal data. Journal of Marketing Research 26, 214-221.
Steenkamp, J.B. and H. van Trijp, 1991. The use of LISREL in validating marketing constructs. International Journal of Research in Marketing 8, 283-299.
Steiger, J.H., 1989. EZPATH causal modeling: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: SYSTAT.
Steiger, J.H., 1990. Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research 25, 173-180.
Stelzl, I., 1986. Changing a causal hypothesis without changing the fit: Some rules for generating equivalent path models. Multivariate Behavioral Research 21, 309-331.
Tanaka, J.S., 1993. Multifaceted conceptions of fit in structural equation models. In: K.A. Bollen and J.S. Long (eds.), Testing structural equation models, 10-39. Newbury Park, CA: Sage.
Teel, J.E., W.O. Bearden, and S. Sharma, 1986. Interpreting LISREL estimates of explained variance in nonrecursive structural equation models. Journal of Marketing Research 23, 164-168.
Tucker, L.R. and C. Lewis, 1973. A reliability coefficient for maximum likelihood factor analysis. Psychometrika 38, 1-10.
