


Language Assessment Quarterly


Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/hlaq20

Structural Equation Modeling in Language Testing and Learning Research: A Review


Yo In'nami & Rie Koizumi

Toyohashi University of Technology

Tokiwa University

Version of record first published: 16 Aug 2011.

To cite this article: Yo In'nami & Rie Koizumi (2011): Structural Equation Modeling in Language Testing and Learning Research: A Review, Language Assessment Quarterly, 8:3, 250-276. To link to this article: http://dx.doi.org/10.1080/15434303.2011.582203


Language Assessment Quarterly, 8: 250-276, 2011
Copyright © Taylor & Francis Group, LLC
ISSN: 1543-4303 print / 1543-4311 online
DOI: 10.1080/15434303.2011.582203

Structural Equation Modeling in Language Testing and Learning Research: A Review


Yo In'nami
Toyohashi University of Technology


Rie Koizumi
Tokiwa University

Despite the recent increase of structural equation modeling (SEM) in language testing and learning research and Kunnan's (1998) call for the proper use of SEM to produce useful findings, there seem to be no reviews about how SEM is applied in these areas or about the extent to which the current application accords with appropriate practices. To narrow these gaps, we investigated the characteristics of the use of SEM in language testing and learning research. Electronic and manual searches of 20 journals revealed 50 articles containing a total of 360 models analyzed using SEM. We discovered that SEM was most often used to investigate learners' strategy use and trait/test structure. Maximum likelihood methods were most often used to estimate parameters of a model; model fit indices of chi-squares, comparative fit index, root mean square error of approximation, and Tucker-Lewis index were often reported, but standardized root mean square residual rarely was. Univariate and multivariate normality checks were infrequently reported, as was missing data treatment. Sample sizes, when judged according to Kline's (2005) and Raykov and Marcoulides's (2006) guidelines, were in most cases adequate, and LISREL was the most widely used program. Two recommendations are provided for the better practice of using and reporting SEM for language testing and learning research.

INTRODUCTION

Structural equation modeling (SEM), also known as covariance structure analysis or simultaneous equation modeling, is a statistical technique for examining the nature of the relationships among observed and latent variables that applies a confirmatory, hypothesis-testing approach to the data (e.g., Bollen, 1989b; Byrne, 2006; Ullman, 2007). SEM has been widely used in many areas, such as marketing (e.g., Hulland, Chow, & Lam, 1996; Williams, Edwards, & Vandenberg, 2003) and psychology (e.g., Breckler, 1990; MacCallum & Austin, 2000). In language testing and learning research, SEM was introduced (Kunnan, 1998) and has been reviewed along with other statistical methods (e.g., Bachman, 1998; Bachman & Eignor, 1997; Dörnyei, 2007; Lumley & Brown, 2005).

Correspondence should be sent to Yo In'nami, Institute of Liberal Arts and Sciences, Toyohashi University of Technology, Toyohashi, Aichi 441-8580, Japan. E-mail: innami@las.tut.ac.jp



It has been used primarily to examine (a) the factor structure of the abilities assessed by tests or learner characteristics collected via questionnaires (e.g., Gorsuch, 2000; Lee, 2005; Purpura, 1997; Sasaki, 1996; Schoonen et al., 2003), (b) the effect of different constructs and test methods on test performance (e.g., Bachman & Palmer, 1981; Llosa, 2007; Sawaki, 2007), (c) the equivalency of models across different populations (e.g., Bae & Bachman, 1998; Ginther & Stevens, 1998; Kunnan, 1995; Shin, 2005), and (d) the effects of learning context on language development across time (Matsumura, 2003; Ross, 2005). Reflecting the confirmatory, hypothesis-testing nature of SEM, Fornell (1982) stated that SEM is best used if researchers base their model on what previous research has found, and this process results in the effective accumulation and advancement of knowledge. Such accumulation and advancement of knowledge presupposes the proper use of SEM for the models tested. Further, Kunnan (1998) argued that methods and techniques used in any academic field must be reviewed to determine if they are skillfully used in order to clarify problems in the field. Because there seem to be no reviews in internationally accessible journals about how SEM is used in the fields of language testing and learning, this article attempts to fill this gap by reviewing articles using SEM in those fields.

LITERATURE REVIEW

SEM

SEM is a statistical method to examine various relationships that are hypothesized among sets of variables (e.g., Bollen, 1989b; Byrne, 2006; Ullman, 2007). In each analysis, a researcher hypothesizes, based on previous findings, a model showing relationships among variables and tests whether the model is supported by sample data. According to Byrne (2006), SEM is widely used across various fields for four reasons. First, SEM takes a confirmatory, hypothesis-testing approach to the data, in contrast to traditional analysis, such as exploratory factor analysis, where analysis is data driven. Second, SEM is designed to correct for measurement errors of variables. The results would allow a researcher to interpret the relationship among variables, separating the measurement errors. Third, SEM can analyze both unobserved (i.e., latent) and observed variables. This contrasts with path analysis, which enables researchers to model only observed variables. Latent variables are used to define factors or constructs. Fourth, multivariate relations or indirect effects can be analyzed using SEM, whereas no other statistical methods can easily do this. Investigation into multivariate relations may include models where correlations are hypothesized only among a certain set of variables. Investigating indirect effects may include determining whether an independent variable directly affects a dependent variable or whether it does so through a mediating variable. Path analysis can be used to model these multivariate relations or indirect effects with observed variables, but it cannot be used to conduct analyses using unobserved variables.

Five Steps in an SEM Application

Bollen and Long (1993) described five steps involved in an SEM application: (a) model specification, (b) model identification, (c) parameter estimation, (d) model fit, and (e) model respecification.


First, the model specification step requires a researcher to develop a model based on previous research in the area. This reflects the confirmatory, theory-driven nature of SEM. Second, the identification step concerns whether the model can obtain a unique value for each parameter in the model for which the value is unknown using the variance/covariance matrix (or the correlation matrix and standard deviations) of the measured variables. If there are too many parameters to be estimated relative to the number of variances and covariances in the matrix, the model is underidentified and cannot be tested. Third, the parameter estimation step involves obtaining an estimate for each parameter. Several estimation methods are available, including the maximum likelihood method and the generalized least squares method, both for multivariate normal data, and the robust maximum likelihood method for nonnormal data. The fourth step tests the degree to which the model fits the data. Fifth, the model respecification step explores improving the model-data fit, for example, by deleting statistically nonsignificant paths or adding possible paths. Such model respecification must be theoretically supported in the same way as model specification. Because the fourth step of testing model fit with the data is conducted by inspecting various types of fit indices and because it has been extensively discussed in the SEM literature, further explanation is necessary. A good fit can be indicated by a nonsignificant chi-square (χ²) value, but this index is known to be affected by sample size. To overcome this problem, many fit indices have been created, which can be classified into four types based on Byrne (2006) and Kline (2005). The first type, incremental or comparative fit indices, assesses the relative improvement in fit of the researcher's model compared with the null model, which assumes no covariances among the observed variables. Examples are the comparative fit index (CFI), the normed fit index (NFI), and the Tucker-Lewis index (TLI; also known as the nonnormed fit index). The second type, absolute fit indices, calculates a proportion of variance explained by the model in the sample variance/covariance matrix. Examples include the goodness-of-fit index (GFI) and the adjusted GFI. The third type, residual-based fit indices, assesses differences between observed and predicted variances and covariances. Examples are the standardized root mean square residual (SRMR) and the root mean square error of approximation (RMSEA). The fourth type, predictive fit indices, assesses the likelihood of the model to fit in similar-size samples from the same population. Examples include the Akaike Information Criterion (AIC), the Consistent Akaike Information Criterion (CAIC), and the Expected Cross-Validation Index (ECVI). For the first three types of fit indices, models consistently producing good results across many of these indices are considered good models. In contrast, the last type, predictive fit indices, is used to select the best model among competing models (i.e., among similar models using the same data). A common procedure is that if two models are nested, that is, if a path is deleted or added from Model A to form Model B, a chi-square difference test is used to compare the original and modified models; if two models are not nested, the predictive fit indices, including AIC, CAIC, and ECVI, are employed. Although no agreed-upon guidelines exist regarding which fit indices should be reported, reviewing the SEM literature provides some clues as to the appropriate reporting practices of fit indices.
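To make the nested-model comparison and the chi-square-based indices concrete, the short sketch below computes a chi-square difference test and an RMSEA point estimate from reported fit statistics. It is only an illustrative sketch using SciPy: the function names and example values are ours rather than taken from any reviewed study, and programs differ slightly in the RMSEA formula they implement (e.g., whether N or N - 1 appears in the denominator).

```python
from scipy.stats import chi2


def chi_square_difference(chi2_a, df_a, chi2_b, df_b):
    """Compare a restricted model (A) against the less restricted model (B) it is nested in."""
    delta_chi2 = chi2_a - chi2_b          # fit worsens (chi-square rises) as paths are removed
    delta_df = df_a - df_b                # number of parameters fixed between the two models
    p_value = chi2.sf(delta_chi2, delta_df)
    return delta_chi2, delta_df, p_value


def rmsea(chi2_value, df, n):
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return (max(chi2_value - df, 0.0) / (df * (n - 1))) ** 0.5


# Hypothetical values: Model B frees one extra path relative to Model A.
print(chi_square_difference(chi2_a=98.4, df_a=41, chi2_b=92.1, df_b=40))
print(round(rmsea(chi2_value=92.1, df=40, n=250), 3))
```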
Table 1 shows a list of works in English providing guidelines for fit indices that should be reported. Works reviewing the use of SEM but not presenting such guidelines explicitly are not listed. We found that the most often recommended indices are the chi-square, CFI, TLI, RMSEA (and its confidence interval), and SRMR. Note that it is assumed that chi-squares are reported along with degrees of freedom and p values; these three statistics are a group of indices used together to evaluate model fit.



TABLE 1
Fit Indices Recommended in the Literature

[In the original table, each row marks, for one work, the fit indices it recommends reporting: the basic chi-square; the incremental indices CFI, Gamma Hat, IFI, MC, NFI, RNI, TLI, and URFI; the absolute indices GFI, AGFI, and RMR; the residual-based indices RMSEA (with its confidence interval) and SRMR; the total number of works recommending each index; and the present authors' recommendation. The works covered are the simulation articles of Hu & Bentler (1998, 1999); the review articles of Hoyle & Panter (1995), Boomsma (2000), MacCallum & Austin (2000), McDonald & Ho (2002), Russell (2002), Martens (2005), Kahn (2006), Worthington & Whittaker (2006), Jackson et al. (2009), and Kashy et al. (2009); and the introductory textbooks of Kline (2005), Brown (2006), Ullman (2007), Mueller & Hancock (2008), Bandalos & Finney (2010), Lomax (2010), Mueller & Hancock (2010), and Widaman (2010). The individual check marks are not reproduced here.]

Note. Predictive fit indices are not listed because they are different in nature in that they are used to compare models. IFI, also called Bollen's fit index (BL89) and Bollen's delta 2, is less variable in small samples and more consistent across estimators than TLI (Bollen, 1989a). MC transforms the population noncentrality index into a range from 0 to 1 (McDonald, 1989). RNI and URFI are equivalent to CFI, except they can be below 0 or over 1, showing the extent of overfit (see Hoyle & Panter, 1995, for RNI; McDonald & Ho, 2002, for URFI). CFI = comparative fit index; IFI = incremental fit index; MC = McDonald's Centrality Index; NFI = normed fit index; RNI = relative noncentrality index; TLI = Tucker-Lewis index; URFI = Unbiased Relative Fit Index; GFI = goodness-of-fit index; AGFI = adjusted GFI; RMR = root mean square residual; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual.
a Boomsma (2000) argued that SRMR may be more preferable than RMSEA and TLI with a small sample size.
b Ullman (2007) also recommended reporting AIC and CAIC when comparing nonnested models.


Degrees of freedom and p values are employed to examine whether the obtained chi-square values are statistically significant, and both of them cannot stand alone. Thus, although some authors (e.g., Widaman, 2010) encourage reporting chi-squares and do not mention reporting corresponding degrees of freedom and p values, the need to report the latter two statistics together with chi-squares is implied. We found that Hu and Bentler (1998, 1999) were cited in 10 of the works listed in Table 1. Hu and Bentler (1998, 1999) recommended reporting SRMR, along with CFI, TLI, RMSEA, or other indices (i.e., Gamma Hat, Incremental Fit Index [IFI], McDonald's Centrality Index [MC], Relative Noncentrality Index [RNI]). There are at least four reasons for valuing SRMR reporting. First, Bentler (2006) argued that SRMR's intuitive interpretability is especially useful to readers who are familiar with correlations but not so much with fit indices. Hu and Bentler (1995) stated the following:


if the discrepancy between the observed correlations and the model-reproduced correlations are [sic] very small, clearly the model is good at accounting for the correlations no matter what the χ² or fit indexes seem to imply. For example, if the average of the absolute values of the discrepancy between observed and reproduced correlations is .02, it is simply a fact that the model explains the correlations to [sic] within an average error of .02. . . . We suggest that this descriptive information always accompany reports of model fit, to round out the more popularly used χ² and fit indexes. (p. 98)

Second, SRMR concurs well with the objective of SEM, which is to reproduce, as closely as possible, observed correlation/covariance matrices using model-estimated correlation/covariance matrices (e.g., Bollen, 1989b). Third, unlike other fit indices where chi-squares are used as part of the calculation (e.g., CFI, GFI, RMSEA), SRMR is not based on chi-squares. Rather, it is the average difference between the correlations observed in the input matrix and the correlations predicted by the model. Thus, SRMR can provide a unique perspective for the model fit. Fourth, Bentler (2006) argued that SRMR is also sensitive to model misspecification, although Fan and Sivo (2005) and Yuan (2005) disagreed with this. Considering the advantages of SRMR, reporting SRMR together with other statistics seems to be appropriate practice. In addition to Hu and Bentler (1998, 1999), several guidelines for reporting combinations of fit indices have been proposed. Hoyle and Panter (1995) recommended reporting chi-squares and Satorra-Bentler corrected chi-squares (and degrees of freedom and p values associated with them), CFI, TLI, and GFI when using maximum likelihood estimation and IFI instead of TLI when using generalized least squares estimation. Russell (2002) recommended reporting SRMR and one of the following indices: CFI, IFI, MC, RNI, TLI, or RMSEA. Kashy, Donnellan, Ackerman, and Russell (2009) recommended reporting CFI or TLI along with the chi-square and RMSEA. Bandalos and Finney (2010) recommended the chi-square, CFI, TLI, RMSEA, and SRMR, whereas Mueller and Hancock (2010) recommended RMSEA and its confidence interval, SRMR, and at least one of CFI, NFI, and TLI. Widaman (2010) encouraged reporting the chi-square, CFI, TLI, and RMSEA. For testing measurement invariance across groups (e.g., whether the factor loadings are the same across groups), Cheung and Rensvold (2002) recommended reporting CFI, Gamma Hat, and McDonald's Noncentrality Index and interpreting reduction in each index as evidence for measurement invariance. Following Hu and Bentler (1998, 1999) and the overall recommendations of other researchers, we argue that SRMR should be reported, along with the chi-square, CFI, TLI, or RMSEA (with its confidence interval), unless one has other explicit rationales for fit index selection, and that reporting two or more fit indices is preferable.
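Because SRMR is simply a summary of the residual correlations, it can be illustrated in a few lines. The sketch below uses made-up observed and model-implied correlation matrices and follows the common root-mean-square convention; programs differ slightly in whether the diagonal is included and in how standardization is handled, so treat this as a rough illustration rather than any program's exact formula.

```python
import numpy as np

# Made-up observed and model-implied correlation matrices for three indicators.
observed = np.array([[1.00, 0.52, 0.48],
                     [0.52, 1.00, 0.41],
                     [0.48, 0.41, 1.00]])
implied = np.array([[1.00, 0.49, 0.50],
                    [0.49, 1.00, 0.44],
                    [0.50, 0.44, 1.00]])

# Use the non-redundant (lower-triangular) elements of the residual matrix.
rows, cols = np.tril_indices_from(observed)
residuals = observed[rows, cols] - implied[rows, cols]
srmr = np.sqrt(np.mean(residuals ** 2))
print(f"SRMR = {srmr:.3f}")   # values near zero (conventionally < .08) indicate close fit
```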


Measurement Issues in SEM

According to Byrne (2006) and Kline (2005), the proper use of SEM must satisfy at least three statistical requirements. First, most estimation methods used in SEM require univariate and multivariate normality of data. Nonnormality can affect standard error estimates, chi-squares, and consequently model fit indices based on chi-squares. It can be checked by calculating univariate skewness and kurtosis and Mardia's multivariate kurtosis. Nonnormal data can be transformed and analyzed if normal distribution is expected in the population. If nonnormality is expected, it can be handled by using appropriate estimation methods (e.g., the robust maximum likelihood, which produces the Satorra-Bentler corrected chi-square statistic, or the robust weighted least squares). Second, missing data can be handled by listwise deletion (removal of all cases with missing values from subsequent analysis) or pairwise deletion (removal of paired cases, either or both of which has missing values). However, a more sophisticated approach would be full information maximum likelihood estimation, where all available data are analyzed regardless of incompleteness (e.g., Enders, 2001). Missing data can also be imputed by, for example, mean substitution, regression imputation, or an expectation maximization algorithm. It should be noted that listwise and pairwise methods work only when data are missing completely at random, although this assumption is often violated in practice (e.g., B. Muthén, Kaplan, & Hollis, 1987). Both full information maximum likelihood estimation and expectation maximization algorithm methods are effective for situations where the data are missing completely at random or missing at random (for more on missing data analysis, see Allison, 2003). Third, because parameter estimates and most fit indices are calculated using sample size, an adequate sample size is necessary. No definitive rules of a necessary sample size have been established because sample size and many variables intricately affect the fit (MacCallum & Austin, 2000). However, Kline (2005) heuristically stated that a sample size below 100 is often considered small, a sample size between 100 and 200 is medium, and a sample size exceeding 200 is large. Similarly, Ding, Velicer, and Harlow (1995) stated that most previous studies agreed that 100 to 150 participants is the minimum sample size adequate for analysis. Further, Raykov and Marcoulides (2006) and many others considered model complexity, which is reflected in the number of free parameters needed to be estimated in a model. They argued that a minimum desirable sample size would be 10 times the number of free model parameters. For example, a model with 30 free parameters would require 300 observations or participants (10 × 30). Nevertheless, Brown (2006) stressed that these guidelines are crude and do not necessarily apply to the researcher's data and model, as requisite sample size varies with many variables, including the amount and patterns of missing data, strength of the relationships among the indicators, types of indicators (e.g., categorical or continuous) and estimators (e.g., [robust] maximum likelihood, robust weighted least squares), and reliability of the indicators.
Among more accurate methods for determining appropriate sample sizes, perhaps the most practical and accessible one for researchers of language testing and learning is Monte Carlo analysis, where the appropriateness of sample size is estimated under various conditions as just described by taking into account the statistical power and precision of model parameter estimates (for details, see L. K. Muthén & Muthén, 2002).
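As a rough illustration of how the three requirements can be screened before fitting a model, the sketch below checks univariate skewness and kurtosis, Mardia's multivariate kurtosis, listwise completeness, and the 10-cases-per-free-parameter heuristic. The file name, variable set, and number of free parameters are placeholders; Mardia's coefficient is computed from squared Mahalanobis distances under the usual formula, and a full analysis would of course also rely on the checks built into one's SEM program.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

data = pd.read_csv("item_scores.csv")            # hypothetical file of observed variables

# (1) Normality: per-variable skewness/kurtosis and Mardia's multivariate kurtosis.
print(data.apply(skew))
print(data.apply(kurtosis))                      # excess kurtosis; 0 under normality
x = data.dropna().to_numpy()
centered = x - x.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(x, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)   # squared Mahalanobis distances
mardia_b2p = np.mean(d2 ** 2)                    # compare with p * (p + 2) under normality
print(mardia_b2p, x.shape[1] * (x.shape[1] + 2))

# (2) Missing data: listwise deletion keeps only complete cases (simple but restrictive).
complete = data.dropna()
print(f"{len(data) - len(complete)} cases dropped listwise")

# (3) Sample size heuristic: at least 10 cases per free model parameter.
free_parameters = 30                             # read off the specified model
print("adequate by the 10:1 rule:", len(complete) >= 10 * free_parameters)
```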



In addition to the three requirements for SEM, there are two other issues that researchers need to keep in mind when using SEM. First, it is recommended that the variance/covariance matrix or the correlation matrix with standard deviations be provided in the article or be made available upon request. Access to such information allows researchers to reproduce the original model and examine alternative models not tested in the primary study (see In'nami & Koizumi, 2010, for further details). Second, the name and version of software used should be reported because calculation formulas and default methods sometimes differ across software programs (e.g., the calculation formula of AIC between Amos [Arbuckle, 1994-2011] and EQS [Bentler, 1994-2011], as explained in Brown, 2006) and even within different versions of the same software programs.

Current Study

Despite the increasing use of SEM in language testing and learning research and Kunnan's (1998) call for the proper use of SEM to produce useful findings, there seem to be no reviews in internationally accessible journals about how SEM is applied in these areas or about the extent to which the current application accords with appropriate practices. To narrow these gaps, we addressed the following research question: What are the characteristics of use of structural equation modeling in language testing and learning research? We investigated this in terms of (a) content analysis, (b) parameter estimation methods, (c) the types and frequency of model fit indices used, (d) normality checks, (e) missing data treatment, (f) comparison of actual and adequate sample sizes, and (g) software used. We also conducted content analysis to examine the topics in language testing and learning that have been researched using the SEM approach. Further, we focused on two of the five aforementioned steps in an SEM application: parameter estimation and model fit indices. This is because we did not have the range of content expertise to evaluate the theoretical justification of model specification and model respecification, and because all the models reported in articles are usually statistically identified (i.e., it is possible to compute a unique estimate of each parameter), so that it made little sense to incorporate this step in the review. By conducting this review, we hope to raise awareness about the proper use and reporting of SEM among researchers in the fields of language testing and learning.


METHOD

Article Collection

We searched for empirical studies using SEM in language testing and learning in October 2008 in two ways. First, we retrieved studies from the following 20 representative journals using Education Resources Information Center (ERIC), Linguistics and Language Behavior Abstracts (LLBA), and search engines available at journal home pages: Annual Review of Applied Linguistics, Applied Language Learning, Applied Linguistics, Assessing Writing, ELT Journal, Foreign Language Annals, International Journal of Applied Linguistics, International Review of Applied Linguistics in Language Teaching, Language Assessment Quarterly, Language Learning, Language Learning & Technology, Language Teaching, Language Teaching Research, Language Testing, Modern Language Journal, RELC Journal, Second Language Research, Studies in Second Language Acquisition, System, and TESOL Quarterly. The following keywords were independently used: causal analysis (analyses), causal model(s), causal modeling (modelling), confirmatory factor analysis (analyses), covariance structure(s),


covariance structure analysis (analyses), covariance structure model(s), simultaneous equation model(s), simultaneous equation modeling (modelling), structural equation model(s), and structural equation modeling (modelling). We compiled these sets of keywords based on the keywords and synonyms found in books, articles, our experiences, and feedback from colleagues. Abstract, title, and article keyword searches were used. A date range restriction was not imposed. Second, we conducted manual searches for relevant studies in these journals to crosscheck studies identified electronically. The reference list of empirical, theoretical, and review papers was further inspected for additional relevant materials. The literature search was limited to published journal articles, as we intended to examine the status quo based on internationally accessible published sources, although we were aware of many unpublished SEM studies such as Liao (2009) and Yamashiro (2002).


Analyses

The article was used as a unit of analysis. When multiple models existed in one article, information on the final model was coded. When several final models existed (i.e., multiple research questions were examined in one article, and each research question was accompanied by a final model, resulting in multiple final models in one article), one of the final models was randomly selected. However, results based on the model as a unit of analysis (i.e., counting all models included in an article) were also reported for reference. To calculate the minimum desirable sample size for SEM analyses, the number of free parameters in each model was calculated and multiplied by 10, following Raykov and Marcoulides (2006). The number of free parameters in a model was calculated by drawing a model with Amos 4.01 (Arbuckle, 1999) and entering (a) the corresponding correlation (with standard deviations) or variance/covariance matrix, or (b) the corresponding size (row × column) of a hypothetical matrix for a model. Conducting Monte Carlo studies to evaluate the appropriateness of sample size for each model was beyond the scope of this article, although we plan to undertake these in the near future. To check the consistency of coding many kinds of information on articles and models, a sample of 8% (30 models) of the 360 models (see Results and Discussion section) was independently examined by both authors. The agreement percentage and kappa coefficient for all the coding were 95% and .90, respectively. When the two authors disagreed, they discussed their decisions and rationale for giving their ratings while reconsidering the models for which coding discrepancy occurred. Most of the coded information, especially parameter estimation and model fit indices, appeared in the Methods, Results, or Discussion sections of an article and was explicitly reported. The remaining models were examined by the first author.
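For readers unfamiliar with the two agreement statistics just mentioned, the short sketch below computes percentage agreement and Cohen's kappa for two coders; the category labels and codes are invented for illustration and do not reproduce our actual coding sheet.

```python
import numpy as np

# Invented example codes from two coders for six models.
coder1 = np.array(["ML", "ML", "robust ML", "GLS", "ML", "not reported"])
coder2 = np.array(["ML", "ML", "robust ML", "ML", "ML", "not reported"])

observed_agreement = np.mean(coder1 == coder2)

categories = np.union1d(coder1, coder2)
p1 = np.array([np.mean(coder1 == c) for c in categories])   # coder 1 marginal proportions
p2 = np.array([np.mean(coder2 == c) for c in categories])   # coder 2 marginal proportions
chance_agreement = np.sum(p1 * p2)
kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)

print(f"agreement = {observed_agreement:.2f}, kappa = {kappa:.2f}")
```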

RESULTS AND DISCUSSION

SEM in Language Testing and Learning Studies

Nine of the 20 journals reviewed included articles using SEM. As Table 2 indicates, most articles using SEM appeared in Language Testing (42% of the articles; 21/50), followed by Language Learning (20%; 10/50).


TABLE 2
Number of Articles and Models by Journal

            AL   LAQ   LL    LT   MLJ   RELC   SSLA   System   TQ   Total
Article      4     1   10    21     6      1      1        2    4      50
Model       19     1   48   224    50      1      2        3   12     360

Note. The total number also applies to Tables 5, 7, and 8. AL = Applied Linguistics; LAQ = Language Assessment Quarterly; LL = Language Learning; LT = Language Testing; MLJ = Modern Language Journal; RELC = RELC Journal; SSLA = Studies in Second Language Acquisition; TQ = TESOL Quarterly.


TABLE 3
Number of Articles and Models by Year

           81   82   86   89   93   94   95   97   98   00   02   03   04   05   06   07   08   Total
Article     1    2    1    2    1    1    2    3    3    2    2    4    1    7    8    5    5      50
Model       4    9    9   20    7    8    7   24   26    9   17   29    1   59   79   29   23     360

This suggests that SEM has been used particularly in studies on language testing compared to studies on language education in general. Table 3 shows that the first article using SEM appeared in 1981. Since then, an increasing number of articles have used SEM. The total number of models from the 50 articles in the nine journals was 360. Those 50 articles are marked with asterisks in the references.

Research Question: What Are the Characteristics of Use of Structural Equation Modeling in Language Testing and Learning Research?

Content Analysis

Table 4 shows the frequency of articles by content area. These data were broadly divided into test-taker variables and test-related/task-related variables and reflected the diverse areas of language testing and learning that have been subjected to SEM. The most researched area among test-taker variables was strategy (26%; 13/50), followed by motivation (16%) and social milieu (12%), and among test-related/task-related variables, it was trait/test structure (20%). It should be noted that these three variables (motivation, social milieu, and trait/test structure) are those stated in Kunnan (1998) as promising lines of research and that an increasing number of research works regarding these variables indicate advances in our field. In fact, more than half of the studies investigating these variables were published in 1999 or later. Table 4 also displays areas where SEM has been used but where the number of such studies is limited, as represented by explicit and implicit knowledge and self-rating. It should be noted that Table 4 was developed inductively and does not list all topics of substantive interest that can benefit from application of SEM. Examples of such topics of interest may include investigation into developmental patterns of L2 acquisition, L1 transfer effects on L2 learning, the relationship between personality and test scores, and the comparison between instructed and natural settings for L2 learning.


TABLE 4
Frequency of Articles by Content Area

                                                     Article (%)
Test-taker variables
  Strategy                                           13 (26%)
  Motivation                                          8 (16%)
  Social milieu or cultural background                6 (12%)
  Anxiety                                             5 (10%)
  Reading                                             5 (10%)
  Speaking                                            5 (10%)
  Vocabulary                                          5 (10%)
  Listening                                           4 (8%)
  Writing                                             4 (8%)
  Longitudinal growth                                 3 (6%)
  Metacognitive knowledge                             3 (6%)
  Grammar                                             2 (4%)
  Willingness to communicate                          2 (4%)
  Aptitude                                            1 (2%)
  Explicit & implicit knowledge                       1 (2%)
  Intelligence                                        1 (2%)
  Pragmatic development                               1 (2%)
  Teacher perception on communicative activities      1 (2%)
Test-/task-related variables
  Trait/test structure                               10 (20%)
  Cloze test                                          2 (4%)
  Differential item (or test) functioning             2 (4%)
  Rater effect                                        2 (4%)
  C-test                                              1 (2%)
  Classroom assessment & standardized tests           1 (2%)
  Overall task characteristics                        1 (2%)
  Paper-/computer-based test comparison               1 (2%)
  Self-rating                                         1 (2%)

Note. Articles investigating more than one variable were classified into more than one category.

Finally, although not shown in Table 4, nine of the 10 articles investigating trait/test structure were published in Language Testing, suggesting that SEM is an essential methodology for examining trait/test structure among language testers. To explore whether researchers who examined certain content areas selected different parameter estimation methods, types, and frequencies of model fit indices, normality checks, missing data treatments, ways of comparing actual and adequate sample sizes, and software, we selected the top four topics in Table 4 (i.e., strategy, motivation, social milieu, and trait/test structure) and presented the results separately next.

Parameter Estimation

As Table 5 shows, estimation was in most cases conducted using the maximum likelihood method (52%; 26/50).


TABLE 5
Characteristics of Parameter Estimation Methods and Model Fit Indices Used

                                              All(a)      Strategy(b)  Motivation(c)  Social Milieu(d)  Trait Structure(e)  Model
Parameter estimation
  Generalized least squares                   1 (2%)      0            0              0                 0                   20 (6%)
  Maximum likelihood                          26 (52%)g   6 (46%)      2 (25%)        5 (83%)           6 (60%)             220 (61%)g
  Robust maximum likelihood                   4 (8%)g     0            0              0                 3 (30%)             43 (9%)g
  Unweighted least squares                    1 (2%)      0            0              0                 0                   4 (1%)
  Weighted least squares
    (asymptotically distribution free)        1 (2%)      0            0              0                 0                   27 (7%)
  Bootstrap                                   1 (2%)      1 (8%)       0              0                 0                   1 (0.3%)
  Not reported                                19 (30%)    6 (46%)      4 (50%)        1 (17%)           3 (30%)             45 (13%)
Model fit
 Basic fit indices
  Chi-square(f)                               42 (84%)    12 (92%)     6 (75%)        4 (67%)           8 (80%)             311 (86%)
  p value(f)                                  35 (70%)    9 (69%)      5 (63%)        4 (67%)           7 (70%)             201 (56%)
  Degrees of freedom(f)                       42 (84%)    11 (85%)     5 (63%)        4 (67%)           9 (90%)             303 (84%)
  χ²/df                                       15 (30%)    5 (38%)      5 (63%)        3 (50%)           3 (30%)             119 (33%)
  SB χ²                                       4 (8%)      0            0              1 (17%)           2 (20%)             28 (8%)
  SB χ²/df                                    2 (4%)      0            0              0                 1 (10%)             8 (2%)
 Incremental or comparative fit indices
  CFI                                         32 (64%)    10 (77%)     6 (75%)        6 (100%)          5 (50%)             220 (61%)
  IFI                                         5 (10%)     3 (23%)      3 (38%)        1 (17%)           0                   11 (3%)
  NFI                                         15 (30%)    6 (46%)      4 (50%)        3 (50%)           2 (20%)             65 (18%)
  PCFI                                        2 (4%)      0            2 (25%)        2 (34%)           0                   26 (7%)
  PNFI                                        2 (4%)      1 (8%)       1 (13%)        1 (17%)           0                   4 (1%)
  RFI                                         2 (4%)      0            1 (13%)        1 (17%)           0                   7 (2%)
  PRATIO                                      1 (2%)      0            1 (13%)        1 (17%)           0                   1 (0.3%)
  TLI                                         22 (44%)    11 (85%)     4 (50%)        2 (34%)           1 (10%)             137 (38%)
 Absolute fit indices
  GFI                                         15 (30%)    5 (38%)      5 (63%)        1 (17%)           1 (10%)             70 (19%)
  AGFI                                        10 (20%)    5 (38%)      5 (63%)        1 (17%)           0                   34 (9%)
 Residual-based fit indices
  RMR                                         3 (6%)      1 (8%)       1 (13%)        0                 1 (10%)             13 (4%)
  RMSEA                                       25 (50%)    8 (62%)      5 (63%)        4 (67%)           4 (40%)             201 (56%)
  RMSEA confidence interval                   2 (4%)      0            0              0                 1 (10%)             16 (4%)
  SRMR                                        1 (2%)      1 (8%)       0              0                 0                   3 (1%)
 Predictive fit indices
  AIC                                         1 (2%)      0            0              0                 1 (10%)             4 (1%)
  CAIC                                        1 (2%)      0            0              0                 1 (10%)             4 (1%)
  Degree of likelihood ratio                  1 (2%)      0            0              0                 0                   20 (6%)
  ECVI                                        2 (4%)      1 (8%)       0              1 (17%)           0                   17 (5%)
  ECVI confidence interval                    1 (2%)      0            0              1 (17%)           0                   12 (3%)
 Others
  Average standardized residuals              4 (8%)      1 (8%)       0              0                 0                   13 (4%)
  Average off-diagonal standardized residuals 4 (8%)      2 (15%)      0              0                 1 (10%)             12 (3%)
  Bollen and Stine bootstrap                  1 (2%)      1 (8%)       0              0                 0                   1 (0.3%)
  χ² difference test                          15 (30%)    0            0              0                 0                   140 (39%)
  Hoelter's Critical N                        1 (2%)      1 (8%)       0              0                 0                   1 (0.3%)
  Null (independent) model χ²                 4 (8%)      0            0              0                 0                   7 (2%)
  Null (independent) model df                 4 (8%)      0            0              0                 0                   7 (2%)
Combination of fit indices(h)
 Hu & Bentler (1998, 1999)
  SRMR, CFI                                   1 (2%)      1 (8%)       0              0                 0                   3 (1%)
  SRMR, TLI                                   1 (2%)      1 (8%)       0              0                 0                   3 (1%)
  SRMR, RMSEA                                 1 (2%)      1 (8%)       0              0                 0                   3 (1%)
  SRMR, IFI                                   0           0            0              0                 0                   0
 Hoyle & Panter (1995, for maximum likelihood estimation)
  χ² (& SB χ²), df, p, CFI, TLI, GFI          3 (6%)                                                                        19 (5%)
 Hoyle & Panter (1995, for small samples [<150] with generalized least squares estimation)
  χ² (& SB χ²), df, p, CFI, IFI, GFI
 Russell (2002)
  SRMR, CFI                                   1 (2%)      1 (8%)       0              0                 0                   1 (0.3%)
  SRMR, IFI                                   0           0            0              0                 0                   0
  SRMR, MC                                    0           0            0              0                 0                   0
  SRMR, RNI                                   0           0            0              0                 0                   0
  SRMR, TLI                                   1 (2%)      1 (8%)       0              0                 0                   1 (0.3%)
  SRMR, RMSEA                                 1 (2%)      1 (8%)       0              0                 0                   1 (0.3%)
 Kashy, Donnellan, Ackerman, & Russell (2009)
  CFI, χ²(i), RMSEA                           18 (32%)    6 (46%)      4 (50%)        3 (50%)           3 (30%)             132 (37%)
  TLI, χ²(i), RMSEA                           12 (24%)    7 (54%)      2 (25%)        0                 1 (10%)             67 (19%)
 Bandalos & Finney (2010)
  χ²(i), CFI, TLI, RMSEA with CI, SRMR        0           0            0              0                 0                   0
 Mueller & Hancock (2010)
  RMSEA with CI, SRMR, CFI                    0           0            0              0                 0                   0
  RMSEA with CI, SRMR, NFI                    0           0            0              0                 0                   0
  RMSEA with CI, SRMR, TLI                    0           0            0              0                 0                   0
 Widaman (2010)
  χ²(i), CFI, TLI, RMSEA                      12 (24%)    6 (46%)      2 (25%)                          1 (10%)             66 (18%)
 Cheung & Rensvold (2002; for measurement invariance testing across groups)
  CFI, Gamma Hat, McDonald's Noncentrality Index

Note. SB χ² = Satorra-Bentler corrected chi-square; CFI = comparative fit index; IFI = incremental fit index; NFI = normed fit index; PCFI = parsimony-adjusted CFI; PNFI = parsimony-adjusted NFI; RFI = relative fit index; PRATIO = parsimony ratio; TLI = Tucker-Lewis index; GFI = goodness-of-fit index; AGFI = adjusted GFI; RMR = root mean square residual; RMSEA = root mean square error of approximation; SRMR = standardized RMR; AIC = Akaike Information Criterion; CAIC = consistent AIC; ECVI = expected cross-validation index; CI = confidence interval. Blank cells reflect entries that were not legible in this copy.
a k = 50. b k = 13. c k = 8. d k = 6. e k = 10. f Although these three statistics are presented separately for clarity, we reiterate that they are used together to examine model fit. g Three articles containing a total of 31 models reported two estimation results (i.e., maximum likelihood and maximum likelihood robust estimations). Both estimation methods were separately coded. h When one article reported multiple fit indices and thereby fell into multiple categories, coding was conducted as such. i Including the Satorra-Bentler corrected chi-square.

This was not surprising because the maximum likelihood method is the default in many SEM programs. The second highest percentage was explained by articles not reporting estimation methods (30%), which is problematic because the choice of estimation method would affect the validity of the results. A similar trend was also seen across the top four topics. One question of significant interest concerns the extent to which parameter estimation was appropriate in the previous applications while considering various measurement issues, including the type of data being analyzed (e.g., categorical vs. continuous) and multivariate normality of data, which we summarized in Table 6. From all the articles and books we reviewed, we found Finney and DiStefano (2006) the most accessible, synthetic, authoritative, and up-to-date. They stated that if the variables have at least five ordered categories and the data are simultaneously normally distributed, the data can be treated as continuous in nature and the maximum likelihood estimation is recommended; if the variables have at least five ordered categories and the data are nonnormally distributed, the robust maximum likelihood estimation is recommended; if the variables have four or fewer categories, the robust weighted least squares estimator is recommended. The numerals representing the number of articles and models following these recommendations are italicized in Table 6, and this number was found to be very small: maximum likelihood (12%; 6/50) and robust maximum likelihood (2%). This showed that a majority of SEM applications were not appropriate with regard to the parameter estimation method. The nil use of robust weighted least squares estimation was not desirable but understandable because it is available only in Mplus (L. K. Muthén & Muthén, 1998-2011) which, as discussed next, has been rarely used in our field (see Table 8).
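The Finney and DiStefano (2006) rule just summarized can be written as a simple decision helper, sketched below. The thresholds merely restate the text; the function is illustrative and not tied to any particular SEM program.

```python
def recommended_estimator(num_ordered_categories, multivariate_normal):
    """Return the estimator suggested by the Finney and DiStefano (2006) rule of thumb."""
    if num_ordered_categories >= 5 and multivariate_normal:
        return "maximum likelihood"
    if num_ordered_categories >= 5:
        return "robust maximum likelihood (Satorra-Bentler corrected)"
    return "robust weighted least squares"


print(recommended_estimator(num_ordered_categories=5, multivariate_normal=False))
print(recommended_estimator(num_ordered_categories=4, multivariate_normal=True))
```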


TABLE 6
Data Type, Distribution, and Parameter Estimation

[The original table cross-tabulates, for all 50 articles and for the top four content areas (strategy, motivation, social milieu, and trait structure), three data conditions (1 = at least five categories & normally distributed; 2 = at least five categories & nonnormally distributed; 3 = fewer than five categories) against the estimation method used (maximum likelihood, robust maximum likelihood, robust weighted least squares, weighted least squares, and bootstrapping). The individual cell frequencies could not be reliably recovered from this copy; the key figures are given in the text, for example, 6 articles (12%) [30 models (8%)] used maximum likelihood under condition 1, and 1 article (2%) [12 models (3%)] used robust maximum likelihood under condition 2.]

Note. This table included only studies that reported all three kinds of information (i.e., data type, distribution, and parameter estimation). A counted frequency outside the square brackets [ ] is based on an article as a unit of analysis (N = 50), whereas that inside the square brackets [ ] is based on a model as a unit of analysis (N = 360). The numerals representing the number of articles and models following these recommendations are italicized.

Model Fit Indices

Types of fit indices used. Table 5 shows that chi-squares (84%; 42/50), p values (70%), and degrees of freedom (84%) were, overall, often reported. Other widely reported fit indices were CFI (64%), RMSEA (50%), and TLI (44%). These practices seem appropriate because the SEM literature has frequently recommended the use of these indices. Confidence intervals of RMSEA were rarely reported (4%), despite long-standing SEM literature advice to do so (e.g., MacCallum & Austin, 2000; Steiger, 1990). A similar pattern of fit indices reporting was also observed across the top four topics. These results are consistent with Jackson, Gillaspy, and Purc-Stephenson (2009), who stated that in psychology, chi-squares (89%), CFI (78%), RMSEA (65%), and TLI (46%) were the most commonly reported measures of fit. This indicates that these indices tended to be used often across fields.



One exception was SRMR, which was rarely reported (2%) in the language testing and learning literature but, as reported by Jackson et al., was more often reported in psychology (23%). Thus, Hu and Bentler's (1998, 1999) recommendation to report SRMR is not well practiced in our field. To examine the reason for the infrequent use of SRMR in the language testing and learning literature, the first appearance of SRMR in each SEM computer program was investigated. To the best of our knowledge, SRMR is calculated in Amos (1995, version 3.1 or later),1 CALIS (SAS PROC CALIS; at least in the 2009 version or later), EQS (1995, version 4.0 or later), LISREL (1986, version VI or later), and Mplus (2001, version 2 or later). From the 50 articles, those in which the models were analyzed using these or later versions of the aforementioned programs were extracted. Articles not reporting software versions were excluded. We found 11 articles using Amos, 0 using CALIS (SAS Institute, 1990-2011), nine using EQS, seven using LISREL (Jöreskog & Sörbom, 1974-2011), and one using Mplus, totaling 28 articles. Given that such a large number of articles could have been reported with SRMR but that only one article reported this index, the reason for the infrequent use of SRMR was at least not a technical one.2 Further, it seems that most language testers and researchers who used SEM were not very familiar with Hu and Bentler's (1998, 1999) recommendation to report SRMR because only six of the 37 articles published in 1998 or later cited Hu and Bentler (1999) and none cited their 1998 publication. Moreover, the six articles that cited Hu and Bentler (1999) referred to the cutoff criteria for fit indices (e.g., RMSEA values of approximately .06 or below indicate a good fit) but did not refer to SRMR. We also found one type of misuse of model fit indices. In two articles, different criteria were used to evaluate different models, even though the same criteria should be used to compare all models in a study. In one article containing five models, the author used p values of chi-squares and NFI to interpret one model and used TLI and CFI to interpret the remaining four models. In another article containing 27 models, the authors used p values of chi-squares, TLI, CFI, GFI, RMSEA, and SRMR to interpret the first two of their models but did not use GFI and SRMR to interpret the remaining 25 models. These two articles seem to suggest that the authors had implicit rationales for using certain fit indices to interpret certain kinds of models, although such rationales were not made explicit in their articles.

Frequency of fit indices used. As seen in Table 7, the 50 articles using SEM reported various numbers of fit indices, from 0 to 12. On average, 5.16 fit indices per study were reported overall; a smaller number of fit indices per study (3.90) was reported for studies on trait/test structure. These results are appropriate because it is recommended that two or more indices be reported. Exceptions include four articles reporting either zero or one fit index (i.e., zero in two articles, the chi-square statistic in one article, and CFI in one article). We further examined whether the use of fit indices was appropriate in terms of their combinations because fit indices properly used in tandem can provide complementary information to one another. Table 5 shows that only one of the 50 articles reported SRMR and one of the following (CFI, TLI, or RMSEA), as recommended by Hu and Bentler (1998, 1999).
1 It should be noted that calculation of SRMR in Amos requires the use of its plug-in function (see Garson, 2009).
2 In EQS, SRMR is reported under the normality assumption but not as part of robust statistics for the maximum likelihood, least squares, or generalized least squares estimation method. Even when using these robust statistics, the SRMR that is reported under the normality assumption needs to be reported. This is because it is not necessary to adjust SRMR for nonnormality, as it is simply the standardized difference between the observed and model-estimated covariances (EQS Technical Support Team [K. H. Kim], personal communication, September 9, 2010).


TABLE 7
Number of Fit Indices Reported

                    M       SD     Minimum   Maximum   Skewness   Kurtosis   Median
Article
  All              5.16    2.91       0        12        0.38       0.41       5
  Strategy         7.00    2.35       2        10        0.60       0.04       7
  Motivation       6.63    3.50       1        12        0.07       0.03       5.5
  Social milieu    6.33    3.56       1        12        0.19       1.79       6.5
  Trait structure  3.90    2.69       0        10        1.12       2.58       3.5
Model              4.47    2.44       0        14        0.65       1.27       5

Note. Chi-squares, p values, and/or degrees of freedom are often reported together, so they are all coded as one fit index.

Likewise, only three of the 50 articles reported all of the chi-square (and the Satorra-Bentler corrected chi-square), degrees of freedom, p value, CFI, TLI, and GFI, as suggested by Hoyle and Panter (1995, for maximum likelihood estimation). Further, the reporting practice did not follow Bandalos and Finney (2010), Mueller and Hancock (2010), and Cheung and Rensvold (2002, when testing measurement invariance). In contrast, approximately one fourth of the 50 articles, although still low in percentage, were satisfactory with regard to Kashy et al.'s (2009) and Widaman's (2010) guidelines. The results were similar when the top four content areas were analyzed individually. In sum, these researchers' recommendations were not well reflected in actual practice, and this needs to be rectified.

Test of Data Normality

As shown in Table 8, overall, fewer than half of the articles reported univariate and multivariate normality tests (44% [22/50] and 32% [16/50], respectively). The results were essentially the same regardless of the content areas. Although not shown in Table 8, we found that none of the six articles reporting 50 models in Modern Language Journal reported a test of multivariate normality of data, nor did one article reporting two models in Studies in Second Language Acquisition and four articles reporting 12 models in TESOL Quarterly. Using nonnormal data would result in bias in calculating statistics, including fit indices. Because univariate and multivariate normality can be easily checked using statistics available in SEM programs, this should be routinely inspected and reported.

Missing Data Treatment

Missing data treatment was often not reported (e.g., in 66% of all studies; see Table 8), and it is not clear whether the data contained no missing responses, whether the missing data were deleted or imputed before being analyzed, or whether the missing data were analyzed using the full information maximum likelihood method. Missing data were mostly handled using listwise deletion (20%). Listwise methods may not be desirable, however, because they work only on data missing completely at random, although this assumption is often violated in practice (e.g., B. Muthén et al., 1987). When imputation was used, which was rare (6% in total), details were not clear.


TABLE 8
Characteristics of Measurement Issues

                                            All        Strategy   Motivation  Social Milieu  Trait Structure  Model
Test of data normality
  Univariate normality check
    (skewness & kurtosis)                   22 (44%)   8 (62%)    2 (25%)     3 (50%)        4 (40%)          124 (34%)
  Multivariate normality test               16 (32%)   4 (31%)    0           2 (33%)        5 (50%)          105 (29%)
  Not reported (univariate)                 28 (56%)   5 (38%)    6 (75%)     3 (50%)        6 (60%)          236 (66%)
  Not reported (multivariate)               34 (66%)   9 (69%)    8 (100%)    4 (67%)        5 (50%)          255 (71%)
Missing data treatment
  Listwise                                  10 (20%)   4 (31%)    1 (13%)     2 (33%)        4 (40%)          98 (27%)
  Pairwise                                  1 (2%)     0          0           0              0                4 (1%)
  M replacement                             1 (2%)     0          1 (13%)     0              0                1 (0.3%)
  Full information maximum likelihood       1 (2%)     0          1 (13%)     1 (17%)        0                24 (7%)
  Imputation (multiple imputation)          1 (2%)     0          0           0              0                4 (1%)
  Imputation (not specified)                2 (4%)     2 (15%)    0           0              0                5 (1%)
  No missing data                           1 (2%)     0          0           0              1 (10%)          1 (0.3%)
  Not reported                              33 (66%)   7 (54%)    5 (63%)     3 (50%)        5 (50%)          223 (62%)
Software
  Amos                                      12 (24%)   4 (31%)    6 (75%)     3 (50%)        0                51 (14%)
  CALIS                                     1 (2%)     1 (8%)     0           0              0                15 (4%)
  EQS                                       15 (30%)   4 (31%)    2 (25%)     2 (33%)        5 (50%)          96 (27%)
  LISREL                                    20 (40%)   4 (31%)    0           0              5 (50%)          187 (52%)
  Mplus                                     1 (2%)     0          0           0              0                4 (1%)
  Not reported                              1 (2%)     0          0           1 (17%)        0                7 (2%)


In the case of two articles containing five models using EQS, it was not clear which of the imputation methods was used, whereas EQS has five options (i.e., mean imputation, regression imputation, stochastic regression imputation, hot-deck imputation, and expectation maximization imputation; Bentler, 2006, pp. 278-279). Only one article (2%) reported having no missing data. Researchers are encouraged to report how missing data, if any, were addressed. These frequencies and patterns of missing data treatment were also observed across the four topics.

Software

As seen in Table 8, overall, the most often used software for SEM analysis was LISREL (40%), followed by EQS (30%) and Amos (24%). This order of frequency seems natural because LISREL is the oldest of the three and Amos is the newest. Another reason would be that authors using LISREL tend to be especially prolific. Among the 20 articles using LISREL, eight were by the same authors (four articles by one group of author[s], and the other four by another). For studies investigating motivation and social milieu, Amos was most frequently used (70% and 50%, respectively).

Sample Size

Table 9 shows a comparison of the actual median sample sizes of the 50 articles, using Kline's (2005) guidelines. Six articles (12%) were found to be based on small sample sizes.


TABLE 9
Sample Sizes Among 50 Articles by Kline's (2005) Definition

                        Small: 99 or Less   Medium: 100-199   Large: 200 or Above
All                     6 (12%)             13 (26%)          31 (62%)
Strategy                0                   3 (23%)           10 (77%)
Motivation              1 (13%)             2 (25%)           5 (63%)
Social milieu           0                   1 (17%)           5 (83%)
Trait/test structure    0                   4 (40%)           6 (60%)

Note. To save space, we reported only results based on the article level (50 models) and did not report those based on the model level (360 models). This also applies to Table 10 and Table 11.


TABLE 10
Actual and Desired Sample Sizes by Raykov and Marcoulides (2006)

                                      M          SD         Minimum   Maximum   Skewness   Kurtosis   Median
All
  1. Actual sample size               851.96     2,266.44      75     15,000      5.40       32.31     258.5(b)
  2. No. of free parameters            46.24        31.41      10        132      1.09        0.48      35.5
  3. Desired minimum sample size(a)   462.40       314.06     100      1,320      1.09        0.48     355
Strategy
  1. Actual sample size               490.54       426.82     102      1,382      1.56        1.57     281
  2. No. of free parameters            59.62        28.49      10        124      0.67        1.47      58
  3. Desired minimum sample size      596.15       284.91     100      1,240      0.67        1.47     580
Motivation
  1. Actual sample size               804.25     1,607.61      75      4,765      2.78        7.78     248
  2. No. of free parameters            53.88        20.52      30         93      0.76        0.78      57
  3. Desired minimum sample size      538.80       205.20     300        930      0.76        0.78     570
Social milieu
  1. Actual sample size             1,054.17     1,824.04     137      4,765      2.41        5.86     309.5
  2. No. of free parameters            55.67        31.87      24         95      0.48        2.10      47.5
  3. Desired minimum sample size      556.67       318.73     240        950      0.48        2.10     475
Trait structure
  1. Actual sample size               452.80       708.81     110      2,450      3.05        9.46     222
  2. No. of free parameters            28.80        21.98      13         90      2.90        8.86      24
  3. Desired minimum sample size      288.00       219.84     130        900      2.90        8.86     240

Note. a Calculated using a formula of the number of free parameters times 10 (Raykov & Marcoulides, 2006). b When four articles with the largest sample sizes of 15,000, 5,000, 4,765, and 2,450 were excluded from analysis, the mean, median, and mode were 334.41, 251, and 561. When seven articles with the largest sample sizes of 15,000, 5,000, 4,765, 2,450, 1,382, 1,175, and 1,113 were excluded from analysis, the mean, median, and mode were 272.4, 237, and 561. These results show that the median was stable regardless of the extreme values.

Table 10 shows the descriptive statistics of actual and minimum desired sample sizes calculated based on Raykov and Marcoulides (2006). We interpreted the median rather than the mean, standard deviation, and mode, because the mean and standard deviation were misleading given the highly skewed and kurtotic distribution of actual sample sizes and because the mode was based only on two articles with the same sample size. The median sample size of the 50 articles was 258.5, a large sample size according to Kline (2005).



Table 10 also shows that the number of free parameters that needed to be estimated in each article had a median value of 35.5. Thus, the minimum desirable sample size based on the number of free parameters was 355 (35.5 × 10). A comparison of the actual (258.5) and minimum desirable (355) sample sizes across the 50 articles shows that, on average, the sample size used for structural equation models in the field of language testing and learning was insufficient. This result held across the top four content areas. For example, the actual sample size for strategy research was 281, although the desirable sample size was at least 580. Further, more details were obtained by examining the differences between the actual and minimum desirable sample sizes for each article, as seen in Table 11. The differences ranged from -690 to 550, with extreme values excluded. The differences peaked around zero, with more articles located in the minus area, suggesting that the actual sample sizes for each article were often smaller than the minimum desired sample sizes; in fact, 29 (58%) of the 50 articles had an insufficient sample size. After having examined the adequacy of sample sizes using the two aforementioned criteria, we might be able to argue that the sample size of a study is insufficient if it was judged to be insufficient by both criteria (i.e., a sample size of 99 or below according to Kline, 2005, and an actual sample size smaller than the minimum desirable sample size according to Raykov & Marcoulides, 2006). We found that there were six such articles (12%); thus, we conclude that six of the 50 articles had insufficient sample sizes.

TABLE 11
Stem-and-Leaf Plot of Differences Between Actual and Desired Sample Sizes (N = 50)

Frequency (%)    Stem and Leaf
 1 (2%)          Extremes (Value: -959)
 4 (8%)          -6 . 3499
 1 (2%)          -5 . 7
 1 (2%)          -4 . 6
 5 (10%)         -3 . 02588
 6 (12%)         -2 . 013478
 3 (6%)          -1 . 046
 8 (16%)         -0 . 01124557
10 (20%)          0 . 0013446779
 3 (6%)           1 . 238
 0 (0%)           2 .
 0 (0%)           3 .
 1 (2%)           4 . 9
 1 (2%)           5 . 5
 6 (12%)         Extremes (Values: 752, 953, 2,240, 4,195, 4,240, and 13,680)

Note. The differences have been calculated as the actual sample size minus the minimum desired sample size using Raykov and Marcoulides (2006). Stem width is 100, and each leaf is 1 case; for example, -6 . 3499 in the second row means that the differences between the actual and minimum desired sample sizes are -630, -640, -690, and -690. This suggests that actual sample sizes were less than the minimum desired and that these four articles should have had extra samples of 630, 640, 690, and 690 in addition to their original sample sizes.
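The two screens combined above can be expressed as a small helper, sketched below: Kline's (2005) size bands and the 10-cases-per-free-parameter rule of Raykov and Marcoulides (2006). The example figures are invented and the function name is ours.

```python
def sample_size_flags(n, free_parameters):
    """Screen a sample size with Kline's (2005) bands and the 10:1 rule of Raykov and Marcoulides (2006)."""
    band = "small" if n < 100 else "medium" if n < 200 else "large"
    shortfall = 10 * free_parameters - n      # positive values mean fewer cases than desired
    return {
        "kline_band": band,
        "shortfall_vs_10_per_parameter": shortfall,
        "insufficient_by_both_criteria": band == "small" and shortfall > 0,
    }


# Invented examples: a study near the review's medians, and a clearly underpowered one.
print(sample_size_flags(n=258, free_parameters=36))
print(sample_size_flags(n=85, free_parameters=30))
```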

Although researchers rarely justify the sample size of their study, we found two exceptions. In one article containing two models, the authors stated that their sample size was adequate because it exceeded 200. In the other article, containing one model, the authors defended their sample size by stating that it was larger than 10 times the number of free model parameters. Justification for sample sizes should be encouraged more.

Others

Although not explicit from Tables 5 to 10, we found that the same author(s) tended to use the same analytical procedure for SEM. For example, an author who does not check the multivariate normality of the data in early work does not do so in later work either. This suggests that if guided appropriately in the early stages of their academic careers, authors will continue to follow good practice. Thus, for example, when an author submits an article for review for the first time, the editor(s) or reviewers can encourage the author to report on the normality of the data distribution. That author is then more likely to report such information when writing a later article.


SUMMARY AND IMPLICATIONS

The use of SEM in language testing and learning research has recently increased. Thus, as Kunnan (1998) argued, its use needs to be periodically reviewed so that a new model can firmly build on existing models that were tested using a proper procedure. We therefore examined the characteristics of the use of SEM in language testing and learning research. In the 20 internationally accessible published journals on language testing and learning that we reviewed, 360 models from 50 articles were found in nine journals and were inspected in terms of (a) content analysis, (b) parameter estimation methods, (c) the types and frequency of model fit indices used, (d) normality checks, (e) missing data treatment, (f) sample sizes, and (g) software used.

The content analysis showed that the most researched area using SEM was strategy, followed by trait/test structure. Regarding parameter estimation, maximum likelihood methods were most often used. With respect to model fit, chi-squares (and the p values and degrees of freedom associated with them), CFI, RMSEA, and TLI were often reported; SRMR was rarely used. Although multiple fit indices were usually reported, few studies reported overall exemplary combinations of fit indices. In addition, univariate and multivariate normality checks were infrequently reported. Missing data treatment was also infrequently reported; when it was reported, listwise deletion and full information maximum likelihood estimation were often used. According to Kline's (2005) guidelines, the median sample size of 258.5 observed in actual studies was adequate for SEM analysis; however, the difference between the actual (258.5) and minimum desired (355) sample sizes based on the number of parameters (e.g., Raykov & Marcoulides, 2006) suggested the need for larger sample sizes. Combining these two criteria, six of the 50 articles were found to have insufficient sample sizes. Nevertheless, given that models with varying characteristics (e.g., the amount and patterns of missing data and the strength of the relationships among the indicators) have been used in the language testing and learning literature, and given that sample size requirements are influenced by these characteristics, our findings based on these two guidelines need to be interpreted carefully. Further, LISREL was more widely used than EQS and Amos. These results were generally the same across the 50 studies and across the top four content areas of strategy, motivation, social milieu, and trait/test structure.

Two recommendations for better practice in using and reporting SEM for language testing and learning research are discussed. First, all topics in the language testing and learning fields, parts of which were investigated in previous studies and are shown in Table 4, would merit further investigation using SEM. We particularly suggest three topics that may be well addressed by applying SEM. To begin, we suggest conducting more research into the stability of a model across samples to examine whether the same constructs are measured across samples. This issue is especially relevant to studies involving learners of varying characteristics, especially for large-scale, high-stakes tests intended for heterogeneous test-takers. Data can be analyzed using multisample analysis, and results can provide insight into the fairness of test score interpretation across learners (see Purpura, 1998; Shin, 2005). Moreover, we suggest conducting studies on longitudinal change in language proficiency. Although they are as important as cross-sectional studies, it is only recently that longitudinal studies have received growing attention as a means of advancing our knowledge of language acquisition processes (Ortega & Byrnes, 2008). Insights gained from longitudinal research may be useful in developing proficiency descriptors such as Can-do statements and in forming sound educational practices. For this purpose, latent growth modeling allows for the evaluation of change over time at the group as well as the individual level (see Matsumura, 2003; Ross, 2005). Finally, we suggest more investigations into sources of variability in performance assessment. The data in performance assessment often have a hierarchical structure in which observations are nested within higher levels of classification (e.g., ratings are nested within raters). There has been little empirical research that considers such nested structure, despite the sustained interest in how rater attributes are related to systematic variance in ratings of speaking or writing performance. In this case, multilevel modeling is useful for examining variables of interest at each hierarchical level (see Barkaoui, 2010).

Second, we suggest that authors use appropriate procedures and provide sufficient information about, and justification for, five perspectives: (a) parameter estimation methods, (b) model fit indices used, (c) normality checks, (d) missing data treatment, and (e) sample sizes. The first perspective concerns the type of estimation method; our study showed that 30% of the articles were unclear regarding the estimation method used. This needs to be rectified, as the choice of estimation method affects subsequent analyses (for details, see Ullman, 2007) and is easy to report. We also strongly encourage researchers using SEM to pay closer attention to Finney and DiStefano (2006) for nonnormal or categorical data issues. Although these authors recommend the robust weighted least squares estimator if the variables have four or fewer categories, this estimator is available only in Mplus. One needs to be familiar with Mplus or to consider increasing the number of ordered scale categories to five or more.
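As one simple way to act on this advice, the brief sketch below (our own illustration in Python; the item names and the pandas-based check are assumptions, not part of any reviewed study) counts the number of ordered response categories per observed variable so that researchers can see whether normal-theory maximum likelihood or a robust categorical estimator is the more defensible choice:

```python
# Illustrative data-screening sketch: count ordered response categories per
# observed variable to inform the choice between normal-theory ML and a
# robust/categorical estimator, as discussed above. Data are hypothetical.
import pandas as pd

def screen_categories(df: pd.DataFrame, min_categories: int = 5) -> pd.DataFrame:
    """For each column, report the number of distinct observed categories
    and whether it reaches the suggested minimum of five."""
    n_categories = df.nunique(dropna=True)
    return pd.DataFrame({
        "n_categories": n_categories,
        "at_least_min": n_categories >= min_categories,
    })

if __name__ == "__main__":
    # Hypothetical questionnaire items scored on a 4-point scale.
    items = pd.DataFrame({
        "strategy1": [1, 2, 4, 3, 2, 1, 4, 3],
        "strategy2": [2, 2, 3, 3, 4, 1, 2, 3],
    })
    print(screen_categories(items))  # both items fall below five categories
```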
The second perspective regards fit indices; we showed that some authors used different fit indices depending on the models, even when using the same data within a series of analyses. However, such practice cannot be recommended unless a clear explanation is provided as to why some fit indices are more appropriate for evaluating certain kinds of models. In addition, more attention needs to be paid to SRMR. We suggest routinely reporting SRMR when evaluating model fit. As can be clearly seen from our results, SRMR is rarely used in language testing and learning, in contrast to Hu and Bentler's (1998, 1999) recommendation to report the index.
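For readers unfamiliar with the index, SRMR is commonly defined in the SEM literature (the notation below is our own summary, not a formula reported in the reviewed articles) as the square root of the average squared standardized covariance residual:

\[
\mathrm{SRMR}=\sqrt{\frac{\sum_{i=1}^{p}\sum_{j=1}^{i}\left[(s_{ij}-\hat{\sigma}_{ij})\big/(s_{ii}\,s_{jj})^{1/2}\right]^{2}}{p(p+1)/2}},
\]

where \(s_{ij}\) denotes an observed covariance, \(\hat{\sigma}_{ij}\) the corresponding model-implied covariance, and \(p\) the number of observed variables. Values close to zero indicate that the model reproduces the observed covariances closely, which is what makes the index easy to interpret.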

Because the intuitive interpretability of SRMR is appealing to researchers unfamiliar with fit indices and because SRMR can be calculated using SEM programs, this index needs to be reported more often. SRMR appears in the output if one uses Amos, CALIS, EQS, LISREL, or Mplus (see the Types of Fit Indices Used section). Finally, we encourage researchers to follow the reporting guidelines for combinations of fit indices recommended by researchers such as Hu and Bentler (1998, 1999) and others (see the Five Steps in an SEM Application section). These combinations consist of fit indices of a different nature, thereby assessing data-model fit from various aspects.

Third, we suggest that authors examine data normality before the main data analysis is run. Univariate normality can be checked by calculating the skewness and kurtosis of each variable. Multivariate normality, especially multivariate kurtosis, can be checked by calculating Mardia's estimate using SEM programs (a minimal screening sketch appears at the end of this section). Skewed and/or kurtotic distributions of a variable can be transformed if a normal distribution is expected in the population (see, e.g., Kline, 2005). If a nonnormal distribution is expected in the population, skewed and/or kurtotic data can be analyzed using estimation methods such as the robust maximum likelihood estimation available in EQS, LISREL, and Mplus.

Regarding the fourth perspective, we suggest that authors check whether there are any missing cases in their data. Perhaps the easiest way is to invoke Microsoft Excel's find function (i.e., the control key + f), enter nothing (just leave the search window blank), and press the find button. The activated cell will move to any blank cell showing missing responses. Although listwise and pairwise methods were commonly used, a more appropriate way is to use full information maximum likelihood or expectation maximization methods (see, e.g., Enders, 2001). These can be carried out in SEM programs.

The fifth perspective is about sample sizes. We suggest that authors refer to guidelines on sample sizes, such as those in Kline (2005) and Raykov and Marcoulides (2006), when reporting on sample sizes for their studies. Although there is no universally agreed-upon sample size needed for analysis, mentioning these guidelines would help readers understand authors' decisions about sample sizes.

Of course, not all language testers and language learning researchers are versed in SEM, nor are they expected to keep up with the latest SEM methodological developments. Further, space limitations may prevent complete reporting of analysis details. However, at least the latter recommendation, which includes the aforementioned five perspectives, would be essential for taking complete advantage of SEM. Moreover, relatively little space is required for reporting results related to these five perspectives. More details on these recommendations and other relevant issues are found in introductory books such as Byrne (2006) and Kline (2005), which are highly readable. In conclusion, these changes will not happen overnight, but we believe that paying close attention to the five perspectives would promote the proper use and reporting of SEM among researchers in the fields of language testing and learning and would eventually lead to the accumulation and advancement of knowledge in the field.
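As a concrete illustration of the normality and missing-data screening suggested under the third and fourth perspectives above, the following sketch (our own, in Python; the data and variable names are hypothetical, and Mardia's multivariate kurtosis is computed directly from the sample covariance matrix rather than read from an SEM program's output) shows one way to obtain these diagnostics:

```python
# Illustrative screening sketch: univariate skewness/kurtosis, Mardia's
# multivariate kurtosis, and missing-value counts. The data are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

def univariate_screen(df: pd.DataFrame) -> pd.DataFrame:
    """Skewness, excess kurtosis, and missing count for each variable."""
    complete = df.dropna()
    return pd.DataFrame({
        "skewness": complete.apply(stats.skew),
        "kurtosis": complete.apply(stats.kurtosis),  # excess kurtosis (0 under normality)
        "n_missing": df.isna().sum(),
    })

def mardia_kurtosis(df: pd.DataFrame) -> tuple[float, float]:
    """Mardia's multivariate kurtosis b2,p (mean of squared Mahalanobis distances
    raised to the second power, using the sample covariance of complete cases)
    and its expected value p(p + 2) under multivariate normality."""
    x = df.dropna().to_numpy(dtype=float)
    n, p = x.shape
    centered = x - x.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(x, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, s_inv, centered)  # squared Mahalanobis distances
    return float(np.mean(d2 ** 2)), float(p * (p + 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
    data.loc[5, "x2"] = np.nan  # one artificial missing response
    print(univariate_screen(data))
    print("Mardia's kurtosis (observed, expected):", mardia_kurtosis(data))
```

In practice, the same figures are available from the SEM programs themselves, so a script of this kind is mainly useful as a quick pre-analysis check.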


ACKNOWLEDGMENTS

An earlier version of this article was presented at the American Association for Applied Linguistics 2011 Conference, Chicago, Illinois. We thank Hiroaki Maeda and the three anonymous reviewers for their valuable comments on earlier versions of this article.


REFERENCES
References marked with an asterisk indicate studies included in the review.

Allison, P. D. (2003). Missing data techniques for structural equation modeling. Journal of Abnormal Psychology, 112, 545–557.
Arbuckle, J. L. (1994–2011). Amos [Computer software]. Chicago, IL: SPSS.
Arbuckle, J. L. (1999). Amos (Version 4.01) [Computer software]. Chicago, IL: Smallwaters.
Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL Quarterly, 16, 61–70.
Bachman, L. F. (1998). Appendix: Language testing–SLA research interfaces. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 177–195). New York, NY: Cambridge University Press.
Bachman, L. F., & Eignor, D. R. (1997). Recent advances in quantitative test analysis. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education (Vol. 7): Language testing and assessment (pp. 227–242). Dordrecht, the Netherlands: Kluwer.
Bachman, L. F., & Palmer, A. S. (1981). The construct validation of the FSI oral interview. Language Learning, 31, 67–86.
Bachman, L. F., & Palmer, A. (1982). The construct validation of some components of communicative proficiency. TESOL Quarterly, 16, 449–465.
Bachman, L. F., & Palmer, A. (1989). The construct validation of self-ratings of communicative language ability. Language Testing, 6, 14–29.
Bae, J., & Bachman, L. F. (1998). A latent variable approach to listening and reading: Testing factorial invariance across two groups of children in the Korean/English Two-Way Immersion Program. Language Testing, 15, 380–414.
Bandalos, D. L., & Finney, S. J. (2010). Factor analysis: Exploratory and confirmatory. In G. R. Hancock & R. O. Mueller (Eds.), The reviewer's guide to quantitative methods in the social sciences (pp. 93–114). New York, NY: Routledge.
Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach. Language Testing, 27, 515–535.
Bentler, P. M. (2006). EQS 6 structural equations program manual. Encino, CA: Multivariate Software.
Bentler, P. M. (1994–2011). EQS for Windows [Computer software]. Encino, CA: Multivariate Software.
Blais, J. G., & Laurier, M. D. (1995). The dimensionality of a placement test from several analytical perspectives. Language Testing, 12, 72–98.
Bollen, K. A. (1989a). A new incremental fit index for general structural equation models. Sociological Methods and Research, 17, 303–316.
Bollen, K. A. (1989b). Structural equations with latent variables. New York, NY: Wiley.
Bollen, K. A., & Long, J. S. (1993). Introduction. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 1–9). Newbury Park, CA: Sage.
Boomsma, A. (2000). Reporting analyses of covariance structures. Structural Equation Modeling, 7, 461–483.
Breckler, S. J. (1990). Applications of covariance structure modeling in psychology: Cause for concern? Psychological Bulletin, 107, 260–273.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford.
Byrne, B. M. (2001). Structural equation modeling with AMOS: Basic concepts, applications, and programming. Mahwah, NJ: Erlbaum.
Byrne, B. M. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Erlbaum.
Carr, N. T. (2006). The factor structure of test task characteristics and examinee performance. Language Testing, 23, 269–289.
Chen, J. F., Warden, C. A., & Chang, H.-T. (2005). Motivators that do not motivate: The case of Chinese EFL learners and the influence of culture on motivation. TESOL Quarterly, 39, 609–633.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255.
Choi, I.-C., Kim, K. S., & Boo, J. (2003). Comparability of a paper-based language test and a computer-based language test. Language Testing, 20, 295–320.

Csizér, K., & Dörnyei, Z. (2005). The internal structure of language learning motivation and its relationship with language choice and learning effort. Modern Language Journal, 89, 19–36.
Csizér, K., & Kormos, J. (2009). Modelling the role of inter-cultural contact in the motivation of learning English as a foreign language. Applied Linguistics, 30, 166–185.
Ding, L., Velicer, W. F., & Harlow, L. L. (1995). Effects of estimation methods, number of indicators per factor and improper solutions on structural equation modeling fit indices. Structural Equation Modeling, 2, 119–143.
Dörnyei, Z. (2007). Research methods in applied linguistics. Oxford, UK: Oxford University Press.
Eckes, T., & Grotjahn, R. (2006). A closer look at the construct validity of C-tests. Language Testing, 23, 290–325.
Ellis, R., & Loewen, S. (2007). Confirming the operational definitions of explicit and implicit knowledge in Ellis (2005): Responding to Isemonger. Studies in Second Language Acquisition, 29, 119–126.
Enders, C. K. (2001). A primer on maximum likelihood algorithms available for use with missing data. Structural Equation Modeling, 8, 128–141.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indexes to misspecified structural or measurement model components: Rationale of two-index strategy revisited. Structural Equation Modeling, 12, 343–367.
Finney, S. J., & DiStefano, C. (2006). Non-normal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 269–314). Greenwich, CT: Information Age.
Fornell, C. (1982). A second generation of multivariate analysis: An overview. In C. Fornell (Ed.), A second generation of multivariate analysis (Vol. 1, pp. 1–21). New York, NY: Praeger.
Gardner, R. C., Tremblay, P. F., & Masgoret, A. M. (1997). Towards a full model of second language learning: An empirical investigation. Modern Language Journal, 81, 344–362.
Garson, D. (2009). Structural equation modeling. Retrieved from http://faculty.chass.ncsu.edu/garson/PA765/structur.htm
Ginther, A., & Stevens, J. (1998). Language background, ethnicity, and the internal construct validity of the Advanced Placement Spanish. In A. J. Kunnan (Ed.), Validation in language assessment (pp. 169–194). Mahwah, NJ: Erlbaum.
Gorsuch, G. J. (2000). EFL educational policies and educational cultures: Influences on teachers' approval of communicative activities. TESOL Quarterly, 34, 675–710.
Hoyle, R. H., & Panter, A. T. (1995). Writing about structural equation models. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 158–176). Thousand Oaks, CA: Sage.
Hsiao, T.-Y., & Oxford, R. L. (2002). Comparing theories of language learning strategies: A confirmatory factor analysis. Modern Language Journal, 86, 368–383.
Hu, L.-T., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 76–99). Thousand Oaks, CA: Sage.
Hu, L.-T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453.
Hu, L.-T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
Hulland, J., Chow, Y.-H., & Lam, S. (1996). Use of causal models in marketing research: A review. International Journal of Research in Marketing, 13, 181–197.
In'nami, Y. (2006). The effects of test anxiety on listening test performance. System, 34, 317–340.
In'nami, Y., & Koizumi, R. (2010). Can structural equation models in second language testing and learning research be successfully replicated? International Journal of Testing, 10, 262–273.
Isemonger, I., & Watanabe, K. (2007). The construct validity of scores on a Japanese version of the perceptual component of the Style Analysis Survey. System, 35, 134–147.
Jackson, D. L., Gillaspy, J. A., Jr., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14, 6–23.
Jöreskog, K. G., & Sörbom, D. (1974–2011). LISREL for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Kahn, J. H. (2006). Factor analysis in counseling psychology research, training, and practice: Principles, advances, and applications. The Counseling Psychologist, 34, 684–718.
Kashy, D. A., Donnellan, M. B., Ackerman, R. A., & Russell, D. W. (2009). Reporting and interpreting research in PSPB: Practices, principles, and pragmatics. Personality and Social Psychology Bulletin, 35, 1131–1142.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York, NY: Guilford.

Kunnan, A. J. (1994). Modelling relationships among some test-taker characteristics and performance on EFL tests: An approach to construct validation. Language Testing, 11, 225–250.
Kunnan, A. J. (1995). Test taker characteristics and test performance: A structural equation modeling approach. Cambridge, UK: Cambridge University Press.
Kunnan, A. J. (1998). An introduction to structural equation modelling for language assessment research. Language Testing, 15, 295–332.
Lee, S.-y. (2005). Facilitating and inhibiting factors in English as a foreign language writing performance: A model testing with structural equation modeling. Language Learning, 55, 335–374.
Liao, Y.-F. (2009). A construct validation study of the GEPT reading and listening sections: Re-examining the models of L2 reading and listening abilities and their relations to lexico-grammatical knowledge (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 3348356)
Llosa, L. (2007). Validating a standards-based classroom assessment of English proficiency: A multitrait-multimethod approach. Language Testing, 24, 489–515.
Lomax, R. G. (2010). Structural equation modeling: Multisample covariance and mean structures. In G. R. Hancock & R. O. Mueller (Eds.), The reviewer's guide to quantitative methods in the social sciences (pp. 385–395). New York, NY: Routledge.
Lumley, T., & Brown, A. (2005). Research methods in language testing. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 833–855). Mahwah, NJ: Erlbaum.
MacCallum, R. C., & Austin, J. T. (2000). Application of structural equation modeling in psychological research. Annual Review of Psychology, 51, 201–226.
Martens, M. P. (2005). The use of structural equation modeling in counseling psychology research. The Counseling Psychologist, 33, 269–298.
Matsumura, S. (2003). Modelling the relationships among interlanguage pragmatic development, L2 proficiency, and exposure to L2. Applied Linguistics, 24, 465–491.
McDonald, R. P. (1989). An index of goodness-of-fit based on noncentrality. Journal of Classification, 6, 97–103.
McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7, 64–82.
Mueller, R. O., & Hancock, G. R. (2008). Best practices in structural equation modeling. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 488–508). Thousand Oaks, CA: Sage.
Mueller, R. O., & Hancock, G. R. (2010). Structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), The reviewer's guide to quantitative methods in the social sciences (pp. 371–383). New York, NY: Routledge.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52, 431–462.
Muthén, L. K., & Muthén, B. O. (1998–2011). Mplus [Computer software]. Los Angeles, CA: Muthén & Muthén.
Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 9, 599–620.
Onwuegbuzie, A. J., Bailey, P., & Daley, C. E. (2000). The validation of three scales measuring anxiety at different stages of the foreign language learning process: The input anxiety scale, the processing anxiety scale, and the output anxiety scale. Language Learning, 50, 87–117.
Ortega, L., & Byrnes, H. (2008). The longitudinal study of advanced L2 capacities. New York, NY: Routledge.
Pae, T.-I., & Park, G.-P. (2006). Examining the relationship between differential item functioning and differential test functioning. Language Testing, 23, 475–496.
Phakiti, A. (2008a). Construct validation of Bachman and Palmer's (1996) strategic competence model over time in EFL reading tests. Language Testing, 25, 237–272.
Phakiti, A. (2008b). Strategic competence as a fourth-order factor model: A structural equation modeling approach. Language Assessment Quarterly, 5, 20–42.
Purpura, J. E. (1997). An analysis of the relationships between test takers' cognitive and metacognitive strategy use and second language test performance. Language Learning, 47, 289–325.
Purpura, J. E. (1998). Investigating the effects of strategy use and second language test performance with high- and low-ability test takers: A structural equation modelling approach. Language Testing, 15, 333–379.
Raykov, T., & Marcoulides, G. A. (2006). A first course in structural equation modeling (2nd ed.). Mahwah, NJ: Erlbaum.
Ross, S. J. (2005). The impact of assessment method on foreign language proficiency growth. Applied Linguistics, 26, 317–342.

Russell, D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin, 28, 1629–1646.
Sang, F., Schmitz, B., Vollmer, H. J., Baumert, J., & Roeder, P. M. (1986). Models of second language competence: A structural equation approach. Language Testing, 3, 54–79.
SAS Institute. (1990–2011). SAS PROC CALIS [Computer software]. Cary, NC: Author.
Sasaki, M. (1993). Relationships among second language proficiency, foreign language aptitude, and intelligence: A structural equation modeling approach. Language Learning, 43, 313–344.
Sasaki, M. (1996). Second language proficiency, foreign language aptitude, and intelligence: Quantitative and qualitative analyses. New York, NY: Peter Lang.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment: Reporting a score profile and a composite. Language Testing, 24, 355–390.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22, 1–30.
Schoonen, R., Hulstijn, J., & Bossers, B. (1998). Metacognitive and language-specific knowledge in native and foreign language reading comprehension: An empirical study among Dutch students in Grades 6, 8 and 10. Language Learning, 48, 71–106.
Schoonen, R., van Gelderen, A., de Glopper, K., Hulstijn, J., Simis, A., Snellings, P., & Stevenson, M. (2003). First language and second language writing: The role of linguistic knowledge, speed of processing, and metacognitive knowledge. Language Learning, 53, 165–202.
Schoonen, R., Vergeer, M., & Eiting, M. (1997). The assessment of writing ability: Expert readers versus lay readers. Language Testing, 14, 157–184.
Shin, S.-K. (2005). Did they take the same test? Examinee language proficiency and the structure of language tests. Language Testing, 22, 31–57.
Shiotsu, T., & Weir, C. J. (2007). The relative significance of syntactic knowledge and vocabulary breadth in the prediction of reading comprehension test performance. Language Testing, 24, 99–128.
Song, M.-Y. (2008). Do divisible subskills exist in second language (L2) comprehension? A structural equation modeling approach. Language Testing, 25, 435–464.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173–180.
Tremblay, P. F., & Gardner, R. C. (1995). Expanding the motivation construct in language learning. Modern Language Journal, 79, 505–518.
Tseng, W.-T., Dörnyei, Z., & Schmitt, N. (2006). A new approach to assessing strategic learning: The case of self-regulation in vocabulary acquisition. Applied Linguistics, 27, 78–102.
Tseng, W.-T., & Schmitt, N. (2008). Toward a model of motivated vocabulary learning: A structural equation modeling approach. Language Learning, 58, 357–400.
Turner, C. E. (1989). The underlying factor structure of L2 cloze test performance in Francophone, university-level students: Causal modelling as an approach to construct validation. Language Testing, 6, 172–197.
Ullman, J. E. (2007). Structural equation modeling. In B. G. Tabachnick & L. S. Fidell (Eds.), Using multivariate statistics (5th ed., pp. 676–780). Boston, MA: Allyn and Bacon.
Vandergrift, L., Goh, C. C. M., Mareschal, C. J., & Tafaghodtari, M. H. (2006). The metacognitive awareness listening questionnaire: Development and validation. Language Learning, 56, 431–462.
Widaman, K. F. (2010). Multitrait-multimethod analysis. In G. R. Hancock & R. O. Mueller (Eds.), The reviewer's guide to quantitative methods in the social sciences (pp. 299–314). New York, NY: Routledge.
Williams, L., Edwards, J., & Vandenberg, R. (2003). Recent advances in causal modeling methods for organizational and management research. Journal of Management, 29, 903–936.
Woodrow, L. (2006a). Anxiety and speaking English as a second language. RELC Journal, 37, 308–328.
Woodrow, L. J. (2006b). A model of adaptive language learning. Modern Language Journal, 90, 297–319.
Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content analysis and recommendations for best practices. The Counseling Psychologist, 34, 806–838.
Xi, X. (2005). Do visual chunks and planning impact performance on the graph description task in the SPEAK exam? Language Testing, 22, 463–508.
Yamashiro, A. D. (2002). Using structural equation modeling for construct validation of an English as a foreign language public speaking rating scale (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 3057127)

Yashima, T. (2002). Willingness to communicate in a second language: The Japanese EFL context. Modern Language Journal, 86, 54–66.
Yashima, T., Zenuk-Nishide, L., & Shimizu, K. (2004). The influence of attitudes and affect on willingness to communicate and second language communication. Language Learning, 54, 119–152.
Yuan, K.-H. (2005). Fit indices versus test statistics. Multivariate Behavioral Research, 40, 115–148.
Zumbo, B. D. (2003). Does item-level DIF manifest itself in scale-level analyses? Implications for translating language tests. Language Testing, 20, 136–147.
