THE JOURNAL OF EXPERIMENTAL EDUCATION
https://doi.org/10.1080/00220973.2020.1751579

MEASUREMENT, STATISTICS, AND RESEARCH DESIGN

Estimating standard errors of IRT true score equating coefficients using imputed item parameters

Zhonghua Zhang
Melbourne Graduate School of Education, The University of Melbourne, Carlton, Victoria, Australia

ABSTRACT
Reporting standard errors of equating has been advocated as a standard practice when conducting test equating. The two most widely applied procedures for standard errors of equating, the bootstrap method and the delta method, are either computationally intensive or confined to the derivations of complicated formulas. In the current study, a hypothetical example was used to illustrate how the multiple imputation method could be taken as an alternative procedure for obtaining the standard errors for the item response theory (IRT) true score equating coefficients in the context of the common-item nonequivalent groups equating design under the three-parameter logistic IRT model. This method makes use of multiple sets of imputed item parameter values. Using simulated and real data, the performance of the multiple imputation method was examined and compared with that of the bootstrap and delta methods. The results indicated that the multiple imputation method performed as effectively as the bootstrap method and the delta method when the characteristic curve methods were used, and produced very similar results to the delta method when the moment methods were used.

KEYWORDS
Bootstrap method; delta method; equating; item response theory; multiple imputation method; standard errors; true scores

ITEM RESPONSE THEORY (IRT, Hambleton et al., 1991) equating is necessary when test scores
that are derived based on IRT from alternate test forms developed to measure the common con-
tent need to be placed onto a common measurement scale for producing comparable scores
across these different test forms (Kolen & Brennan, 2004). IRT equating typically involves at least
two steps: (1) placing the item and/or person parameters onto the same measurement scale, and
(2) equating the observed- and/or true-scores. The amount of random error in the estimation of
the equating coefficients can be indexed by the standard errors of equating (Kolen & Brennan,
2004). The reporting of standard errors of equating has been advocated as a standard practice in
practical test equating (American Educational Research Association et al., 2014). The bootstrap
method (Tsai et al., 2001) and the delta method (Ogasawara, 2000, 2001a, 2001b) are the two
most widely applied procedures for obtaining the standard errors for the IRT equating coeffi-
cients. However, the bootstrap method suffers from the issue of being too computationally inten-
sive. The delta method is completely dependent on the derivations of complicated mathematical
formulas for the standard error expressions. Therefore, both methods could be considered prac-
tically intractable in some cases where complicated linking designs, IRT models, and/or equating
methods are employed (Kolen & Brennan, 2004; Zhang & Zhao, 2019). Recently, Zhang and
Zhao (2019) proposed a multiple imputation method which makes use of the random

CONTACT Zhonghua Zhang zhonghua.zhang@unimelb.edu.au Melbourne Graduate School of Education, The University
of Melbourne, Level 8, 100 Leicester Street, Carlton, Victoria 3053, Australia.
© 2020 Taylor & Francis Group, LLC

imputations of the item parameters to estimate the standard errors for the IRT parameter scale
transformation coefficients. This multiple-imputation-based procedure is less computationally intensive than the bootstrap method and does not rely on the derivation of complicated formulas as the delta method does. Zhang and Zhao (2019) used both simulated and real data to
examine and compare the performance of the multiple imputation method with that of the boot-
strap method and the delta method for the two-parameter logistic IRT (2PL IRT) model in the
context of the common-item nonequivalent groups (CINEG, Kolen & Brennan, 2004) equating
design. The results indicated that the multiple imputation method could be taken as a practically
viable alternative to the bootstrap method and the delta method to determine the standard errors
for the IRT parameter scale transformation coefficients in most simulated conditions.
Notwithstanding the proven utility of the method for producing standard errors for the estimates of the IRT parameter scale transformation coefficients under the 2PL IRT model, it remains an open question whether this approach can replace the bootstrap method and the delta method to produce the standard errors for the estimates of the IRT true score equating coefficients, and whether it extends to other IRT models (Zhang & Zhao, 2019). In test equating, the IRT true score
equating coefficients are derived from the equated item parameters (Kolen & Brennan, 2004).
Therefore, the equated true score coefficients, like the IRT parameter scale transformation coeffi-
cients, are functions of the item parameters. The multiple imputation method, which makes use
of multiple sets of imputed item parameters to approximate the true distribution of the item
parameters, has been shown to perform as effectively as the bootstrap method and the delta
method in obtaining the standard errors for the IRT parameter scale transformation coefficients.
It is therefore hypothesized that the multiple imputation method can also be used to quantify the impact of the sampling-related uncertainty of the item parameters carried over from the calibration process on the IRT true score equating coefficients. In addition, despite being introduced
with the 2PL IRT model (Zhang & Zhao, 2019), in theory, the multiple imputation method could
also be employed as a standard error estimation procedure in test equating for other IRT models
such as the three-parameter logistic IRT (3PL IRT, Birnbaum, 1968) if the maximum likelihood
estimation method is used for item calibrations (e.g., Bock & Aitkin, 1981). In this study, a hypo-
thetical example was first used to illustrate how the multiple imputation method could be applied
to obtain the standard errors for the IRT true score equating coefficients in test equating under
the 3PL IRT model in the context of the CINEG equating design. The performance of the mul-
tiple imputation method was then examined and compared with that of the bootstrap method
and the delta method by using both the simulated and real data.
The CINEG equating design considered in the current study, similar to those in Zhang and Zhao (2019), Ogasawara (2000, 2001a, 2001b), and Wong (2015), assumed that there were two test
forms, Form U and Form V, which shared a set of internal common items. The two test forms
were administered to two groups, Group 1 and Group 2, respectively. These two groups were
nonequivalent in terms of the distribution of ability. The IRT true scores on Form V for Group 2
were equated to the scale of the IRT true scores on Form U for Group 1.

IRT True Score Equating


The 3PL IRT model (Birnbaum, 1968) considered in the current study is one of the most popular
IRT models for analyzing the dichotomously scored multiple choice items. Under the 3PL IRT
model, the probability of an examinee correctly answering an item i in a test can be expressed as

$$P(X_i = 1 \mid \theta; a_i, b_i, c_i) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}} \qquad (1)$$

where X_i = 1 denotes a correct response of the examinee to item i; θ is the level of the latent trait for the examinee; and a_i, b_i, and c_i are the discrimination, difficulty, and pseudo-guessing parameters

for item i, respectively. To resolve the indeterminacy of the location and scale of the parameters in IRT calibration, it is common practice to constrain the person ability parameters (θ) to have a mean of zero and a variance of one. Therefore, in the context of the CINEG equating design
employed in the study, if IRT calibrations are run separately on Form U for Group 1 and Form V
for Group 2, the estimated item parameters for the items in the two test forms will not be on the
same measurement scale. Consequently, the IRT true scores on the two test forms, which are calcu-
lated based on the item parameters, will be invalid for direct comparison. Therefore, to produce
comparable scores between the two test forms, IRT true score equating needs to be conducted.
As stated above, IRT true score equating typically involves two steps (Kolen & Brennan, 2004).
In the first step, the estimates of the item parameters for the items in Form V are placed onto the metric of Form U. This can be achieved by conducting a set of linear transformations on the estimates of the item parameters for the items in Form V. For the 3PL IRT model, the linear transformation functions normally have two coefficients: the slope coefficient A and the intercept coefficient B. There are at least two types of methods that can be used to find A and B: the
moment methods and the characteristic curve methods (Kolen & Brennan, 2004). The moment methods utilize the moments (mainly the mean and standard deviation) of the item parameter estimates for the common items to estimate A and B. The Mean/Mean approach (Loyd & Hoover, 1980), which uses the means of the item discrimination parameters and the means of the item difficulty parameters, and the Mean/Sigma approach (Marco, 1977), which uses the means and standard deviations of the item difficulty parameters, are the two most widely used moment methods. The characteristic curve methods use the response functions for the common items to estimate A and B. The Haebara approach (Haebara, 1980) and the Stocking-Lord approach (Stocking & Lord, 1983), which are also known as the item characteristic curve approach and the test characteristic curve approach, respectively, are the two most widely utilized characteristic curve methods. The Haebara approach obtains A and B by finding the values that minimize a loss function describing the differences between the original and the transformed item characteristic curves for the common items in the two test forms. The Stocking-Lord approach derives A and B by finding the values that minimize a loss function measuring the difference between the test characteristic curves for the common items in the two test forms. Previous studies have indicated that the characteristic curve methods generally produce more accurate and stable estimates than the moment methods (Baker & Al-Karni, 1991; Stocking & Lord, 1983). More details about the moment methods and the characteristic curve methods can be found in Kolen and Brennan (2004).
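For the common items, both moment methods reduce to closed-form expressions, as in the following minimal R sketch (illustrative only; parU and parV are hypothetical data frames holding the separately calibrated a and b estimates for the common items, and the transformation places Form V on the Form U metric via b* = Ab + B and a* = a/A):

```r
# Mean/Mean and Mean/Sigma linking coefficients from the common items.
# parU and parV are assumed data frames with columns a (discrimination)
# and b (difficulty) for the common items of Form U and Form V.
mean_mean <- function(parU, parV) {
  A <- mean(parV$a) / mean(parU$a)      # since a* = a/A on the Form U scale
  B <- mean(parU$b) - A * mean(parV$b)  # so that b* = A*b + B matches mean(b_U)
  c(A = A, B = B)
}

mean_sigma <- function(parU, parV) {
  A <- sd(parU$b) / sd(parV$b)          # match the SDs of the difficulties
  B <- mean(parU$b) - A * mean(parV$b)  # then match the means
  c(A = A, B = B)
}
```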
The second step of IRT true score equating involves linking the true scores. After placing the item parameters for the items in Form V onto the scale of Form U by using the two parameter scale transformation coefficients (i.e., A and B) obtained in the first step, the IRT true scores of Form V, which are calculated based on the equated item parameters, can be linked to the scale of the true scores of Form U. The IRT true score functions for Form U and Form V can be expressed as
$$\xi(\theta) = \sum_{i=1}^{n_U} P(X_i = 1 \mid \theta; a_i, b_i, c_i) \qquad (2)$$

and

$$\eta(\theta) = \sum_{i=1}^{n_V} P(X_i = 1 \mid \theta; a_i, b_i, c_i) \qquad (3)$$

respectively, where n_U and n_V in Equations 2 and 3 are the numbers of items in Form U and Form V, respectively, and in Equation 3,

$$P(X_i = 1 \mid \theta; a_i, b_i, c_i) = c_i + (1 - c_i)\,\frac{\exp\!\left[\frac{a_i}{A}(\theta - A b_i - B)\right]}{1 + \exp\!\left[\frac{a_i}{A}(\theta - A b_i - B)\right]} \qquad (4)$$

To equate the IRT true scores, for a given true score ξ on Form U, the corresponding estimate of the equated true score η̂ on Form V can be found in the following two steps (Kolen & Brennan, 2004). First, a numerical approximation method is used to find the estimate of θ (i.e., θ̂) that corresponds to the given true score ξ in Equation 2. Second, the estimate of the equated true score η̂ on Form V is determined by replacing θ with θ̂ in Equation 3.
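To make these two steps concrete, the following R sketch is a minimal implementation under Equations 1-4 (an illustration, not the author's code): p3pl() is the 3PL response function of Equation 1, the two test characteristic curves implement Equations 2 and 3, and uniroot() carries out the numerical approximation in the first step. The parameter lists parU and parV and the linking coefficients A and B are assumed given.

```r
# 3PL probability (Equation 1); plogis(x) = exp(x) / (1 + exp(x))
p3pl <- function(theta, a, b, c) c + (1 - c) * plogis(a * (theta - b))

# True score functions (Equations 2 and 3); parU and parV are assumed lists
# with numeric vectors a, b, c, and Form V is rescaled by A and B (Equation 4)
tcc_U <- function(theta, parU) sum(p3pl(theta, parU$a, parU$b, parU$c))
tcc_V <- function(theta, parV, A, B)
  sum(p3pl(theta, parV$a / A, A * parV$b + B, parV$c))

# Two-step true score equating for one Form U true score xi
equate_true_score <- function(xi, parU, parV, A, B) {
  # Step 1: solve xi(theta) = xi for theta-hat on a wide interval
  theta_hat <- uniroot(function(th) tcc_U(th, parU) - xi,
                       interval = c(-10, 10))$root
  # Step 2: evaluate the rescaled Form V true score function at theta-hat
  tcc_V(theta_hat, parV, A, B)
}
```

A root exists only when ξ lies between the sum of the Form U pseudo-guessing parameters and n_U, which is why very low number-correct scores are typically handled by ad hoc rules in practice (Kolen & Brennan, 2004).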

Methods for Estimating the Standard Errors of the IRT True Score Equating
Coefficients
Given that the IRT true score equating coefficients are derived based on the item parameters,
they are hence subject to errors carried over from item calibrations due to sampling variation
(Battauz, 2015a). This means that the estimates of the true score equating coefficients could vary if different equating samples are involved. The variability of the estimates of the true score
equating coefficients in test equating can be indexed by the standard errors of the IRT true score
equating coefficients (Kolen & Brennan, 2004). As stated above, one of the primary purposes of
the study is to introduce the multiple imputation method (Zhang & Zhao, 2019) as an alternative
to the bootstrap method (Kolen, 1985; Kolen & Brennan, 2004; Tsai et al., 2001) and the delta
method (Battauz, 2013; Ogasawara, 2001a; Wong, 2015) to obtain the standard errors for the esti-
mates of the IRT true score equating coefficients. In the following sections, the application of the multiple imputation method is illustrated first, followed by an introduction to the bootstrap and delta methods and a comparison across the three procedures.

Multiple Imputation Method


The multiple imputation method is a procedure which uses multiple sets of imputed item param-
eter values to estimate the standard errors of equating (Zhang & Zhao, 2019). It is known that,
under suitable regularity conditions, the estimators of the item parameters are asymptotically multivariate normally distributed in large samples if the maximum likelihood estimation method is used (Andersson & Wiberg, 2017; Mislevy & Sheehan, 1989; Mislevy et al., 1994; Yang et al., 2012). For the IRT models, the limiting distribution of the maximum likelihood estimators of the item parameters (γ̂) follows a multivariate normal distribution with mean vector γ₀ and variance-covariance matrix acov(γ₀),

$$\sqrt{N}\,(\hat{\gamma} - \gamma_0) \;\xrightarrow{d}\; \mathrm{MVN}\big(0,\, \mathrm{acov}(\gamma_0)\big) \qquad (5)$$

where γ₀ and acov(γ₀) represent the vector of the unknown true values of the item parameters and the inverse of the Fisher information matrix, respectively, and MVN denotes the multivariate normal distribution. In large samples, the true distribution of the item parameters can be reasonably approximated by MVN(γ̂, acov(γ̂)), where γ̂ and acov(γ̂) are the maximum likelihood estimates of the item parameters and the associated variance-covariance matrix, respectively (Mislevy et al., 1994; Thissen & Wainer, 1990; Yang et al., 2012). That is, given γ̂ and acov(γ̂), the true distribution of the item parameters can be reasonably well approximated by randomly drawing multiple sets of plausible values from MVN(γ̂, acov(γ̂)) (Mislevy et al., 1994; Thissen & Wainer, 1990; Yang et al., 2012). These imputed item parameter values can be used to examine the impact of the uncertainty of the item parameters carried over from the item calibration process on estimates or statistics that are derived from the item parameter estimates (e.g., Raju et al., 2009; Yang et al., 2012).
Given that the derivations of the IRT parameter scale transformation coefficients and the true
score equating coefficients in equating are based on the estimated item parameters, the random
imputations of the item parameters can also be used to estimate the standard errors for the
equating coefficients. Zhang and Zhao (2019) have shown how the imputed item parameters were
used to obtain the standard errors for the two IRT parameter scale transformation coefficients for the 2PL IRT model in the context of the CINEG equating design.

Figure 1. The multiple imputation method for estimating the standard errors for the IRT true score equating coefficients.
The application of the multiple imputation method to obtain the standard errors for the IRT
true score equating coefficients mainly consists of five steps, which are graphically depicted in
Figure 1. To better illustrate the application of this method, a hypothetical example is introduced. Suppose that we have two test forms, Form U and Form V, each of which is composed of five items. Items 1, 2, and 3 in each form are the common items shared between the two tests. The sample sizes for the two test forms are both 5000. The true scores of Form V are to be equated to the scale of Form U. Although this hypothetical example is somewhat
unrealistic, it allows us to easily present the values of the estimated item parameters, the vari-
ance-covariance matrix of the item parameter estimates, examples of the random imputations of
the item parameters, and the derived IRT true score equating coefficients, which facilitates the
illustration of the different steps of the application of this method. The steps for applying the
multiple imputation method to obtain the standard errors for the IRT true score equating coeffi-
cients are described below.
Step 1. The first step of the multiple imputation method involves running separate IRT calibrations on the original data for Form U and Form V to obtain the item parameter estimates (γ̂) and the associated variance-covariance matrix (acov(γ̂)) for the items in the two test forms.

Table 1. Item parameter estimates and the associated variance-covariance matrix for the items in Form U in the hypothetical example.

                                    Variance-covariance matrix (acov(γ̂))
                       Estimate     Item 1          Item 2          Item 3          Item 4          Item 5
Item    Parameter      (γ̂)          a    b    c     a    b    c     a    b    c     a    b    c     a    b    c
Item 1 a 1.540 .044
b 1.881 .033 .033
c .188 .003 .008 .006
Item 2 a .929 .003 .002 .000 .014
b .932 .000 .000 .000 .018 .044
c .172 .001 .000 .000 .004 .013 .005
Item 3 a 1.272 .003 .002 .000 .006 .005 .000 .041
b .083 .003 .002 .000 .000 .000 .000 .014 .014
c .138 .001 .001 .000 .000 .000 .000 .006 .005 .002
Item 4 a .664 .000 .000 .000 .000 .000 .000 .005 .001 .000 .020
b .966 .007 .004 .000 .001 .000 .000 .002 .001 .000 .016 .068
c .178 .001 .001 .000 .000 .000 .000 .001 .000 .000 .007 .015 .004
Item 5 a 1.773 .002 .003 .000 .001 .002 .000 .010 .005 .001 .001 .003 .000 .372
b 2.110 .001 .001 .000 .001 .001 .000 .001 .000 .000 .002 .000 .000 .114 .065
c .319 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .008 .000 .000

Table 2. Item parameter estimates and the associated variance-covariance matrix for the items in Form V in the hypothetical example.

                                    Variance-covariance matrix (acov(γ̂))
                       Estimate     Item 1          Item 2          Item 3          Item 4          Item 5
Item    Parameter      (γ̂)          a    b    c     a    b    c     a    b    c     a    b    c     a    b    c
Item 1 a 1.337 .031
b 2.013 .032 .044
c .196 .003 .010 .006
Item 2 a 1.047 .003 .002 .000 .016
b 1.258 .001 .000 .000 .021 .046
c .190 .001 .001 .000 .004 .014 .006
Item 3 a 1.309 .000 .000 .000 .004 .004 .000 .062
b .101 .004 .004 .000 .001 .001 .000 .029 .026
c .175 .002 .001 .000 .000 .000 .000 .012 .010 .004
Item 4 a .733 .001 .000 .000 .001 .001 .000 .004 .001 .000 .027
b 1.249 .003 .003 .000 .002 .002 .000 .004 .003 .001 .010 .046
c .180 .001 .000 .000 .000 .000 .000 .000 .000 .000 .008 .009 .003
Item 5 a 1.900 .002 .005 .000 .000 .003 .000 .009 .002 .000 .005 .006 .001 .254
b 1.460 .000 .000 .000 .000 .000 .000 .007 .003 .001 .002 .000 .000 .022 .014
c .293 .000 .000 .000 .000 .000 .000 .001 .000 .000 .000 .000 .000 .009 .001 .001

The values of γ̂ and acov(γ̂) for the items in Form U and Form V in the hypothetical example are presented in Tables 1 and 2, respectively. For example, the estimated item discrimination (a), difficulty (b), and guessing (c) parameters for the first item in Form U are 1.540, 1.881, and 0.188, respectively. The standard errors for the item parameter estimates are the positive square roots of the diagonal elements of the variance-covariance matrix in the tables. It is notable that the off-diagonal elements in both variance-covariance matrices in this hypothetical example, which represent the covariances between the estimates of the item parameters, are not all zero. This suggests that the parameter uncertainty should be accounted for by considering these covariances (Yang et al., 2012).
Step 2. In the second step, a set of plausible values for the item parameters (γ̃) for each of the two test forms is randomly drawn from a multivariate normal distribution with mean vector γ̂ and variance-covariance matrix acov(γ̂) (i.e., MVN(γ̂, acov(γ̂))), both of which are obtained for each test form in the first step.

Table 3. Random imputations of the item parameters for the items in Form U in the hypothetical example.
Item      Parameter     γ̃₁        γ̃₂        γ̃₃        …         γ̃₁₀₀₀
Item 1 a 1.326 1.655 1.472 … 1.356
b 2.168 1.790 1.738 … 2.004
c .113 .203 .300 … .160
Item 2 a .954 .851 .924 … .969
b .930 .922 .681 … 1.143
c .160 .167 .295 … .052
Item 3 a 1.253 1.399 1.648 … 1.286
b .158 .189 .046 … .134
c .168 .181 .135 … .138
Item 4 a .689 .563 .488 … .701
b 1.163 1.142 .883 … 1.287
c .216 .180 .114 … .247
Item 5 a 2.210 1.326 1.324 … 1.384
b 2.282 1.981 2.490 … 1.962
c .351 .298 .318 … .292

Table 4. Random imputations of the item parameters for the items in Form V in the hypothetical example.
Item      Parameter     γ̃₁        γ̃₂        γ̃₃        …         γ̃₁₀₀₀
Item 1 a 1.058 1.217 1.432 … 1.441
b 2.400 2.124 1.879 … 1.706
c .171 .137 .244 … .333
Item 2 a .910 .785 .893 … .916
b 1.366 1.410 1.592 … 1.328
c .190 .225 .088 … .178
Item 3 a 1.555 1.138 1.071 … 1.099
b .155 .181 .137 … .188
c .206 .072 .093 … .051
Item 4 a .808 .654 .885 … .459
b .977 1.241 1.203 … .867
c .168 .155 .200 … .068
Item 5 a 1.005 1.507 1.898 … 2.567
b 1.500 1.523 1.545 … 1.329
c .268 .280 .299 … .303

In the hypothetical example, the first sets of imputed values for the item parameters for Form U and Form V are presented in the third column (γ̃₁) of Tables 3 and 4, respectively. For example, as shown in Table 3, the first set of randomly imputed values for the item discrimination (a), difficulty (b), and guessing (c) parameters for the first item in Form U are 1.326, 2.168, and 0.113, respectively.
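In R, the drawing step is a single call to a multivariate normal sampler. The sketch below is a minimal illustration, assuming gamma_hat is the stacked vector of a/b/c estimates for one form from Step 1 and acov_gamma is the associated variance-covariance matrix (both hypothetical names):

```r
# Step 2 sketch: draw plausible item parameter values from
# MVN(gamma_hat, acov(gamma_hat)) with MASS::mvrnorm()
library(MASS)
gamma_tilde <- mvrnorm(n = 1, mu = gamma_hat, Sigma = acov_gamma)

# All 1000 imputations at once: a 1000 x length(gamma_hat) matrix,
# one imputed parameter set per row
gamma_tilde_all <- mvrnorm(n = 1000, mu = gamma_hat, Sigma = acov_gamma)
```

Because the full covariance matrix is used, the nonzero off-diagonal covariances noted in Tables 1 and 2 are propagated into the imputations.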
Step 3. In this step, based on the imputed item parameter values (γ̃) for the common items in the two test forms obtained at Step 2, the IRT parameter scale transformation coefficients (i.e., A and B) are first estimated using the moment methods and/or the characteristic curve methods. The estimated A and B are then applied to transform the imputed item parameters of Form V to the scale of Form U. After placing the imputed item parameters from the two test forms onto the same metric, IRT true score equating is performed to obtain the Form V true score equivalents of the Form U true scores. In this hypothetical example, the estimated linking coefficients A and B calculated using the Haebara approach on the first set of imputed item parameter values (γ̃₁) for the two test forms are 1.211 and 0.184, respectively. The Form V true score equivalents (η̂) of the Form U true scores (ξ) that were estimated based on the first set of imputed item parameters are presented in the second column of Table 5. For example, the estimated Form V true score equivalent of a Form U true score of 2 is 2.121.
Step 4. Steps 2 and 3 are repeated a considerable number of times (e.g., 1000). In the hypothetical example, the 2nd (γ̃₂), 3rd (γ̃₃), and 1000th (γ̃₁₀₀₀) sets of imputed item parameter values for Form U and Form V are shown in Tables 3 and 4, respectively. The equated IRT true scores that were derived based on the three sets of imputed item parameters using the

Table 5. Form V true score equivalents (η̂) of Form U true scores (ξ) estimated using the imputed item parameter values.

Form U true score ξ    η̂(γ̃₁)     η̂(γ̃₂)     η̂(γ̃₃)     …    η̂(γ̃₁₀₀₀)    Standard deviation
1                      –          –          –          …    1.156        .142
2                      2.121      2.034      1.974      …    2.102        .083
3                      2.966      2.963      2.938      …    2.971        .037
4                      4.014      3.961      4.043      …    3.899        .057

Haebara approach are presented in Table 5. For example, the estimated Form V true score equivalents of a Form U true score of 3 that were obtained based on the 2nd (γ̃₂), 3rd (γ̃₃), and 1000th (γ̃₁₀₀₀) sets of imputed item parameters are 2.963, 2.938, and 2.971, respectively.
Step 5. In the last step, the standard deviation of the Form V true score equivalents for each Form U true score obtained based on the multiple sets of imputed item parameters at Step 4 is calculated. These standard deviations are the estimated standard errors for the IRT true score equating coefficients. In this hypothetical example, the standard deviations of the 1000 sets of estimated Form V true score equivalents for all the Form U true scores that were obtained based on the 1000 sets of imputed item parameters are shown in the last column of Table 5. For
example, the standard errors for the Form V true score equivalents of the Form U true scores of
2 and 3 are 0.083 and 0.037, respectively.
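Putting Steps 2 through 5 together, one possible end-to-end implementation is sketched below. It is illustrative rather than the author's code: it reuses the hypothetical p3pl() and equate_true_score() helpers sketched earlier, assumes est_U and est_V hold each form's γ̂ and acov(γ̂) from Step 1 with parameters stacked as (a₁, b₁, c₁, a₂, b₂, c₂, …), and approximates the Haebara criterion on a fixed, equally weighted θ grid, a simplification of the weighted versions used in practice.

```r
library(MASS)  # mvrnorm()

# Split a stacked draw (a1, b1, c1, a2, b2, c2, ...) into a/b/c vectors
as_par <- function(g) {
  k <- length(g) / 3
  list(a = g[seq(1, 3 * k, 3)],
       b = g[seq(2, 3 * k, 3)],
       c = g[seq(3, 3 * k, 3)])
}

# Haebara criterion: squared differences between the Form U and rescaled
# Form V item characteristic curves for the common items over a theta grid
haebara_AB <- function(parU, parV, common, grid = seq(-4, 4, by = 0.1)) {
  loss <- function(x) {
    A <- x[1]; B <- x[2]
    sum(sapply(common, function(j)
      sum((p3pl(grid, parU$a[j], parU$b[j], parU$c[j]) -
           p3pl(grid, parV$a[j] / A, A * parV$b[j] + B, parV$c[j]))^2)))
  }
  optim(c(1, 0), loss)$par
}

M <- 1000
common <- 1:3   # items 1-3 are the common items in the example
xi <- 2:4       # Form U true scores to be equated
eta_hat <- replicate(M, {
  # Step 2: one imputation per form from MVN(gamma_hat, acov(gamma_hat))
  parU <- as_par(mvrnorm(1, est_U$gamma_hat, est_U$acov))
  parV <- as_par(mvrnorm(1, est_V$gamma_hat, est_V$acov))
  # Step 3: link the two forms, then equate the true scores
  AB <- haebara_AB(parU, parV, common)
  sapply(xi, equate_true_score, parU = parU, parV = parV, A = AB[1], B = AB[2])
})
# Step 5: the SDs across the M imputations are the estimated standard errors
se_mi <- apply(eta_hat, 1, sd)
```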

Bootstrap Method
The bootstrap method is a resampling approach which can be applied to estimate the standard
error for an estimator when its standard error is mathematically intractable (Efron & Tibshirani,
1993). This method typically involves repeated analyses on a large number of bootstrap samples.
These bootstrap samples can be randomly drawn or generated from the original dataset in either
a nonparametric or parametric fashion. Only the nonparametric bootstrap method was considered
in the current study. For the nonparametric bootstrap method, multiple sets of bootstrap samples
are normally obtained by randomly drawing cases from the original data with replacement. The statistics that are relevant to the estimator of interest can then be derived from these bootstrap samples. The bootstrap standard error of the estimator can then be calculated as the standard deviation of these statistics (Patton et al., 2014).
The bootstrap method has been applied to estimate the standard errors for the IRT true score
equating coefficients (e.g., Kolen & Brennan, 2004; Tsai et al., 2001). Figure 2 shows the proced-
ure for applying the bootstrap method to obtain the standard errors for the IRT true score equat-
ing coefficients. As shown in the figure, the application of the bootstrap technique includes five
steps, which are described below.
Step 1. One bootstrap dataset is generated by randomly drawing cases with replacement from
the original data of Form U for Group 1. The number of cases of the bootstrap dataset is the
same as the sample size of the original data. Another bootstrap dataset is generated in the same
way from the original data of Form V for Group 2.
Step 2. IRT calibration is run separately on each of the two bootstrap datasets obtained at Step 1 to get the two sets of item parameter estimates (γ̂).
Step 3. The two sets of item parameter estimates (γ̂) for the common items obtained at Step 2
are used to estimate the two IRT parameter scale transformation coefficients (i.e., A and B) using
the moment methods and/or the characteristic curve methods. The estimates of A and B are used
to conduct linear transformations to rescale the item parameters of Form V obtained at Step 2 to
the scale of Form U. Based on the equated item parameters, IRT true score equating is performed
to obtain the Form V true score equivalents of the Form U true scores.
Step 4. Steps 1 to 3 are repeated a large number of times (e.g., 1000).

Figure 2. The bootstrap method for estimating the standard errors for the IRT true score equating coefficients.

Step 5. The bootstrap standard errors for the IRT true score equating coefficients are obtained
by calculating the standard deviations of the Form V true score equivalents for the Form U true
scores that are obtained at Step 4.
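A compact sketch of this resampling loop is given below. It is illustrative only: the study calibrated with flexMIRT, whereas here the mirt package stands in for the calibration step, resp_U and resp_V are hypothetical 0/1 response matrices for the two forms, and the linking and equating helpers (haebara_AB(), equate_true_score(), common, xi) are reused from the earlier sketches.

```r
library(mirt)
B <- 1000
eta_boot <- replicate(B, {
  # Step 1: resample examinees with replacement, keeping the original n
  U_b <- resp_U[sample(nrow(resp_U), replace = TRUE), ]
  V_b <- resp_V[sample(nrow(resp_V), replace = TRUE), ]
  # Step 2: separate 3PL calibrations on the two bootstrap datasets
  itemsU <- coef(mirt(U_b, 1, itemtype = "3PL"),
                 IRTpars = TRUE, simplify = TRUE)$items
  itemsV <- coef(mirt(V_b, 1, itemtype = "3PL"),
                 IRTpars = TRUE, simplify = TRUE)$items
  parU <- list(a = itemsU[, "a"], b = itemsU[, "b"], c = itemsU[, "g"])
  parV <- list(a = itemsV[, "a"], b = itemsV[, "b"], c = itemsV[, "g"])
  # Step 3: estimate A and B from the common items, then equate
  AB <- haebara_AB(parU, parV, common)
  sapply(xi, equate_true_score, parU = parU, parV = parV, A = AB[1], B = AB[2])
})
# Step 5: bootstrap SEs are the SDs over the B replications
se_boot <- apply(eta_boot, 1, sd)
```

The contrast with the multiple imputation loop is that each of the B iterations here repeats a full IRT calibration, which is where the computational burden discussed later comes from.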

Delta Method
The delta method is another popular approach to obtaining the standard errors for the estimates
of the IRT true score equating coefficients (Ogasawara, 2001a; Wong, 2015). It is an analytic approach that derives mathematical formulas approximating the standard error of a statistic that is a function of other statistics for which standard error expressions already exist (Kolen & Brennan, 2004). In IRT true score equating, the equated true scores are functions of the estimated item parameters (i.e., γ̂) as well as of the estimates of the IRT parameter scale transformation coefficients (i.e., Â and B̂). Therefore, given the availability of the variance-covariance matrix of γ̂ as well as the derived variance-covariance matrix for γ̂, Â, and B̂, the delta method can be used to derive the mathematical formulas for computing the standard errors for the IRT true score equating coefficients.

Let β = (γ′, A, B)′ be a vector that contains the item parameters for the items in Form U and Form V (γ′) as well as the IRT parameter scale transformation coefficients (A and B). By using the delta method, the asymptotic variance of the estimated Form V true score equivalent (η̂) of a given Form U true score (ξ) can be obtained as follows (Ogasawara, 2001a):

$$\mathrm{avar}(\hat{\eta}) = \frac{\partial \eta}{\partial \beta'}\, \mathrm{acov}(\hat{\beta})\, \frac{\partial \eta}{\partial \beta} \qquad (6)$$

where ∂η/∂β is a vector whose elements are the partial derivatives of the true score η with respect to the item parameters (γ′) and the two linking coefficients (A and B), and acov(β̂) is the variance-covariance matrix for the estimates of the item parameters (γ̂′) and the two linking coefficients (Â and B̂). The asymptotic standard error of η̂ is the positive square root of avar(η̂).
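When the partial derivatives are tedious to derive analytically, Equation 6 can still be evaluated numerically. The sketch below is a generic illustration (not the equateIRT implementation), assuming a hypothetical function eta_fun() that maps the stacked vector β = (γ′, A, B)′ to the equated true score, along with its estimate beta_hat and covariance matrix acov_beta:

```r
# Delta-method standard error for one equated true score (Equation 6)
library(numDeriv)
g <- grad(eta_fun, beta_hat)            # numerical partial derivatives of eta
avar <- drop(t(g) %*% acov_beta %*% g)  # quadratic form in Equation 6
se_delta <- sqrt(avar)                  # asymptotic standard error of eta-hat
```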
Ogasawara (2001a) first applied the delta method to derive the formulas for calculating the
standard errors for the IRT true score equating coefficients for the 3PL IRT model in the context
of the CINEG equating design. Wong (2015) used the delta method to derive the formulas for
computing the standard errors for the true score equating coefficients that are calculated using
the moment methods for the polytomous IRT models including the generalized partial credit
model (GPCM, Muraki, 1992) and the graded response model (GRM, Samejima, 1969). Based on
the work of Ogasawara (2001b) and Wong (2015), Zhang (2019) recently reformulated a delta
approach to obtaining the standard errors for the IRT true score equating coefficients that are
estimated using the characteristic curve methods for the GPCM.

Comparison between the Multiple Imputation Method, the Bootstrap Method and
the Delta Method
To some extent, the multiple imputation method and the bootstrap method are similar because
both of them depend on multiple sets of replicated item parameters to estimate standard errors.
However, they differ in the way that the replicated item parameters are generated, which deter-
mines the advantage of the multiple imputation method over the bootstrap method in terms of
the computational efficiency. The bootstrap method obtains the replicated item parameters by running IRT calibrations on a large number of bootstrap samples. This significantly increases the computational burden, making the bootstrap method a time-consuming and computationally intensive approach (Kolen & Brennan, 2004), especially when dealing with long tests, large
sample sizes, complex linking designs (e.g., equating among multiple forms), and/or complicated
IRT models in test equating. In contrast, for the multiple imputation method, the replicated item
parameters are the imputed item parameter values. These imputed item parameters can be directly obtained by drawing plausible values from a multivariate normal distribution (i.e., MVN(γ̂, acov(γ̂))). The mean vector and variance-covariance matrix of the distribution are the estimates of the item parameters (γ̂) and their variance-covariance matrix (acov(γ̂)) that are
derived from the IRT calibrations on the original data. This tremendously reduces the computa-
tional burden, which makes the multiple imputation method more attractive than the bootstrap
method in terms of the computational efficiency.
The advantage of the delta method is that the computation time can be minimized, provided that the required mathematical formulas for the standard error expressions have been developed.
Indeed, compared to the bootstrap method and the multiple imputation method, the application
of the delta method seems simpler as it doesn’t depend on repeated equating analyses on a large
number of replicated item parameters. It just needs a single IRT calibration for each of the tests
to be equated to obtain the item parameter estimates and the associated variance-covariance
matrix which are subsequently used in the derived equations for computing the standard errors
for the IRT true score equating coefficients. However, deriving the formulas for the asymptotic standard error expressions could become very difficult or intractable when the equating design, IRT models, and/or equating methods are complex (Kolen &
Brennan, 2004; Zhang & Zhao, 2019). In addition, when any of these factors relevant to equating
(e.g., IRT models, equating methods, and/or equating design) change, specific mathematical
standard error expressions must be reformulated. For example, the formulas derived by the delta

method for the standard errors of the true score equating coefficients for the GPCM could not be
directly used to estimate the standard errors for the GRM (Wong, 2015). In contrast, the multiple
imputation method is not subject to these limitations because its application does not rely on the derivations of such complicated mathematical formulas.

Simulation Study
The simulated data were used to evaluate and compare the performance of the multiple imput-
ation method with that of the bootstrap and delta methods in obtaining the standard errors for
the IRT true score equating coefficients in test equating for the 3PL IRT model under vari-
ous conditions.

Method
This simulation study, as stated above, assumed that two tests, Form U and Form V, which
shared a set of common items, were administered to two independent and nonequivalent groups,
Group 1 and Group 2, respectively. The IRT true scores of Form V were assumed to be equated
to the scale of Form U. The generating item parameters were the item parameter estimates
obtained from the real data for two mathematics tests. Both test forms were composed of 36 multiple-choice items, each of which had 4 response options. There were 18 common items between the two tests. The manipulated factors for generating data were the number of common items and the sample size. Both factors have been identified as important factors
have been considered in many simulation studies of test equating (e.g., Andersson, 2018; Hanson
& Beguin, 2002, Kim, 2006; Kim & Cohen, 1998, 2002; Ogasawara, 2000, 2001a, 2001b; Zhang &
Zhao, 2019). Generally, the results of previous simulation studies suggested that larger sample
size and larger number of common items lead to more accurate equating coefficients and less
random equating errors (Battauz, 2015a; Kolen & Brennan, 2004). There were two levels for the
number of common items in the study: 12 (approximately 33.3%) and 18 (50%). For simulating
the data with 12 common items, twelve items were randomly selected from the total of the 18
common items. The remaining six common items in each test were taken as unique items. The
first level (12 common items) follows the rule of thumb suggesting that a common-item set should contain at least 20% of the total number of items in a test (Cook & Eignor, 1991; Kolen & Brennan, 2004). The second level (18 common items) simulated a condition with a larger number of common items, in which the common items made up 50% of the total test length (e.g., Kaskowitz & De Ayala, 2001; Kim &
Cohen, 1998; Ogasawara, 2000; Zhang & Zhao, 2019). It is important to consider the condition with a large number of common items because the number of common items has a substantial effect on the variability of equating coefficients when the sample size is small (Battauz, 2015a). Two levels of sample size were considered: 1000 and 3000. Both levels have been
considered in the study of Hanson and Beguin (2002) which compared the performances of the
separate and concurrent item parameter estimation procedures in IRT equating under the 3PL
IRT model. The first level was chosen because a sample size of 1000 or more is strongly recom-
mended for mitigating the convergence problems in model calibration and obtaining reasonably
accurate item parameter estimates with the 3PL IRT model (De Ayala, 2013). The second level
with a larger sample size of 3000 examinees was chosen based on the usual practice that a rela-
tively large sample is needed to obtain stable estimates for the parameters of items in test equat-
ing (Kim, 2006; Kim et al., 2005). The generating ability parameters for the examinees in Group 1 for Form U were randomly drawn from a standard normal distribution (i.e., θ₁ ~ N(0, 1)). The ability parameters for the examinees in Group 2 for Form V were generated from N(0.5, 1.2²) (i.e., θ₂ ~ N(0.5, 1.2²)) to simulate a condition in which the samples administered the two test forms differed substantially in ability (i.e., nonequivalent groups). This distribution has been extensively used in previous simulation studies of test equating (e.g., Andersson, 2018; Kim, 2006; Ogasawara, 2000, 2001a, 2001b; Wong, 2015; Zhang & Zhao, 2019). Crossing the levels of the two factors resulted in a total of four simulated conditions (two levels for the number of common items × two levels for the sample size). All the item response matrices were generated with the 3PL IRT model. Each of the four simulated conditions was replicated 100 times.
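As a sketch of the data-generating step under one condition (sample size 1000), assuming genU and genV are hypothetical lists holding the generating a/b/c parameters for the two forms:

```r
# Generate dichotomous responses under the 3PL IRT model for two
# nonequivalent groups (one simulated condition, n = 1000 per group)
set.seed(2020)
n <- 1000
theta1 <- rnorm(n, mean = 0,   sd = 1)    # Group 1, Form U
theta2 <- rnorm(n, mean = 0.5, sd = 1.2)  # Group 2, Form V (nonequivalent)

sim3pl <- function(theta, par) {
  # n x n_items matrix of correct-response probabilities (Equation 1)
  P <- sapply(seq_along(par$a), function(i)
    par$c[i] + (1 - par$c[i]) * plogis(par$a[i] * (theta - par$b[i])))
  # Bernoulli draws with those probabilities
  matrix(rbinom(length(P), 1, P), nrow = length(theta))
}

resp_U <- sim3pl(theta1, genU)
resp_V <- sim3pl(theta2, genV)
```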
The simulated response data were calibrated with the 3PL IRT model using the marginal max-
imum likelihood (MML, Bock & Aitkin, 1981; Bock & Lieberman, 1970) estimation method. All
the model calibrations were conducted with the computer software FlexMIRT 3.0 (Cai, 2015).
The convergence criteria were the same as those used in the studies of Paek and Cai (2014) and
Zhang and Zhao (2019). More specifically, the maximum allowed numbers of E-steps and M-steps were 1000 and 500, respectively. The E-step, M-step, and supplemented expectation maximization (SEM) convergence tolerances were 10⁻⁶, 10⁻⁹, and 10⁻³, respectively. Given that the generating item parameters were obtained from the model calibrations on the real data in which each item had four response options, the prior distribution imposed on the parameters relating to guessing (or, more precisely, on the logit of the item guessing parameter) for all the items was N(−1.09, 0.5) (Cai, 2015). The SEM procedure, which has been shown to be not only
practically feasible but also less affected by small sample size than other available procedures such
as the empirical cross-product (Paek & Cai, 2014; Zhang & Zhao, 2019), was used to obtain the
variance-covariance matrices for the item parameter estimates which were required by both the
delta method and the multiple imputation method.
The IRT test equating was conducted using the R programming language (R Development Core Team, 2016) with the package equateIRT (Battauz, 2015b). The parameter scale transformation coefficients as well as the true score equating coefficients were estimated using the two moment methods and the two characteristic curve methods. The standard errors for the equating coefficients were obtained using the three procedures discussed above. The delta method was directly applied with the package equateIRT. The R programming language was used to write code to implement the bootstrap method and the multiple imputation method. The code can be made available upon request. For the bootstrap method, 1000 bootstrap samples were generated in each replication for estimating the standard errors. For the multiple imputation method, 1000 sets of imputed item parameters were generated for computing the standard errors in every replication.
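For reference, a delta-method run with equateIRT follows the pattern sketched below. This is a hedged illustration based on the package documentation rather than the study's actual script: function and argument names may differ across equateIRT versions, and coef_list and var_list are assumed to hold each form's item parameter estimates and covariance matrices in the format modIRT() expects.

```r
library(equateIRT)
# Bundle the two forms' estimates and covariance matrices
mods <- modIRT(coef = coef_list, var = var_list, display = FALSE)
# Direct equating between the two forms (Stocking-Lord shown here;
# "mean-mean", "mean-sigma", and "Haebara" are the other options, and
# the order of 'which' sets the direction of the transformation)
link <- direc(mods = mods, which = c(1, 2), method = "Stocking-Lord")
summary(link)                                 # A-hat, B-hat with delta-method SEs
eq <- score(link, method = "TSE", se = TRUE)  # equated true scores with SEs
```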
Because the true standard errors for the estimates of A and B as well as the equated true scores
are unknown, the empirical standard errors, which were the empirical standard deviations of the
estimated equating coefficients calculated over 10000 replications (Paek & Cai, 2014; Zhang, 2019;
Zhang & Zhao, 2019), were used as the criterion standard errors in the study.

Results
The empirical standard errors as well as the means and standard deviations of the estimated
standard errors for the estimates of A and B are summarized in Table 6. Generally, for the par-
ameter scale transformation coefficients A and B that were derived using the two characteristic
curve methods, the standard errors produced by the multiple imputation method are very close
or almost identical to the criterion empirical standard errors as well as those yielded by the boot-
strap method and the delta method under all the four simulated conditions. The multiple imput-
ation method and the delta method produced extremely close standard errors for the parameter
scale transformation coefficients that were calculated using the Mean/Mean approach. Both methods tended to yield slightly larger standard errors than the criterion standard errors,
Table 6. Descriptive statistics of the standard errors for the estimates of the IRT parameter scale transformation coefficients calculated over the 100 replications for the simulated data.
A B
MM MS HA SL MM MS HA SL
N CI Approach Mean SD Mean SD Mean SD Mean SD Mean SD Mean SD Mean SD Mean SD
1000 12 Empirical .086 – .086 – .063 – .068 – .065 – .068 – .058 – .058 –
Bootstrap .102 .020 .090 .010 .064 .004 .068 .004 .066 .004 .069 .004 .058 .003 .058 .003
Delta .091 .007 .099 .009 .062 .004 .068 .004 .071 .003 .072 .003 .058 .002 .059 .002
MI .091 .007 .107 .010 .064 .004 .070 .005 .071 .003 .072 .004 .059 .002 .060 .002
18 Empirical .074 – .069 – .055 – .059 – .060 – .062 – .055 – .056 –
Bootstrap .104 .030 .072 .005 .056 .003 .060 .004 .062 .004 .063 .002 .056 .002 .057 .002
Delta .079 .006 .079 .005 .055 .003 .059 .003 .065 .002 .065 .002 .056 .001 .056 .001
MI .080 .006 .082 .005 .055 .003 .060 .003 .065 .003 .065 .002 .056 .002 .057 .002
3000 12 Empirical .051 – .057 – .035 – .038 – .041 – .042 – .034 – .035 –
Bootstrap .052 .003 .058 .005 .036 .001 .039 .002 .041 .002 .042 .002 .034 .001 .034 .001
Delta .053 .002 .067 .005 .035 .001 .039 .001 .045 .002 .044 .002 .034 .001 .034 .001
MI .053 .002 .072 .005 .036 .001 .039 .001 .045 .002 .044 .002 .033 .001 .034 .001
18 Empirical .044 – .046 – .032 – .034 – .038 – .038 – .033 – .033 –
Bootstrap .045 .002 .046 .003 .032 .001 .034 .001 .038 .001 .038 .001 .032 .001 .033 .001
Delta .046 .002 .053 .002 .031 .001 .034 .001 .040 .001 .040 .001 .032 .000 .032 .001
MI .046 .002 .055 .003 .031 .001 .034 .001 .040 .001 .040 .001 .032 .001 .033 .001
Note. CI = number of common items; Empirical = empirical standard errors; Delta = delta method; Bootstrap = bootstrap method; MI = multiple imputation method; MM = Mean/Mean approach; MS = Mean/Sigma approach; HA = Haebara approach; SL = Stocking-Lord approach.
particularly when the sample size and/or the number of common items is small. The multiple imputation method tended to yield slightly larger standard errors than the delta method for the estimates of the linking coefficient A that were calculated using the Mean/Sigma approach, but the differences decreased as the sample size and/or the number of common items increased. In contrast, for the estimates of the parameter scale transformation coefficient B that were calculated using the Mean/Sigma approach, the standard errors produced by the multiple imputation method and those estimated by the delta method are almost identical under all the simulated conditions. In addition, the standard errors for the two linking coefficients estimated using the two characteristic curve methods were smaller than those for the coefficients calculated using the two moment methods, which is consistent across all the simulated conditions.

Figure 3. Standard errors for the equated IRT true scores for the simulated data.
The results for the standard errors for the IRT true score equating coefficients are shown
in Figure 3. Generally, under all the simulated conditions where the characteristic curve methods
were used, the curves for the standard errors for most of the equated scores produced by the
multiple imputation method were coincident with the curves of the criterion empirical standard
errors as well as the curves for the standard errors yielded by the bootstrap method and the delta
method. This indicates that the standard errors for almost all of the equated scores obtained
using the multiple imputation method were extremely close to those derived by the bootstrap
method and the delta method as well as the criterion empirical standard errors when the IRT
true score equating was conducted using the two characteristic curve methods. It is noted that,
compared with the bootstrap method, both the multiple imputation method and the delta method
yielded slightly larger standard errors for the estimates of the Form V true score equivalents
of the very low Form U true scores (e.g., 7, 8). But when the sample size was large, the
differences became smaller and were nearly negligible. The multiple imputation method also
performed nearly in the same way as the delta method in producing standard errors for the

Table 7. Estimates and standard errors of the IRT parameter scale transformation coefficients for the real data.
Standard error
Estimate Bootstrap Delta MI
Mean/Mean
A 1.191 .043 .044 .044
B .154 .038 .046 .046
Mean/Sigma
A 1.101 .037 .040 .043
B .151 .036 .044 .045
Haebara
A 1.113 .026 .026 .026
B .110 .026 .027 .027
Stocking-Lord
A 1.157 .027 .027 .027
B .107 .027 .028 .027
Note. Delta = delta method; Bootstrap = bootstrap method; MI = multiple imputation method.

equated true scores that were estimated using the moment methods. The bootstrap method
appeared to outperform the multiple imputation method and the delta method in obtaining the
standard errors for the equated scores that were estimated using the moment methods, especially
when the sample size was small. The multiple imputation method and the delta method tended
to yield larger standard errors than the bootstrap method for the estimates of the Form V true
score equivalents of the low to mid-range Form U true scores obtained using the moment meth-
ods. In addition, the standard errors for all the equated IRT true scores estimated using the two
characteristic curve methods were smaller than those for the coefficients derived from the two
moment methods under each of the simulated conditions, irrespective of the approaches being
used for estimating the standard errors.

Empirical Illustration
Method
The empirical data came from two mathematics problem-solving tests: Test 1 and Test 2. Both tests consisted of 35 dichotomously scored multiple-choice items, each with four response options. Twelve items that appeared in both Test 1 and Test 2 were used as common items in test equating. The numbers of examinees for the Test 1 and Test 2 data were
5384 and 4148, respectively. All the analyses relevant to IRT model calibrations and test equating
were conducted in the same way as in the simulation study. More specifically, the item parameters
were estimated with the 3PL IRT model using the MML estimation method (Bock & Aitkin, 1981;
Bock & Lieberman, 1970). The model calibrations were performed with the computer software
FlexMIRT 3.0. The convergence criteria as well as the prior distribution imposed on the guessing
parameters for the items were the same as those specified in the simulation study. IRT equating
was conducted to place the item parameters as well as the true scores of Test 2 onto their respective
scales of Test 1. The estimates of A and B were obtained by using the two moment methods and the two characteristic curve methods, all performed using the R programming language with the package equateIRT. The standard errors for the estimates of A and B and the equated true scores were estimated using the bootstrap method, the delta method, and the multiple imputation method, all conducted using the R programming language.

Results
The estimated IRT parameter scale transformation coefficients as well as the associated standard
errors obtained using the three standard error estimation procedures are presented in Table 7. The

results shown in the table are generally in line with the results of the simulated data. The standard
errors produced by the multiple imputation method were extremely close to those obtained using
the bootstrap method and the delta method for the estimates of A and B that were estimated using
the characteristic curve methods. For the estimates of A and B that were calculated using the
moment methods, the multiple imputation method and the delta method generally yielded very close standard errors. Both methods produced slightly larger standard errors than the bootstrap method. Consistently across all the standard error estimation procedures, the standard errors for the two linking coefficients calculated using the moment methods were noticeably greater than those for the coefficients estimated using the two characteristic curve methods.
Figure 4 shows the curves for the standard errors for the estimates of the IRT true score equat-
ing coefficients. In accordance with the results of the simulation study, for the estimates of the IRT true score equating coefficients estimated using the characteristic curve methods, the curves for the standard errors produced by the multiple imputation method were almost coincident with those for the standard errors obtained using the bootstrap method and the delta method. For the estimates of the equated scores that were estimated using the moment methods, the results were also consistent with the simulation study. The multiple imputation method and the delta method produced very closely matched standard errors. Both methods yielded larger standard errors than the bootstrap method for the low to mid-range equated true scores. In line with the results of
the simulation study, the standard errors for all the true score equating coefficients obtained based
on the two moment methods were larger than those for the coefficients estimated using the two
characteristic curve methods, regardless of the standard error estimation procedures being used.

Discussion
The overarching goal of the study is to examine whether the multiple imputation method can be
applied as a viable alternative to the bootstrap and delta methods to obtain the standard errors
for the IRT true score equating coefficients in test equating in the context of the CINEG equating design for the 3PL IRT model. The multiple imputation method makes use of the
imputed item parameter values to estimate the standard errors of equating coefficients (Zhang &
Zhao, 2019). These imputed item parameters are the random imputations of the item parameters
which can be obtained by randomly drawing plausible values from a multivariate normal distribu-
tion. The mean vector and variance-covariance matrix of the distribution are the item parameter
estimates and their variance-covariance matrix that are obtained from IRT calibration on the ori-
ginal data. Given that the IRT true score equating coefficients are functions of the item parame-
ters, the multiple imputation method, as illustrated in the hypothetical example, can also be taken
as an approach to obtaining the standard errors for the equated IRT true scores. In addition, it is hypothesized that the multiple imputation method, which was originally introduced with the 2PL IRT model, can also be used to determine the standard errors of equating coefficients for other IRT models (e.g., the 3PL IRT model in this study) if the maximum likelihood estimation method
is used to estimate the item parameters. In the current study, both the data simulated with differ-
ent levels of sample size and number of common items and the real data were used to examine
and compare the performance of the multiple imputation method with that of the bootstrap
method and the delta method in producing the standard errors for the true score equating coeffi-
cients under the 3PL IRT model. The results suggested that the multiple imputation method per-
formed equally as well as the bootstrap method and the delta method in yielding the standard
errors for the IRT true score equating coefficients estimated using the two characteristic curve
methods. When the moment methods were used, the multiple imputation method and the delta
method, which produced extremely close standard errors for the IRT true score equating coeffi-
cients, tended to yield slightly larger standard errors than the bootstrap method for the low to
mid-range equated true scores. In summary, the multiple imputation method performed as
effectively as the delta method under all the simulation conditions, irrespective of the test equating methods. The multiple imputation method can be taken as a practically viable alternative to both the bootstrap method and the delta method when the characteristic curve methods are used in test equating. When the moment methods are used in test equating with a small sample size, the bootstrap method slightly outperformed the multiple imputation method and the delta method.

Figure 4. Standard errors for the equated IRT true scores for the real data.
The results of the simulation study further confirm the effects of the sample size and the num-
ber of common items on the estimation of the standard errors. Generally, the results indicated that
increasing the sample size and/or the number of common items could reduce the standard errors
of equating, regardless of the standard error procedures being used. The estimated standard errors for the equating coefficients that were derived using the moment methods were uniformly larger than those for the estimates obtained using the two characteristic curve methods under each of the manipulated conditions, irrespective of the approaches being used for obtaining the standard errors. All these results are consistent with the findings of previous studies
(Andersson, 2018; Battauz, 2015a; Ogasawara, 2000, 2001a, 2001b; Wong, 2015; Zhang &
Zhao, 2019).
The application of the multiple imputation method to obtain the standard errors for the IRT
true score equating coefficients has important practical implications. It facilitates the reporting of

standard errors in practice by providing a feasible alternative approach to the bootstrap method,
especially in test equating which involves complicated linking design, IRT models, and equating
methods. The multiple imputation method is more attractive than the bootstrap method in terms of
the computational efficiency. As introduced above, the bootstrap method relies on running a consid-
erable number of model calibrations on the bootstrap data to obtain multiple sets of item parameters
that are used for estimating the standard errors of the IRT true score equating coefficients. This can make the bootstrap method very computationally intensive or even infeasible in cases involving long tests, large sample sizes, complex linking designs, and complicated IRT models, all of which could significantly increase the burden of model calibration. In contrast, the multiple imput-
ation method is more computationally efficient because it only needs a single IRT calibration on the
original data to obtain the item parameter estimates and the associated variance-covariance matrix for
each of the tests to be equated. For example, in the empirical illustration of the current study, the
computation time of using the bootstrap method to obtain the standard errors for the true score
equating coefficients was approximately three hours. Comparatively, the computation time for the multiple imputation method was only about eight minutes on the same computer; the bootstrap method was thus more than twenty times slower than the multiple imputation method.
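To make this contrast concrete, the following R code is a minimal sketch of the core of the multiple imputation procedure for a single test form; in the CINEG design, the imputation would be carried out separately for each form. The names est, vcov_mat, and equate_fun, as well as the choice of M = 1000 imputations, are assumptions introduced here for illustration, not objects defined in the study: est is the vector of item parameter estimates from one calibration, vcov_mat is the associated variance-covariance matrix, and equate_fun is a hypothetical user-supplied function that returns the equating coefficients (e.g., the equated true scores) for one set of item parameters.

```r
# A minimal sketch of the multiple imputation procedure (not the exact
# implementation used in the study); est, vcov_mat, equate_fun, and M
# are assumed inputs as described in the text above.
library(MASS)  # provides mvrnorm() for multivariate normal sampling

mi_standard_errors <- function(est, vcov_mat, equate_fun, M = 1000) {
  # Draw M imputed item parameter vectors from the asymptotic
  # multivariate normal distribution of the estimates
  draws <- MASS::mvrnorm(n = M, mu = est, Sigma = vcov_mat)
  # Recompute the equating coefficients for each imputed parameter set
  coefs <- sapply(seq_len(M), function(m) equate_fun(draws[m, ]))
  # Ensure a matrix with one row per coefficient and one column per draw
  if (is.null(dim(coefs))) coefs <- matrix(coefs, nrow = 1)
  # The standard errors are the standard deviations across the M imputations
  apply(coefs, 1, sd)
}
```

Because the M imputed parameter sets are drawn from a single calibration's asymptotic distribution, the only repeated cost is the equating computation itself, which is far cheaper than the M full recalibrations the bootstrap method requires.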
The multiple imputation method can be taken as a more practically feasible option in test
equating when the relevant delta method has not been developed. The delta method seems more appealing than
the multiple imputation method in terms of the computational efficiency. However, the applica-
tion of the delta method depends on the derivation of mathematical formulas, which can become very complicated or even intractable when the linking design, IRT models, equating methods, or other factors relating to equating are complex (Kolen & Brennan, 2004;
Zhang & Zhao, 2019). The multiple imputation method, which was shown to perform as effect-
ively as the delta method in the study, does not rest on the derivations of formulas. Therefore, in
some circumstances where the delta methods are not readily developed, the multiple imputation
method can be considered as an alternative. These circumstances are normally related to the
application of complicated IRT models, complex linking designs, and sophisticated equating
methods in test equating. First, the application of the multiple imputation method can facilitate
the reporting of standard errors of IRT true score equating involving polytomous IRT models
and/or mixed-format tests (e.g., mixed dichotomous and polytomous IRT models, Kim & Lee,
2006), where the delta approaches for standard errors have not been fully developed. Despite
being tested with the 3PL IRT model in the current study, the multiple imputation method in
theory can also be applied in test equating for the polytomous models or the mixed dichotomous
and polytomous IRT models if the maximum likelihood estimation method is used for item cali-
brations (e.g., Bock & Aitkin, 1981). For example, the delta methods for obtaining the standard
errors for the IRT true score equating coefficients estimated using the characteristic curve meth-
ods in test equating under some of the polytomous IRT models (e.g., GPCM, GRM) have not
been developed (Wong, 2015). In this case, the multiple imputation method could be adopted without additional derivation effort, facilitating the reporting of standard errors in test equating
involving the polytomous IRT models and/or the mixed-format tests. Second, the multiple imput-
ation method enables researchers and practitioners to use more robust equating methods (e.g., the characteristic curve methods), whereas developing the corresponding delta method normally requires extra effort in test equating practices.
The derivations of the mathematical formulas using the delta method for calculating the standard errors for the IRT equating coefficients estimated using the moment methods are straightforward because these equating coefficients are explicit functions of the item parameters (Ogasawara, 2000; Wong, 2015), as the sketch below illustrates.
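For example, under the mean/sigma moment method (Marco, 1977), the scale transformation coefficients depend only on the means and standard deviations of the common-item difficulty estimates. The following R sketch is illustrative only: b_x and b_y are hypothetical vectors of difficulty estimates for the common items on the two forms, and sd() uses the sample (n − 1) denominator, whereas some implementations use the population formula.

```r
# A minimal sketch of the mean/sigma scale transformation coefficients
# (Marco, 1977); b_x and b_y are hypothetical common-item difficulty
# estimates on the new form and the base form, respectively.
mean_sigma_coefs <- function(b_x, b_y) {
  A <- sd(b_y) / sd(b_x)           # slope of the scale transformation
  B <- mean(b_y) - A * mean(b_x)   # intercept of the scale transformation
  c(A = A, B = B)                  # theta on base-form scale: A * theta_x + B
}
```

Because A and B are simple closed-form functions of the item parameter estimates, their derivatives, and hence their delta-method standard errors, follow directly.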
In contrast, the equating coefficients derived from the characteristic curve methods are implicit functions of the
item parameters, which complicates the derivations of the mathematical formulas using the delta
method for estimating the standard errors of equating (Andersson, 2018; Ogasawara, 2001a,
2001b; Zhang, 2019). This may be one of the reasons why the delta method was only applied to
derive the standard error expressions for the IRT true score equating coefficients that were esti-
mated using the moment methods for the polytomous IRT models (Wong, 2015). However, the
characteristic curve methods are preferable to the moment methods in test equating because they tend
to produce more accurate and stable equating results (Baker & Al-Karni, 1991; Hanson & Beguin,
2002; Kim & Cohen, 1998; Kim & Kolen, 2007; Kim & Lee, 2006; Kolen & Brennan, 2004; Ogasawara,
2001a, 2001b). Therefore, the application of the multiple imputation method will encourage researchers
and practitioners to select the more robust characteristic curve methods in their testing practices even
though the relevant delta methods for producing the standard errors are unavailable. Finally, when the relevant delta method is not available, the multiple imputation method can be undertaken, not as an ideally efficient procedure, but as a practically feasible approach to obtaining the standard errors of IRT true score equating coefficients in test equating with complex linking designs (e.g., chain equating
designs involving multiple test forms; Kolen & Brennan, 2004). Complex linking designs involving multiple test forms are common in large-scale assessments and longitudinal studies. Utilizing the delta
method to derive the expressions for the standard errors of the IRT true score equating coefficients in such complex linking designs can be very complicated (e.g., Battauz, 2013), particularly when complex IRT models (e.g., polytomous IRT models or mixed dichotomous and polytomous IRT models) and sophisticated equating methods (e.g., the characteristic curve methods) are involved. Although the application of the multiple imputation method in such complex linking
designs will also be computationally intensive and time-consuming to some extent, it is still a better
alternative than the bootstrap method for facilitating the reporting of standard errors when the pro-
cedure based on the delta method is unavailable.
The applications of both the multiple imputation method and the delta method illustrated in the
current study rely on the availability of the variance-covariance matrix for the estimates of
the item parameters under the usual IRT parameterization. More specifically, given that the item
discrimination (a), difficulty (b), and guessing (c) parameters are used in test equating to obtain the
IRT parameter scale transformation and true score equating coefficients for the 3PL IRT model, the
variance-covariance matrices for the estimates of a, b, and c are required by both the multiple
imputation method and the delta method for estimating the standard errors for the true score
equating coefficients. However, to our knowledge, most of the currently available commercial and
open-source computer programs for the 3PL IRT model (e.g., FlexMIRT 3.0, Cai, 2015; R package mirt, Chalmers, 2012; R package ltm, Rizopoulos, 2006), for ease of item estimation, fit the IRT model in the slope-intercept parameterization rather than the usual IRT parameterization. Despite producing
the estimates for the item parameters under the usual IRT parameterization, they normally do not
provide the variance-covariance matrix for the estimates of these item parameters. One of the rea-
sons might be that the variance-covariance matrix for these item parameters under the usual IRT
parameterization has not often been used in practice. Therefore, the variance-covariance matrices for the item parameters under the slope-intercept parameterization should be converted to those for the item parameters under the usual IRT parameterization so that they can be used with the multiple
imputation method and the delta method for obtaining standard errors. This normally can be
achieved by using the delta method. For example, in the current study, the computer program
FlexMIRT 3.0 (Cai, 2015), which uses the 3PL IRT model with the slope-intercept parameterization,
produces variance-covariance matrix for the estimates of the slope parameters (a), the intercept
parameters (d), and the logit of the guessing parameters (g). The slope parameters (a) are exactly
the same as the item discrimination parameters (a) under the usual IRT parameterization. The
intercept parameters (d) represent the negative values of the products of the item discrimination
parameters (a) and the item difficulty parameters (b) under the usual IRT parameterization (i.e.,
$d = -ab$). The relationship between the parameters $g$ and $c$ can be expressed as

$$g = \log\left(\frac{c}{1 - c}\right).$$
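Accordingly, the point estimates themselves convert directly; a brief R illustration follows, where a, d, and g are hypothetical numeric vectors of item parameter estimates and plogis() is the logistic function from base R's stats package.

```r
# Converting slope-intercept estimates to the usual 3PL parameterization;
# a, d, and g are hypothetical numeric vectors of item parameter estimates.
b_par <- -d / a       # difficulty:  b = -d / a
c_par <- plogis(g)    # guessing:    c = exp(g) / (1 + exp(g))
```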
Let $(a, b, c)'$ be the vector of the item discrimination, difficulty, and guessing parameters for all the items under the usual IRT parameterization and $(a, d, g)'$ be the vector of item parameters under the slope-intercept parameterization employed by the computer program FlexMIRT 3.0. By using the delta method, the asymptotic variance-covariance matrix for the estimates of the item discrimination, difficulty, and guessing parameters (i.e., $\mathrm{acov}(\hat{a}, \hat{b}, \hat{c})$) can be obtained as follows:

$$\mathrm{acov}(\hat{a}, \hat{b}, \hat{c}) = \left[\frac{\partial(a, b, c)}{\partial(a, d, g)}\right] \mathrm{acov}(\hat{a}, \hat{d}, \hat{g}) \left[\frac{\partial(a, b, c)}{\partial(a, d, g)}\right]'$$

where

$$\frac{\partial a_i}{\partial a_j} = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases}, \qquad \frac{\partial a_i}{\partial d_j} = 0, \qquad \frac{\partial a_i}{\partial g_j} = 0,$$

$$\frac{\partial b_i}{\partial a_j} = \begin{cases} \dfrac{d_i}{a_i^2}, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases}, \qquad \frac{\partial b_i}{\partial d_j} = \begin{cases} -\dfrac{1}{a_i}, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases}, \qquad \frac{\partial b_i}{\partial g_j} = 0,$$

$$\frac{\partial c_i}{\partial a_j} = 0, \qquad \frac{\partial c_i}{\partial d_j} = 0, \qquad \frac{\partial c_i}{\partial g_j} = \begin{cases} \dfrac{\exp(g_i)}{\left[1 + \exp(g_i)\right]^2}, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases}$$
where $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, n$; $n$ is the number of items; and $\mathrm{acov}(\hat{a}, \hat{d}, \hat{g})$ is the variance-covariance matrix for the estimates of the item parameters under the slope-intercept parameterization provided by the program FlexMIRT 3.0. It is noted that a similar approach can also be applied to other IRT models (e.g., the polytomous IRT models) to derive the appropriate variance-covariance matrices for the multiple imputation method and the delta method in test equating when necessary.
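As a worked illustration of this conversion, the following R sketch builds the Jacobian implied by the partial derivatives above and applies the delta method. The function name and the parameter ordering are assumptions introduced here for illustration: acov_adg is taken to be the $3n \times 3n$ variance-covariance matrix of the estimates ordered as all slope parameters $a$, then all intercept parameters $d$, then all logit-guessing parameters $g$; the ordering actually produced by a given program may differ and would need to be rearranged accordingly.

```r
# A sketch of the delta-method conversion described above. a, d, and g are
# the estimated slope, intercept, and logit-guessing parameters for n items;
# acov_adg is their 3n x 3n variance-covariance matrix, assumed to be
# ordered as (a_1..a_n, d_1..d_n, g_1..g_n).
convert_acov_3pl <- function(a, d, g, acov_adg) {
  n <- length(a)
  # Jacobian of (a, b, c) with respect to (a, d, g); off-item entries are 0
  J <- matrix(0, nrow = 3 * n, ncol = 3 * n)
  for (i in seq_len(n)) {
    J[i, i] <- 1                                  # da_i/da_i = 1
    J[n + i, i] <- d[i] / a[i]^2                  # db_i/da_i = d_i / a_i^2
    J[n + i, n + i] <- -1 / a[i]                  # db_i/dd_i = -1 / a_i
    J[2 * n + i, 2 * n + i] <-
      exp(g[i]) / (1 + exp(g[i]))^2               # dc_i/dg_i
  }
  # Delta method: acov of (a_1..a_n, b_1..b_n, c_1..c_n)
  J %*% acov_adg %*% t(J)
}
```

The resulting matrix can then be supplied directly to the multiple imputation procedure (as the Sigma of the multivariate normal draws) or to the delta method for the equating coefficients.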
In the study, the separate calibration procedure was considered in test equating to place
the IRT true scores derived from two test forms onto a common scale in the context of the
CINEG equating design. Apart from the separate calibration method, concurrent calibration is
another popular procedure for achieving the same goal (Hanson & Beguin, 2002; Kim & Cohen,
1998). The separate calibration procedure runs separate IRT model calibrations for each of the
tests to be equated and then conducts equating to place the true scores derived from the different
tests onto the same metric. In contrast, the concurrent calibration procedure estimates the item
parameters for all the items in the tests to be equated simultaneously in a single run of IRT
model calibration. The estimates of the item parameters for all the items from the different tests
are placed directly on the same scale after calibration. The true scores calculated based on the
item parameter estimates for the tests are also on the same metric. Future research could further compare the standard errors for the true score equating coefficients derived using the multiple imputation method under separate calibration with those obtained under the concurrent calibration procedure (Andersson,
2018; Wong, 2015).
Overall, the current study introduces the multiple imputation method as an alternative to the
bootstrap method and the delta method for estimating the standard errors for the IRT true score
equating coefficients in the context of the CINEG equating design for the 3PL IRT model.
Generally, the results from the simulated and real data suggest that this multiple-imputation
based approach performed as effectively as the bootstrap method and the delta method in deter-
mining the variability of the IRT true score equating coefficients. The application of the multiple
imputation method has significant practical implications for testing researchers and practitioners, facilitating the reporting of standard errors of equating in practice, particularly in circumstances involving complex linking plans (e.g., chain equating; Battauz, 2013; Kolen &
Brennan, 2004) and complicated IRT models and equating methods (Zhang & Zhao, 2019).

Acknowledgments
The authors would like to thank the editor and the anonymous reviewers for their helpful and constructive comments.

References
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (2014). Standards for educational and psychological testing. American Educational
Research Association.
Andersson, B. (2018). Asymptotic variance of linking coefficient estimators for polytomous IRT models. Applied
Psychological Measurement, 42(3), 192–205. doi:10.1177/0146621617721249
Andersson, B., & Wiberg, M. (2017). Item response theory observed-score kernel equating. Psychometrika, 82(1),
48–66. doi:10.1007/s11336-016-9528-7
Baker, F. B., & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients.
Journal of Educational Measurement, 28(2), 147–162. doi:10.1111/j.1745-3984.1991.tb00350.x
Battauz, M. (2013). IRT test equating in complex linkage plans. Psychometrika, 78(3), 464–480. doi:10.1007/s11336-
012-9316-y
Battauz, M. (2015a). Factors affecting the variability of IRT equating coefficients. Statistica Neerlandica, 69(2),
85–101. doi:10.1111/stan.12048
Battauz, M. (2015b). equateIRT: An R package for IRT test equating. Journal of Statistical Software, 68(7), 1–22.
doi:10.18637/jss.v068.i07
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord &
M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–472). Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an
EM algorithm. Psychometrika, 46(4), 443–459. doi:10.1007/BF02293801
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika,
35(2), 179–197. doi:10.1007/BF02291262
Cai, L. (2015). FlexMIRT 3.0: Flexible multilevel multidimensional item analysis and test scoring [Computer soft-
ware]. Vector Psychometric Group, LLC.
Cook, L. L., & Eignor, D. R. (1991). IRT equating methods. Educational Measurement: Issues and Practice, 10(3),
37–45. doi:10.1111/j.1745-3992.1991.tb00207.x
De Ayala, R. J. (2013). The theory and practice of item response theory. Guilford Publications.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall.
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological
Research, 22(3), 144–149. doi:10.4992/psycholres1954.22.144
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.
Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using
separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement,
26(1), 3–24. doi:10.1177/0146621602026001001
Kaskowitz, G. S., & De Ayala, R. J. (2001). The effect of error in item parameter estimates on the test response
function method of linking. Applied Psychological Measurement, 25(1), 39–52. doi:10.1177/01466216010251003
Kim, D.-I., Brennan, R., & Kolen, M. (2005). A comparison of IRT equating and beta 4 equating. Journal of
Educational Measurement, 42(1), 77–99. doi:10.1111/j.0022-0655.2005.00005.x
Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational
Measurement, 43(4), 355–381. doi:10.1111/j.1745-3984.2006.00021.x
Kim, S. H., & Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response the-
ory. Applied Psychological Measurement, 22(2), 131–143. doi:10.1177/01466216980222003
Kim, S. H., & Cohen, A. S. (2002). A comparison of linking and concurrent calibration under the graded response
model. Applied Psychological Measurement, 26(1), 25–41. doi:10.1177/0146621602026001002
Kim, S., & Lee, W.-C. (2006). An extension of four IRT linking methods for mixed-format tests. Journal
of Educational Measurement, 43(1), 53–76. doi:10.1111/j.1745-3984.2006.00004.x
Kolen, M. J. (1985). Standard errors of Tucker equating. Applied Psychological Measurement, 9(2), 209–223. doi:10.
1177/014662168500900209
Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking: methods and practices (2nd ed.).
Springer-Verlag.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational
Measurement, 17(3), 179–193. doi:10.1111/j.1745-3984.1980.tb00825.x
Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of
Educational Measurement, 14(2), 139–160. doi:10.1111/j.1745-3984.1977.tb00033.x
Mislevy, R. J., Wingersky, K. M., & Sheehan, K. M. (1994). Dealing with uncertainty about item parameters:
Expected response functions (Research Report No. 94-28). Educational Testing Service.
Mislevy, R. J., & Sheehan, K. M. (1989). Information matrices in latent-variable models. Journal of Educational
Statistics, 14(4), 335–350. doi:10.2307/1164943
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological
Measurement, 16(2), 159–176. doi:10.1177/014662169201600206
Ogasawara, H. (2000). Asymptotic standard errors of IRT equating coefficients using moments. Economic Review
(Otaru University of Commerce), 51, 1–23.
Ogasawara, H. (2001a). Item response theory true score equatings and their standard errors. Journal of Educational
and Behavioral Statistics, 26(1), 31–50. doi:10.3102/10769986026001031
Ogasawara, H. (2001b). Standard errors of item response theory equating/linking by characteristic curve methods.
Applied Psychological Measurement, 25(1), 53–67. doi:10.1177/01466216010251004
Paek, I., & Cai, L. (2014). A comparison of item parameter standard error estimation procedures for unidimensional
and multidimensional item response theory modelling. Educational and Psychological Measurement, 74(1),
58–76. doi:10.1177/0013164413500277
Patton, J. M., Cheng, Y., Yuan, K. H., & Diao, Q. (2014). Bootstrap standard errors for maximum likelihood ability
estimates when item parameters are unknown. Educational and Psychological Measurement, 74(4), 697–712. doi:
10.1177/0013164413511083
R Development Core Team. (2016). R: A language and environment for statistical computing. R Foundation for
Statistical Computing.
Raju, N. S., Fortmann-Johnson, K. A., Kim, W., Morris, S. B., Nering, M. L., & Oshima, T. C. (2009). The item
parameter replication method for detecting differential functioning in the polytomous DFIT framework. Applied
Psychological Measurement, 33(2), 133–147. doi:10.1177/0146621608319514
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. (Psychometrika
Monograph No. 17). Psychometric Society. doi:10.1007/BF03372160
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological
Measurement, 7(2), 201–210. doi:10.1177/014662168300700208
Thissen, D., & Wainer, H. (1990). Confidence envelopes for item response theory. Journal of Educational Statistics,
15(2), 113–128. doi:10.2307/1164765
Tsai, T. H., Hanson, B. A., Kolen, M. J., & Forsyth, A. R. A. (2001). A comparison of bootstrap standard errors of
IRT equating methods for the common-item nonequivalent groups design. Applied Measurement in Education,
14(1), 17–30. doi:10.1207/S15324818AME1401_03
Wong, C. C. (2015). Asymptotic standard errors for item response theory true score equating of polytomous items.
Journal of Educational Measurement, 52, 106–120.
Yang, J. S., Hansen, M., & Cai, L. (2012). Characterizing sources of uncertainty in item response theory scale
scores. Educational and Psychological Measurement, 72(2), 264–290. doi:10.1177/0013164411410056
Zhang, Z. H. (2019). Standard errors of IRT parameter scale transformation coefficients: Comparison of bootstrap method, delta method, and multiple imputation method. Journal of Educational Measurement, 56(2),
302–330.
Zhang, Z. H., & Zhao, M. R. (2019). Asymptotic standard errors of polytomous GPCM true score equating by
response function equating methods. Paper presented at the National Council on Measurement in Education
Annual Meeting, Toronto, Canada.