Evaluating Training Programs: Development and Correlates of the Questionnaire for Professional Training Evaluation
Article in International Journal of Training and Development · June 2013

DOI: 10.1111/ijtd.12005
International Journal of Training and Development 17:2

ISSN 1360-3736
Anna Grohmann and Simone Kauffeld
Psychometrically sound evaluation measures are vital for examining the contribution of
professional training to organizational success in a reliable manner. As training evaluations
tend to be both time-consuming and labor-intensive, there is an
increasing demand for economic evaluation inventories. Simul-
taneously, evaluation measures have to meet psychometric
standards. The present paper develops a time-efficient training
evaluation questionnaire that (1) has psychometrically sound
properties; (2) measures more than the participants’ reactions;
and (3) is widely applicable across different training contents,
thus allowing for comparisons of training programs within
and between organizations. The Questionnaire for Professional
Training Evaluation is primarily developed for use in practice
but is also applicable to field research and covers short-term as
well as long-term training outcomes. Analyses based on a total
of n = 1134 employees show the stability of the factor structure
and hint at the questionnaire’s differential and discriminant
validity. Theoretical and practical implications are discussed.
Anna Grohmann, Research Associate, Department of Industrial/Organizational and Social
Psychology, TU Braunschweig, Germany. Email: a.grohmann@tu-braunschweig.de. Simone Kauffeld,
Professor, Department of Industrial/Organizational and Social Psychology, TU Braunschweig,
Germany. Email: s.kauffeld@tu-braunschweig.de
An earlier version of this article was presented at the 2009 conference of the Work, Organizational, and
Business Psychology section of the German Psychological Society at Vienna, Austria. We would like to
thank Anja Heine for her efforts collecting data for study 1, Jessica Nitschke for her efforts collecting
data for study 2, and Wiebke Brune for her efforts collecting data for study 3. We would also like to
thank the editor and two anonymous reviewers for their helpful comments on an earlier draft of this
article.
© 2013 Blackwell Publishing Ltd.
Introduction
Professional training is costly for contemporary organizations (e.g. Grossman & Salas,
2011). In 2010, for example, US organizations invested a total of approximately
US$171.5 billion in human resource development (HRD) and professional training
courses (Green & McGill, 2011). However, researchers and practitioners agree that
these investments are necessary for attracting and retaining qualified employees, for
keeping up with modern technological requirements and for gaining competitive
advantages (Aguinis & Kraiger, 2009; Martin, 2010; Reed, 2001, p. 59). Professional
training is critical to organizational success (Giangreco et al., 2010; Grossman & Salas,
2011), and it is associated with organizational performance and innovation (Aguinis &
Kraiger, 2009). For employees, participation in high-quality training is expected to
enhance opportunities for advancement, skill development and professional growth
(Combs et al., 2006). For trainers, effective training can showcase their work and serve
for marketing purposes (cf. Kraiger, 2002).
In times of financial crisis, on the other hand, organizations often tend to cut or reduce
training budgets because ‘training employees in new skills and competences is seen as
an unaffordable luxury’ (Roche et al., 2011, p. 47). Taking into account the costs as well as
the potential advantages of training and HRD, there is still uncertainty in organizations
about the actual benefits of professional training (Blume et al., 2010). Organizations want
to know if the training benefits justify the financial investments made and if the
knowledge and skills acquired in a training course are indeed used at work (Kauffeld
et al., 2008). There is a large gap between learning outcomes and the actual training
transfer, i.e. the degree to which training contents are applied to practice (Aguinis &
Kraiger, 2009; Broad, 1997; Grossman & Salas, 2011). The majority of training contents is
not transferred to the job (for an overview, see Grossman & Salas, 2011). However, a lack
of training transfer can result in high financial costs and can be highly time-consuming
(e.g. Laker & Powell, 2011). To identify promising trainings with, for example, a high
amount of transfer, it is vital for organizations and human resource professionals to
evaluate and document training benefits (Aguinis & Kraiger, 2009).
In many organizations, training evaluation is based solely on the participants’ reac-
tion assessed immediately after a training course (Nickols, 2005). However, for deter-
mining the actual training benefits it is important to evaluate not only short-term
outcomes (e.g. reactions at the end of the training), but also participants’ long-term
outcomes back at work (e.g. transfer to practice; Wang & Wilcox, 2006). Moreover, a
recent German Delphi study on the future of HRD has pointed out that the pressure on
HRD departments will increase further (Schermuly et al., 2012). HRD professionals will
have to evaluate the benefits of trainings systematically because the aims and benefits
of training programs will continuously be questioned in future (Schermuly et al., 2012).
For practitioners, evaluation measures need to be feasible and meet practical
demands (e.g. Giangreco et al., 2010). In terms of usability, questionnaires have to
be well accepted by respondents and easily applicable to a wide variety of training
courses. In addition, time-efficient and economic training evaluation measures are of
growing significance in today’s fast-moving business environment (Aguinis & Kraiger,
2009). For researchers, on the other hand, economic evaluation measures are important
for examining comprehensive models on training transfer (e.g. Holton, 2005), in which
training benefits are only a minor part of the investigation. Beyond the requirement for
short measures, researchers have to establish the psychometric properties of a ques-
tionnaire (e.g. Aiken & Groth-Marnat, 2006). Although theoretical frameworks for
training evaluation are numerous, the development and psychometric investigation of
new evaluation inventories is scarce (Aguinis & Kraiger, 2009).
The present paper contributes new insights for economic training evaluation. We
address the following requirements: the evaluation questionnaire has to (1) cover not
only short-term, but also long-term training outcomes (e.g. Wang & Wilcox, 2006); (2)
meet practical demands such as time efficiency (Aguinis & Kraiger, 2009); and (3) show
sound psychometric properties (cf. Aguinis & Kraiger, 2009; see also Aiken & Groth-
Marnat, 2006). First, we develop the Questionnaire for Professional Training Evaluation
(Q4TE) as a tool well-adapted to practical needs that covers short-term (e.g. reactions)
and long-term training outcomes (e.g. transfer; Wang & Wilcox, 2006). Second, we
design the Q4TE in an economic, time-efficient manner which ensures the organiza-
tions’, participants’ as well as trainers’ acceptance of standard evaluation procedures
and offers the possibility of integrating the Q4TE in larger surveys. In doing so, we
strive for a general wording nonspecific to the training content, thus permitting
applicability across different training contents and settings. Finally, we examine the
questionnaire’s reliability and underlying factor structure based on the data of three
online studies. Moreover, our analyses provide first hints at the differential and discri-
minant validity of the Q4TE.
Benefits of training evaluation

For organizations, there are several arguments for training evaluation. It can justify the
financial input made, serve for quality management purposes, provide feedback to
human resource departments and trainers for improving training courses, and help to
make more accurate decisions about the continuation of training courses (Kaufman
et al., 1996; Kirkpatrick & Kirkpatrick, 2006, p. 17; Kraiger, 2002). Moreover, training
evaluation results can be used as a marketing tool for human resource departments and
training institutes to attract potential job candidates and retain qualified employees
in an increasingly competitive job market (Kraiger, 2002). Organizations and field
researchers facing the benefits of training evaluation and striving for evaluating train-
ing programs are challenged to meet the following requirements for evaluation meas-
ures. Questionnaires have to show a high usability in terms of time efficiency (Aguinis
& Kraiger, 2009). In organizations, broader evaluation issues, such as exploring
determinants of successful training transfer, often have to be addressed in order to optimize
future training courses (e.g. Bates, 2004). Furthermore, training courses often have to be
evaluated with standard evaluation measures in addition to training-specific measures
to determine reference values for different trainings. Both of these situations can
result in lengthy surveys. Therefore, it is all the more important to use short and
concise standard evaluation measures to increase the respondents’ acceptance of these
surveys. In addition to high usability, training evaluation demands psychometrically
sound and theoretically grounded measures (Aguinis & Kraiger, 2009; Pershing &
Pershing, 2001).
Models of training evaluation

There exist several theoretical models and frameworks for summative training evalu-
ation (for an overview, see Aguinis & Kraiger, 2009; Salas & Cannon-Bowers, 2001),
which aims at making ‘judgments about a program’s effectiveness or worth’ (Kraiger,
2002, p. 336). For example, Wang and Wilcox (2006) distinguish between short- and
long-term evaluation. The former aims at measuring the learner’s reactions (e.g. with an
attitudinal questionnaire) and learning outcomes (e.g. knowledge tests; Wang &
Wilcox, 2006). The latter refers to the assessment of behavior change (e.g. behavior
ratings) as well as to organizational results (e.g. ratings of service quality) and return on
investment (Wang & Wilcox, 2006).
Wang and Wilcox (2006) draw upon Kirkpatrick’s four-level framework (see
Kirkpatrick, 1967; Kirkpatrick & Kirkpatrick, 2006). The Kirkpatrick scheme is a very
popular and widely applied framework for summative evaluation (Blau et al., 2012),
which also can be found in many other evaluation approaches (e.g. Alliger et al., 1997;
Nickols, 2005; for an overview, see Alvarez et al., 2004). Within Kirkpatrick’s
framework, the following four levels are distinguished: (1) reaction, i.e. participants’
emotional reactions to the training; (2) learning, i.e. acquisition of methodological,
procedural and expert knowledge as well as attitude change through training;
(3) behavior, i.e. application of training contents (e.g. methodologies) at work; and
(4) results, in terms of a training’s organizational impact, for example concerning time
and costs (Kirkpatrick, 1967; Kirkpatrick & Kirkpatrick, 2006). All four levels are impor-
tant for training evaluation (Kirkpatrick & Kirkpatrick, 2006): Organizations can use the
reaction level as an indicator of customer satisfaction, and the learning level is assumed
to be a requirement for behavior change. Behavior level results demonstrate how
training contents are actually applied to the job and thereby whether they are organizationally
usable. Finally, the results level shows how the training contributes to organizational
success (Kirkpatrick & Kirkpatrick, 2006). Ideally, all four levels should be evaluated
with independent measures, which, however, becomes more time-consuming with
increasing levels, i.e. from level 1 to level 4 (Kirkpatrick & Kirkpatrick, 2006, p. 21).
Although it is implicitly assumed that each level influences the next (Kirkpatrick &
Kirkpatrick, 2006, p. 21), empirical studies have shown varying interrelationships
between the four levels ranging from close to zero to moderate values (Alliger et al.,
1997). For example, affective reactions are less predictive of learning or training
transfer than utility reactions (Alliger et al., 1997). Notwithstanding various criticisms
(e.g. about the assumption of growing significance from the first to the last level,
the oversimplification of relevant influences on training success and the unclear
cause–effect relations between the four levels; Bates, 2004), Kirkpatrick’s model
remains an important evaluation framework, which is applied worldwide in practice as
well as in research (e.g. Alvarez et al., 2004; Nickols, 2005; Salas & Cannon-Bowers,
2001).
Researchers have to fill in concrete measures within Wang and Wilcox’s (2006) or
Kirkpatrick’s evaluation scheme because both are frameworks and not diagnostic tools
(cf. Nickols, 2005; Wang & Wilcox, 2006). For example, the learning level can be
assessed with knowledge tests and the behavior level with direct behavioral observa-
tions (e.g. Wang & Wilcox, 2006). This has to be realized in the context of a specific
training; for example, a knowledge test has to cover the specific content learned in a
training course. However, as Grossman and Salas (2011) have pointed out for transfer-
related factors, ‘organizations cannot feasibly incorporate every factor that has
been linked to transfer into their training programs’ (p. 117). To account for practical
demands, organizations and researchers have to focus on a limited number of evalua-
tion aspects (Grossman & Salas, 2011). A similar picture emerges when considering
summative evaluation. Although practitioners are aware of the need for evaluation, due
to time constraints they are faced with the difficulty of deciding which of the Kirk-
patrick levels is the most important for the respective training. Overall, many organi-
zations are deterred from conducting training-specific evaluations because they often
do not have enough time or professional resources for developing psychometrically
sound evaluation measures for each specific training course and each single level of
Kirkpatrick’s framework (cf. Aguinis & Kraiger, 2009). Moreover, there is no time
to adapt evaluation questionnaires regularly to the respective training purpose. Thus,
oftentimes only the reaction level is measured (Nickols, 2005) because reaction data are
collected most easily (Alliger et al., 1997). However, as pointed out, not only the reac-
tion level but all four levels of Kirkpatrick’s model may be vital to evaluate training
benefits (Kirkpatrick & Kirkpatrick, 2006).
Training evaluation surveys

Training evaluation surveys offer an economic way to gather information on specific
training outcomes (e.g. reaction or learning) because they can easily be applied to a
large group of respondents (Stoughton et al., 2011). Many evaluation surveys focus on
the reaction level (Arthur et al., 2003; Brown, 2005). For example, Tracey et al. (2001)
successfully examined a model that divides trainee reactions into perceived training
utility and affective reactions, using a 15-item measure by Mathieu et al. (1992). Morgan
and Casper (2000) examined the underlying factor structure of reaction items in dif-
ferent training courses and found six latent factors (e.g. satisfaction with the instructor
and perceived training utility). Lim and Morris (2006) measured the first three Kirk-
patrick levels using questionnaires. Among others, they applied a training satisfaction
survey consisting of 10 items, a training-specific learning measure covering the respec-
tive learning objectives with 13 items and a training-specific transfer survey assessing
the perceived applicability of the learning objectives with 13 items. Based on two
different previous transfer measures, Devos et al. (2007) assessed transfer directly, i.e.
the application of the training to the job with three items, and transfer indirectly, i.e. the
consequences of applying the training to the job with four items. Overall, the impact of
Kirkpatrick’s organizational level seems to be most difficult for trainees to assess (Wang
& Wilcox, 2006). To our knowledge, there is no questionnaire that covers all four levels
of Kirkpatrick’s evaluation framework in a time-efficient manner while being applica-
ble to a wide variety of training contents and psychometrically examined.

Figure 1: Scales of the Q4TE (framework following Wang & Wilcox, 2006; Kirkpatrick &
Kirkpatrick, 2006). Short-term evaluation comprises reaction (SAT: Satisfaction; UT: Utility)
and learning (KNOW: Knowledge). Long-term evaluation comprises behavior (APP: Application
to practice) and organizational results (I-OR: Individual organizational results; G-OR:
Global organizational results).
In the present paper, we address this issue by developing the Q4TE, a time-efficient
and widely applicable self-report measure especially for practitioners. The Q4TE covers
all four levels of Kirkpatrick’s evaluation framework (see Figure 1). Level 1, reaction, is
assumed to be multidimensional and often divided into affective responses and utility
judgments (Alliger et al., 1997; Tracey et al., 2001). Therefore, the first level of the Q4TE
is divided into global satisfaction with the training and perceived training utility. Level
2, learning, refers to the skills and knowledge acquired in a training (e.g. Wang &
Wilcox, 2006). In the Q4TE, we focus on knowledge, which refers to participants’
perceived knowledge acquisition. Level 3, behavior, refers to changes in behavior as a
consequence of training participation (Kirkpatrick & Kirkpatrick, 2006, p. 22). In the
Q4TE, we measure application to practice, which refers to the extent to which the
training contents are applied at work (Aguinis & Kraiger, 2009). Level 4, organizational
results, is kept rather unspecified in the Kirkpatrick model (Alliger et al., 1997). To
clarify this, the Q4TE takes into account that there are three main aspects relevant for
evaluating organizational results: qualitative, temporal and financial impact of training
participation (Wang & Wilcox, 2006). As the costs or financial impact is difficult to assess
with self-report items, level 4 of the Q4TE aims at covering especially the qualitative,
but also the temporal impact. In line with the multifoci perspective (e.g. concerning
organizational citizenship behavior; Lavelle et al., 2007), the Q4TE differentiates between
individual and global organizational results. We thereby account for the fact that
training may have an effect on the organization, which is, in turn, reflected on the single
employee (individual organizational results) and on the whole organization (global
organizational results). It is important to note that the Q4TE scales knowledge, appli-
cation to practice, individual, and global organizational results measure the perceived
training benefits. For simplification purposes, however, they are henceforth referred to
without this specification.
Current research questions

The present paper aims at examining the underlying factor structure, the differential
and the discriminant validity of the Q4TE, which differentiates between the six
scales satisfaction, utility, knowledge, application to practice, individual organizational
results and global organizational results.
Underlying factor structure

Meta-analytic findings by Alliger et al. (1997) show the highest intercorrelations
between different scales (e.g. affective and utility reactions) of the same Kirkpatrick
level. However, they also found differential effects for different scales of the same
level. For example, utility reactions showed stronger relationships with transfer
than affective reactions (Alliger et al., 1997). We therefore treat the individual Q4TE
scales (satisfaction, utility, knowledge, application to practice, individual organiza-
tional results and global organizational results; see also Figure 1) separately during the
scale development process. Moreover, the Q4TE covers the participants’ reaction,
learning, transfer and organizational outcomes (Kirkpatrick & Kirkpatrick, 2006), and
addresses both short- and long-term evaluation of training courses (Wang & Wilcox,
2006). Drawing upon different theoretical frameworks (Alliger et al., 1997; Kirkpatrick
& Kirkpatrick, 2006; Wang & Wilcox, 2006), our first research question is to explore
whether the six first-order factors (satisfaction, utility, knowledge, application to prac-
tice, individual organizational results and global organizational results) can be grouped
into two second-order factors (short- and long-term evaluation).
Differential and discriminant validity

Transfer of training to the job is a crucial variable in training evaluation because it
indicates if training contents are indeed applied to practice (Aguinis & Kraiger, 2009;
Saks & Burke, 2012). Concerning the differential validity of the Q4TE, our second
research question is to explore differences in Q4TE scales between participants who
report having managed to transfer training contents and participants who report not
having done so. Concerning the discriminant validity of the Q4TE, our third research
question is to explore the relationship between the Q4TE scales and transfer quantity,
i.e. the number of training contents applied at work (Kauffeld et al., 2008, 2009;
Kauffeld & Lehmann-Willenbrock, 2010).
Study overview
To investigate the psychometric properties of the Q4TE, we use three studies with a
total of n = 1134 employees. In study 1, the Q4TE is developed. In study 2, we address
the first research question and examine the underlying factor structure of the Q4TE.
Finally, in study 3, we address the second and third research questions and explore the
relationship between the Q4TE scales and transfer to practice.
Study 1 – questionnaire development

Method
Procedure and participants. A final sample of n = 408 employees from different branches
(e.g. information technology, service sector and automobile industry) was recruited
with an online survey. We included only employees who stated that they had answered
the survey seriously. Moreover, we included only those professional training courses
that dated back between 4 weeks and 2 years. A minimum time lag of 4 weeks was used
to permit knowledge transfer into practice (e.g. May & Kahnweiler, 2000). Moreover, a
6-year study on memory of daily events revealed that less than 1 percent of the events
were forgotten during the first year, whereas thereafter this rate increased by around
5 percent to 6 percent annually (Linton, 1982). Assuming an error rate of about 5
percent, considering trainings that date back at most 2 years seemed appropriate for
our study. The final data set covered a diverse set of contents (e.g. foreign language
courses and information technology trainings). Using a 3-point answering scale,
respondents could specify the type of training content ranging from closed to open
skills (see Blume et al., 2010). The majority of training courses focused on closed (39.7
percent) and open skills (36.3 percent), respectively, whereas 24.0 percent dealt with
both open and closed skills. Participants’ age ranged from 21 to 61 years (36.4 years on
average; 2.2 percent not specified). The sample had a balanced gender distribution (54.7
percent male, 44.1 percent female, 1.2 percent not specified). Average organizational
tenure was 6.2 years (6.4 percent not specified).

Measures. The Q4TE is an advanced version of the initial Q4TE form by Kauffeld
et al. (2009), which had not yet shown optimal psychometric properties in terms of
confirmatory factor analysis (CFA) results. To create the final Q4TE, we refined the
theoretical structure of the initial Q4TE and adapted some items (Kauffeld et al.,
2009). Other items were generated in our work group. Moreover, the final Q4TE
also comprised single items from literature, which were adapted and used to enrich
the item pool (see Table 1 for references for the final Q4TE). The resulting question-
naire consisted of 36 items, which can be assigned to six scales: satisfaction, utility,
knowledge, application to practice, individual organizational results and global
organizational results (see Figure 1). Usually, respondents show positive training
reactions (e.g. high satisfaction), which leads to low variance in reaction measures
(Alliger & Janak, 1989). Therefore, to increase the variance in responses, we used an
11-point response scale ranging from 0 percent = completely disagree (coded as 0) to
100 percent = completely agree (coded as 10) in steps of 10 percentage points (each step
corresponding to one scale point).
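The coding of the response scale can be illustrated with a minimal sketch. The function names and the example answers are hypothetical, and averaging the item codes into a scale score is an assumption made for illustration only; the text above specifies only the mapping from percent agreement to the codes 0 to 10.

```python
def percent_to_code(percent):
    """Map an answer on the 0-100 percent agreement scale to its 0-10 code."""
    if percent not in range(0, 101, 10):
        raise ValueError(f"answer must be a multiple of 10 between 0 and 100, got {percent}")
    return percent // 10


def scale_score(item_codes):
    """Mean of the item codes of one scale (assumed scoring rule, not from the paper)."""
    return sum(item_codes) / len(item_codes)


# Hypothetical answers of one respondent to the items of one scale.
answers = [70, 90]
codes = [percent_to_code(a) for a in answers]
print(codes, scale_score(codes))
```

Answers that are not multiples of 10 are rejected rather than rounded, so that coding errors in the raw data surface early.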
Preliminary analysis. Multi-item questionnaires are expensive while hardly offering
more information than do single- or two-item scales (Drolet & Morrison, 2001).
Questionnaires with only two items per dimension have shown promising psy-
chometric properties, as successfully demonstrated in personality research (e.g.
Rammstedt, 2007; Rammstedt & John, 2007). Thus, the aim of the present data analy-
ses was to develop a questionnaire with only two items per scale to keep the Q4TE
cost- and time-efficient. For item selection, it is important to consider statistical
properties (e.g. item frequency distribution) as well as nonstatistical properties
such as inspection of item content (Kline, 1986). In a first step, an item analysis
was conducted on all 36 items in the entire sample 1 (n = 408). A total of 14
items was deleted due to statistical (frequency distributions, item homogeneity and
difficulties) and nonstatistical properties (e.g. overlapping item content). Within
the scales individual and global organizational results, almost all of the remaining
items showed bi- or multimodal frequency distributions. From a content per-
spective, all remaining items of both scales were kept for further analyses. The
total of 22 remaining items showed skewness between -1.28 and 0.50 and
kurtosis between -1.28 and 1.01, which indicates no severe violation of the normality
assumption.
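As an illustration of the distribution check, the following sketch computes skewness and excess kurtosis for one item. It uses the simple moment-based estimators, which is an assumption: statistical packages typically report bias-corrected variants, so values for small samples will differ slightly. The function name and the example responses are hypothetical.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0           # 0 for a normal distribution
    return skew, excess_kurtosis


# Hypothetical item responses: every code from 0 to 10 chosen exactly once.
responses = list(range(11))
skew, kurt = skewness_and_excess_kurtosis(responses)
print(round(skew, 2), round(kurt, 2))
```

A perfectly flat distribution such as this one is symmetric (skewness of zero) but has negative excess kurtosis, which shows why both statistics are inspected.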
Data analysis. To investigate the stability of the reduced form, sample 1 (n = 408) was
randomly split into two subsamples with a ratio of about 60:40 by means of the
Predictive Analytics SoftWare (PASW) random case selection procedure. Subsample A
(n = 251) was used for item reduction and subsample B (n = 157) was used for the
investigation of stability. The ratio of 60:40 was chosen to account for the fact that
subsample B was to serve for examining a model with smaller complexity. First, to
explore the underlying factor structure that best represents the present data, an
exploratory factor analysis (EFA) was performed on subsample A. Second, we con-
ducted a CFA on subsample A, taking into account the EFA results and considering
theoretical assumptions (distinguishing between single Q4TE scales, cf. Alliger et al.,
1997). By means of CFA we reduced the number of items per scale further in order to
get a final Q4TE form with two items per scale. Item selection was based on statistical
(e.g. factor loadings and modification indices) and nonstatistical properties (item
wording; cf. Kline, 1986). Third, the resulting Q4TE form had to be reexamined via CFA
on subsample B to assess its stability.
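The random split into two subsamples can be sketched as follows. This is not the PASW procedure used in the study; the function name, the seed and the exact 60 percent cut are assumptions. An exact cut of 408 cases yields 245 and 163 cases, whereas the study reports 251 and 157, which is in line with a split of only about 60:40.

```python
import random


def split_sample(case_ids, first_share=0.6, seed=None):
    """Randomly split case identifiers into two disjoint subsamples."""
    rng = random.Random(seed)      # a seeded generator makes the split reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * first_share)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical case identifiers 0..407 for a sample of n = 408.
subsample_a, subsample_b = split_sample(range(408), first_share=0.6, seed=1)
print(len(subsample_a), len(subsample_b))
```

Fixing the seed is a design choice that allows the item reduction on the first subsample and the stability check on the second to be repeated on exactly the same cases.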
EFA was conducted with PASW 18 (SPSS Inc., Chicago, IL), and CFA was con-
ducted with Mplus 6 (Muthén & Muthén, 1998–2010). As model evaluation should
be based on multiple criteria (Byrne, 2005), we used the ratio of χ² to degrees of
freedom (d.f.), the root mean square error of approximation (RMSEA), the comparative fit
index (CFI) and the standardized root mean square residual (SRMR; Schweizer, 2010; see
Schermelleh-Engel et al., 2003 for cutoff values). For all CFA analyses, we applied a
maximum likelihood estimator robust to non-normally distributed data (MLR; Muthén & Muthén,
1998–2010).
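Evaluating a model against multiple fit criteria can be sketched as below. The cutoff values are assumptions: they are common rule-of-thumb thresholds and are not stated in the text above, so they should be checked against the cited source before use. The function names and the example fit values are hypothetical.

```python
# Assumed rule-of-thumb cutoffs: index -> (good, acceptable, higher_is_better).
CUTOFFS = {
    "chi2_df": (2.0, 3.0, False),
    "rmsea": (0.05, 0.08, False),
    "cfi": (0.97, 0.95, True),
    "srmr": (0.05, 0.10, False),
}


def rate_index(name, value):
    """Classify one fit index as 'good', 'acceptable' or 'poor'."""
    good, acceptable, higher_is_better = CUTOFFS[name]
    if higher_is_better:
        if value >= good:
            return "good"
        return "acceptable" if value >= acceptable else "poor"
    if value <= good:
        return "good"
    return "acceptable" if value <= acceptable else "poor"


def rate_model(chi2, df, rmsea, cfi, srmr):
    """Rate all four criteria; chi2/df is computed from the two inputs."""
    indices = {"chi2_df": chi2 / df, "rmsea": rmsea, "cfi": cfi, "srmr": srmr}
    return {name: rate_index(name, value) for name, value in indices.items()}


# Hypothetical fit values, not taken from the study.
example = rate_model(chi2=30.0, df=20, rmsea=0.06, cfi=0.96, srmr=0.04)
print(example)
```

Reporting each index separately, rather than a single verdict, mirrors the recommendation to base model evaluation on multiple criteria.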

Table 1: Q4TE items and English translation

| Level | Scale (a) | Item wording (German) | English translation | References |
|---|---|---|---|---|
| Reaction | Satisfaction | Ich werde das Training in guter Erinnerung behalten. | I will keep the training in good memory. | Adaptation from Bihler (2006, p. 200) |
| Reaction | Satisfaction | Das Training hat mir sehr viel Spaß gemacht. | I enjoyed the training very much. | Additional item developed in our work group (see also Brown, 2005) |
| Reaction | Utility | Das Training bringt mir für meine Arbeit sehr viel. (b) | The training is very beneficial to my work. (b) | Additional item developed in our work group (see also Mathieu et al., 1992) |
| Reaction | Utility | Die Teilnahme am Training ist äußerst nützlich für meine Arbeit. | Participation in this kind of training is very useful for my job. | Adaptation of the initial Q4TE form by Kauffeld et al. (2009) |
| Learning | Knowledge | Ich weiß jetzt viel mehr als vorher über die Trainingsinhalte. | After the training, I know substantially more about the training contents than before. | Adaptation from Deschler et al. (2005, p. 34) |
| Learning | Knowledge | In dem Training habe ich sehr viel Neues gelernt. | I learned a lot of new things in the training. | Following the initial Q4TE form by Kauffeld et al. (2009) |
| Behavior | Application to practice | Die im Training erworbenen Kenntnisse nutze ich häufig in meiner täglichen Arbeit. | In my everyday work, I often use the knowledge I gained in the training. | Adaptation of the initial Q4TE form by Kauffeld et al. (2009) |
| Behavior | Application to practice | Es gelingt mir sehr gut, die erlernten Trainingsinhalte in meiner täglichen Arbeit anzuwenden. | I successfully manage to apply the training contents in my everyday work. | Following Gnefkow (2008, p. 263) |
| Organizational results | Individual | Seit dem Training bin ich mit meiner Arbeit zufriedener. | Since the training, I have been more content with my work. | Additional item developed in our work group (see also Ironson et al., 1989) |
| Organizational results | Individual | Durch die Anwendung der Trainingsinhalte hat sich meine Arbeitsleistung verbessert. | My job performance has improved through the application of the training contents. | Following Ong et al. (2004) |
| Organizational results | Global | Durch die Anwendung der Trainingsinhalte konnten Arbeitsabläufe im Unternehmen vereinfacht werden. | Overall, it seems to me that the application of the training contents has facilitated the work flow in my company. | Adaptation of the initial Q4TE form by Kauffeld et al. (2009) |
| Organizational results | Global | Durch das Training hat sich das Unternehmensklima verbessert. | Overall, it seems to me that the organizational climate has improved due to the training. | Adaptation of the initial Q4TE form by Kauffeld et al. (2009) |

Note: Adaptation means a maximum of three words of the original item wording was changed (e.g. to
adapt the item to the training field). Following means item content of the original item was used as basis
for item development.
(a) If required for research purposes, researchers can additionally use the scale self-efficacy (not depicted
here). This scale was part of the initial Q4TE and contains two items which are adapted from Schyns and
von Collani (2002).
(b) The tense of this item was adapted for our retrospective study.

Results and discussion
Underlying factor structure. In a first step, we examined the underlying factor structure
in subsample A (n = 251) by computing an EFA with principal axis factoring and
oblique rotation (direct oblimin). Both Bartlett’s test of sphericity and the Kaiser–Meyer–Olkin
criterion indicated sampling adequacy. The number of eigenvalues greater than 1
and an examination of the scree plot suggested that two factors were interpretable. The
two-factor solution accounted for a total of 74.61 percent of the variance.
As indicated by the pattern matrix, all items showed significant loadings (> 0.50) on
one factor and no substantial double-loadings (> 0.40) on the other factor except for two
items, which were excluded from further analyses. Investigation of the pattern matrix
showed that factor 1 can be interpreted as short-term evaluation and factor 2 as long-
term evaluation. This result indicates that the distinction between short- and long-term
evaluation (Wang & Wilcox, 2006) is appropriate.
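
The eigenvalue-greater-than-1 rule mentioned above can be illustrated with a minimal sketch. The correlation matrix is invented for illustration (two blocks of three items each) and is not the Q4TE data; the function names are hypothetical. The share of variance is taken from the unreduced correlation matrix, so it will generally differ from the value obtained with principal axis factoring.

```python
import numpy as np


def kaiser_criterion(corr):
    """Return (number of eigenvalues > 1, share of total variance they carry)."""
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted in descending order
    retained = eigenvalues[eigenvalues > 1.0]
    return len(retained), retained.sum() / eigenvalues.sum()


def block_correlation(n_blocks=2, block_size=3, within=0.7, between=0.2):
    """Hypothetical correlation matrix: items correlate more strongly within a block."""
    n = n_blocks * block_size
    corr = np.full((n, n), between)
    for b in range(n_blocks):
        idx = slice(b * block_size, (b + 1) * block_size)
        corr[idx, idx] = within
    np.fill_diagonal(corr, 1.0)
    return corr


n_factors, share = kaiser_criterion(block_correlation())
print(n_factors, round(share, 2))
```

With this invented matrix, two eigenvalues exceed 1, which mirrors the kind of two-factor pattern described above; a scree plot of the same eigenvalues would show the drop after the second one.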
Scale development. For ML estimation, a ratio of at least five cases per free parameter
estimated is recommended (Bentler & Chou, 1987). By this criterion, neither subsample A
(for a single model based on the remaining 20 items) nor subsample B (for a single model
based on the reduced form) would have been large enough. Due to this
constraint, we specified two separate CFAs (covering short- and long-term outcomes,
respectively) for each subsample (see Figure 2 for subsample A analysis).
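
The cases-per-parameter argument can be made concrete with a short sketch. It rests on assumptions that are not stated in the text: each item loads on exactly one factor, factor variances are fixed to 1 for identification, all factors are allowed to correlate and no mean structure is counted. The function names are hypothetical.

```python
def free_parameters(n_items, n_factors):
    """Free parameters of a simple-structure CFA with correlated factors.

    Assumes one loading and one residual variance per item, factor variances
    fixed to 1 for identification, and no mean structure.
    """
    factor_covariances = n_factors * (n_factors - 1) // 2
    return 2 * n_items + factor_covariances


def degrees_of_freedom(n_items, n_factors):
    """Distinct variances and covariances minus free parameters."""
    return n_items * (n_items + 1) // 2 - free_parameters(n_items, n_factors)


def cases_per_parameter(n_cases, n_items, n_factors):
    return n_cases / free_parameters(n_items, n_factors)


# (label, cases, items, factors)
models = [
    ("single model, 20 items, subsample A", 251, 20, 6),
    ("single model, reduced form, subsample B", 157, 12, 6),
    ("short-term model, 12 items, subsample A", 251, 12, 3),
    ("long-term model, 8 items, subsample A", 251, 8, 3),
]
for label, n, items, factors in models:
    ratio = cases_per_parameter(n, items, factors)
    df = degrees_of_freedom(items, factors)
    print(f"{label}: {ratio:.1f} cases per parameter, d.f. = {df}")
```

Under these assumptions, both single models fall below five cases per parameter, whereas the separate short- and long-term models exceed it. The resulting degrees of freedom for the 12-item and 8-item models (51 and 17) agree with the values listed in Table 2.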
Short-term evaluation. We specified a model with three latent, intercorrelated factors
(satisfaction, utility and knowledge; see Figure 2a). CFA results indicated a lack of
model fit because only the SRMR value was good (see Table 2). In a stepwise procedure,
the full set of 12 items was reduced to six items based on modification indices and
residual variances in combination with inspection of item wording. The reduced model
with two items per scale obtained a good fit to subsample A (see Table 2). Investigation
of the stability of this solution in subsample B also provided good model fit
(see Table 2).
Long-term evaluation. A model with three latent, intercorrelated factors (application to
practice, individual organizational results and global organizational results) was specified
based on subsample A (see Figure 2b). The SRMR value was good and the CFI value was
acceptable, but neither χ²/d.f. nor RMSEA was sufficient. In a stepwise procedure, we
Figure 2: Specified CFA model for (a) short-term evaluation and (b) long-term evaluation in
subsample A (n = 251) of study 1 (error terms are not depicted). Panel (a): SAT measured by
items s1–s5, UT by u1–u3 and KNOW by k1–k4. Panel (b): APP measured by items a1–a3, I-OR by
i1–i3 and G-OR by g1–g2. SAT = satisfaction, UT = utility, KNOW = knowledge,
APP = application to practice, I-OR = individual organizational results,
G-OR = global organizational results.

Table 2: CFA results in studies 1 and 2

| Model | No. of items | χ² | d.f. | χ²/d.f. | RMSEA | CFI | SRMR |
|---|---|---|---|---|---|---|---|
| Study 1 (n = 408) | | | | | | | |
| Subsample A (n = 251): short-term evaluation | | | | | | | |
| 3-factor model (SAT with 5 items, UT with 3 items, KNOW with 4 items) | 12 | 316.84 | 51 | 6.21 | 0.144 | 0.898 | 0.041 |
| […] items) | | | | | | | |
| Subsample A (n = 251): long-term evaluation | | | | | | | |
| 3-factor model (APP with 3 items, I-OR with 3 items, G-OR with 2 items) | 8 | 66.20 | 17 | 3.89 | 0.107 | 0.953 | 0.035 |
| […] items) | | | | | | | |
| Subsample B (n = 157): short-term evaluation | | | | | | | |
| […] items) | | | | | | | |
| Subsample B (n = 157): long-term evaluation | | | | | | | |
| […] items) | | | | | | | |
| Study 2 (n = 287) | | | | | | | |
| Model 1: 6 latent, intercorrelated factors (SAT, UT, KNOW, APP, I-OR, G-OR) | 12 | 100.40 | 39 | 2.57 | 0.074 | 0.966 | 0.030 |
| Model 2: 6 latent first-order factors (SAT, UT, KNOW, APP, I-OR, G-OR) and 2 latent second-order factors (S-TE and L-TE) following Wang and Wilcox (2006) | 12 | 167.51 | 47 | 3.56 | 0.095 | 0.933 | 0.051 |
| Model 3: 4 latent, intercorrelated factors following Kirkpatrick’s (1967) four-level model | 12 | 215.81 | 48 | 4.50 | 0.110 | 0.907 | 0.051 |

SAT = satisfaction, UT = utility, KNOW = knowledge, APP = application to practice, I-OR =
individual organizational results, G-OR = global organizational results, S-TE = short-term
evaluation, L-TE = long-term evaluation.
[Note: Correction added on 4 March after initial online publication on 4 February 2013. CFI data for
Study 1, Subsample A, 3-factor model (SAT with 5 items, UT with 3 items, KNOW with 4 items)
should be 0.898. This has been corrected in this version of the article.]
identified a reduced form with a total of six items based on modification indices and
residual variances in combination with inspection of item wording. The reduced model
with two items per scale obtained a good fit to subsample A (see Table 2). Investigation
of the stability of this solution in subsample B also provided good model fit (see Table 2).
We successfully identified a Q4TE form with two items per scale, which makes
time-efficient training evaluations possible. However, the present results still had to be
cross-validated and examined in one CFA model comprising both short- and long-term
evaluation. This was realized in study 2.

Study 2 – cross-validation of the underlying factor structure
Method
Procedure and participants. Using an online survey, a final sample of n = 287 employees
from different branches (e.g. public sector, education system and health care) was
recruited. In the final sample, we included only those employees who stated that they
had answered seriously. Moreover, we included only training courses that dated back
between 4 weeks and 2 years (see study 1). The final data set covered a large variety of
training contents. The majority of trainings focused on closed skills (51.9 percent).
About one third (30.0 percent) dealt with both open and closed skills, whereas the
remaining 18.1 percent focused on open skills. Respondents’ age ranged from 19 to 75
years with an average of 34.1 years (1.0 percent not specified). About one third of the
participants were male (32.8 percent), and two thirds were female (66.9 percent, 0.3
percent not specified). Average organizational tenure was 5.3 years (5.6 percent not
specified).
Measures. The Q4TE was measured with the two items per scale identified in study 1
with an 11-point answering scale (see study 1).
Data analysis. CFA using the MLR-estimator in Mplus 6 (Muthén & Muthén, 1998–
2010) was applied to investigate the stability of the underlying factor structure.

Model 1 (see Figure 3) with six latent, intercorrelated factors, each measured by two
items, was specified. The SRMR value indicated good model fit; the CFI, χ²/d.f. and RMSEA
values indicated acceptable model fit (see Table 2). The six latent, intercorrelated factors
are satisfaction, utility, knowledge, application to practice, individual organizational
results and global organizational results. We found moderate to high relationships
between all Q4TE scales. Intercorrelations between the six latent factors ranged
between 0.43 (between knowledge and global organizational results) and 0.91 (between
individual and global organizational results).
As a consequence of the high intercorrelations and two modification indices greater
than 10, we investigated a second-order factor model (see model 2, Figure 3) in line
with Wang and Wilcox (2006) and a four-factor model (see model 3, Figure 3) based on
Figure 3: CFA models for investigating the underlying factor structure in study 2
(n = 287). Error terms are not depicted. All three models use the same 12 items (s1–s2,
u1–u2, k1–k2, a1–a2, i1–i2, g1–g2). Model 1: six intercorrelated factors (SAT, UT, KNOW,
APP, I-OR, G-OR). Model 2: the six first-order factors with Short-TE (SAT, UT, KNOW) and
Long-TE (APP, I-OR, G-OR) as second-order factors. Model 3: four factors, Level 1 (items
s1–u2), Level 2 (k1–k2), Level 3 (a1–a2) and Level 4 (i1–g2). SAT = satisfaction,
UT = utility, KNOW = knowledge, APP = application to practice, I-OR = individual
organizational results, G-OR = global organizational results, Short-TE = short-term
evaluation, Long-TE = long-term evaluation.

the assumptions by Kirkpatrick (1967). For these two competing models, CFA results
indicated insufficient model fit indices except for an acceptable SRMR value (see
Table 2). Model 1 with six latent, intercorrelated factors fits the study 2 data best
(see Table 2). Addressing research question 1, these results lend further support to the
underlying factor structure of the Q4TE: the six latent, intercorrelated factors underly-
ing the Q4TE correspond to satisfaction, utility, knowledge, application to practice,
individual organizational results and global organizational results. The CFA results do
not support two latent second-order factors, which may be interpreted as short- and
long-term evaluation (cf. Wang & Wilcox, 2006), or a four-level model (cf. Kirkpatrick,
1967), compared with the six-factor model.
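
The χ²/d.f. and RMSEA columns of Table 2 can be recomputed from the reported χ², d.f. and sample size, which makes the comparison of the three models easy to retrace. The sketch below is an illustration only: the RMSEA formula uses N (rather than N − 1) in the denominator, which is an assumption about the software convention, and the χ² values are taken as printed in Table 2.

```python
import math


def chi2_df_ratio(chi2, df):
    """Ratio of the chi-square statistic to its degrees of freedom."""
    return chi2 / df


def rmsea(chi2, df, n):
    """RMSEA point estimate; N (not N - 1) in the denominator is an assumption."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))


# chi-square, d.f. and sample size as reported for study 2 in Table 2
study2 = {
    "Model 1": (100.40, 39, 287),
    "Model 2": (167.51, 47, 287),
    "Model 3": (215.81, 48, 287),
}
for name, (chi2, df, n) in study2.items():
    print(name, round(chi2_df_ratio(chi2, df), 2), round(rmsea(chi2, df, n), 3))
```

Under this convention, the recomputed values match the table to the reported precision, and both indices order the models in the same way (Model 1 before Model 2 before Model 3).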
Study 3 – differential and discriminant validity of the Q4TE

Method
Procedure and participants. Using an online survey, we recruited a final sample of
n = 439 employees from different branches (e.g. health care, public service and retail
sector) who, according to their own statement, answered seriously. We included only
training courses that dated back between 4 weeks and 2 years (see study 1). The final
data set comprised diverse training contents. The majority of trainings (47.8 percent)
focused on closed skills, 32.8 percent dealt with open skills and 18.7 percent focused
on both open as well as closed skills (0.7 percent not specified). Participants’ age
ranged from 20 to 66 years with an average of 37.6 years (1.1 percent not specified).
The gender ratio was nearly balanced (44.2 percent male, 55.4 percent female, 0.4
percent not specified). Organizational tenure was 7.9 years on average (6.8 percent
not specified).
Measures. The Q4TE was measured with the two items per scale identified in study
1 and cross-validated in study 2, with an 11-point answering scale (see study 1).
Transfer to practice was measured with the item ‘Have you been able to transfer
training contents to practice?’, which had to be rated with yes or no (an adaptation of
Kauffeld et al., 2008, 2009). Moreover, we measured transfer quantity as the number
of steps transferred to practice (Kauffeld et al., 2008, 2009; Kauffeld & Lehmann-
Willenbrock, 2010) and used it as a more elaborated measure of transfer. The par-
ticipants were asked to write down up to 10 training contents they had been able to
transfer to practice.
Data analysis. We investigated group differences between employees who could trans-
fer training contents to practice and those who could not by means of two separate
multivariate analysis of covariance (MANCOVA) analyses in PASW (covering short-
and long-term outcomes, respectively). To investigate the relationship between the
Q4TE scales and transfer quantity, we conducted a multiple regression analysis in
PASW.

Means, standard deviations, intercorrelations and internal consistency values are
depicted in Table 3. Internal consistency values ranged between 0.79 and 0.96. Prior to
the analyses, we investigated the influence of age, gender, organizational tenure, time
lag between training and survey, course duration as well as type of training content on
our study variables (for intercorrelations, see Table 3). There were no gender effects. As
we found some significant correlations between the short-term outcomes (utility and
knowledge) and organizational tenure, time lag between training and survey, course
duration as well as type of training content, we included these variables as covariates in
our MANCOVA analysis of the short-term outcomes. Moreover, results showed some
significant correlations between our long-term outcomes (application to practice, indi-
vidual and global organizational results) and organizational tenure, course duration as
well as type of training content. Thus, we included these three variables as covariates

Table 3: Means, standard deviations, intercorrelations and reliability (internal consistency) values in study 3

| Scales | M | SD | SAT | UT | KNOW | APP | I-OR | G-OR | T | TQ |
|---|---|---|---|---|---|---|---|---|---|---|
| Satisfaction (SAT) | 7.99 | 2.31 | (0.91) | | | | | | | |
| Utility (UT) | 7.03 | 2.79 | 0.67** | (0.96) | | | | | | |
| Knowledge (KNOW) | 7.40 | 2.49 | 0.72** | 0.73** | (0.91) | | | | | |
| Application to pract. (APP) | 5.72 | 2.90 | 0.50** | 0.79** | 0.58** | (0.90) | | | | |
| Individual org. results (I-OR) | 4.46 | 3.13 | 0.44** | 0.66** | 0.53** | 0.78** | (0.86) | | | |
| Global org. results (G-OR) | 3.11 | 2.92 | 0.28** | 0.47** | 0.38** | 0.61** | 0.74** | (0.79) | | |
| Transfer (T) a, b | 0.75 | 0.43 | 0.35** | 0.44** | 0.38** | 0.51** | 0.40** | 0.32** | | |
| Transfer quantity (TQ) a | 2.18 | 2.17 | 0.30** | 0.37** | 0.33** | 0.42** | 0.38** | 0.29** | 0.59** | |
| Age | 37.59 | 10.49 | 0.03 | 0.06 | 0.01 | 0.09 | 0.04 | 0.03 | 0.09 | 0.15** |
| Gender c | 1.44 | 0.50 | -0.07 | 0.01 | -0.03 | 0.04 | -0.03 | 0.03 | 0.04 | 0.00 |
| Organizational tenure | 7.88 | 8.90 | 0.04 | 0.10* | 0.08 | 0.15** | 0.09 | 0.04 | 0.07 | 0.14** |
| Time lag between training and survey (months) | 7.42 | 6.29 | -0.04 | -0.10* | -0.03 | -0.07 | -0.05 | -0.04 | -0.01 | 0.07 |
| Course duration (hours) | 90.36 | 446.89 | 0.04 | 0.10* | 0.10* | 0.12* | 0.15** | 0.14** | 0.05 | 0.15** |
| Type of training content d, e | 1.85 | 0.89 | 0.08 | -0.10* | -0.01 | -0.09* | -0.03 | -0.02 | 0.04 | 0.05 |

Note: Internal consistency values calculated with Cronbach’s α are presented diagonally in parentheses.
* p < 0.05, ** p < 0.01 (2-sided significance).
a No internal consistency value was calculated for transfer and transfer quantity (one item each).
b Transfer: 1 = yes, 0 = no.
c Gender: 1 = female, 2 = male.
d 1 = closed skills, 2 = both (open and closed skills) and 3 = open skills.
e Kendall’s τ correlations are depicted because type of training content is an ordinal variable.
M = mean, SD = standard deviation.

Table 4: Differences between participants who could transfer training contents into practice
and participants who could not are shown by analysis of covariance results in study 3

| Scales | Transfer: Yes, M | Transfer: Yes, SD | Transfer: No, M | Transfer: No, SD | Transfer: Yes/No, F |
|---|---|---|---|---|---|
| Short-term evaluation scales a | n = 300 | | n = 101 | | |
| Satisfaction (SAT) | 8.45 | 1.88 | 6.63 | 2.84 | 50.30** |
| Utility (UT) | 7.72 | 2.28 | 4.88 | 2.96 | 100.60** |
| Knowledge (KNOW) | 7.92 | 2.00 | 5.83 | 3.03 | 59.00** |
| Long-term evaluation scales b | n = 299 | | n = 99 | | |
| Application to practice (APP) | 6.55 | 2.44 | 3.12 | 2.60 | 141.37** |
| Individual org. results (I-OR) | 5.09 | 2.92 | 2.24 | 2.55 | 72.92** |
| Global org. results (G-OR) | 3.60 | 2.91 | 1.51 | 2.18 | 41.38** |

Note: MANCOVA, missing listwise.
* p < 0.05, ** p < 0.01.
a We included organizational tenure, time lag between training and survey, course duration as
well as type of training content as covariates in our MANCOVA analysis of short-term outcomes.
As type of training content was an ordinal variable (three categories: 1 = closed skills, 2 = both (open
and closed skills) and 3 = open skills), it was included as a dummy-coded covariate with closed skills
as the reference category: Pillai’s trace (org. tenure) = 0.01, F(3, 392) = 0.92, p = 0.43; Pillai’s trace
(time lag) = 0.02, F(3, 392) = 3.14, p < 0.05; Pillai’s trace (duration) = 0.01, F(3, 392) = 1.32, p = 0.27;
Pillai’s trace (type of training content: both open and closed skills) = 0.02, F(3, 392) = 2.09, p = 0.10;
Pillai’s trace (type of training content: open skills) = 0.05, F(3, 392) = 6.63, p < 0.01.
b We included organizational tenure, course duration and dummy-coded type of training content
as covariates in our MANCOVA analysis of long-term outcomes: Pillai’s trace (org. tenure) = 0.02,
F(3, 390) = 2.06, p = 0.11; Pillai’s trace (duration) = 0.02, F(3, 390) = 2.66, p < 0.05; Pillai’s trace
(type of training content: both open and closed skills) = 0.01, F(3, 390) = 1.10, p = 0.35; Pillai’s trace
(type of training content: open skills) = 0.02, F(3, 390) = 2.13, p = 0.10.
M = mean, SD = standard deviation.
in our MANCOVA analysis of the long-term outcomes. In our regression analysis,
we included age, organizational tenure and course duration as covariates because
we found significant correlations with our dependent variable transfer quantity (see
Table 3).
Addressing research question 2, we tested group differences between employees
who could transfer training contents to practice and employees who could not by
means of two separate MANCOVA analyses. We found significant group differences
(see Table 4) for all short-term (Pillai’s trace = 0.21, F(3, 392) = 33.93, p < 0.01) and for all
long-term evaluation scales (Pillai’s trace = 0.27, F(3, 390) = 46.89, p < 0.01). Descriptive
statistics showed that employees who could transfer training contents to practice
showed higher values on all Q4TE scales (short- and long-term evaluation scales) than
employees who could not.
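
The multivariate test statistics above can be related to each other with a small sketch. It assumes that the transfer effect has a single hypothesis degree of freedom (two groups), in which case Pillai’s trace V and the F value are linked exactly by F = V / (1 − V) × d.f.2 / d.f.1. The function names are hypothetical and the inputs are the rounded values reported in the text.

```python
def pillai_from_f(f_value, df_num, df_den):
    """Pillai's trace implied by an exact F test with one hypothesis d.f. (s = 1)."""
    return f_value * df_num / (f_value * df_num + df_den)


def f_from_pillai(v, df_num, df_den):
    """Inverse relation for s = 1: F = V / (1 - V) * df_den / df_num."""
    return v / (1.0 - v) * df_den / df_num


# F values and degrees of freedom as reported for the transfer effect
short_term = pillai_from_f(33.93, 3, 392)
long_term = pillai_from_f(46.89, 3, 390)
print(round(short_term, 2), round(long_term, 2))
```

The implied traces round to the reported 0.21 and 0.27. Going in the other direction from the rounded traces gives slightly different F values, because two decimals of V do not determine F precisely.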
The investigation of transfer quantity provided a more differentiated picture
(see Table 5). Addressing research question 3, a multiple regression analysis showed
that only the Q4TE scale application to practice had a significant positive relationship
with transfer quantity (β = 0.26, p < 0.01).
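
As a plausibility check on the regression summary in Table 5, the adjusted R² values can be recomputed from the rounded R² values, the sample size (n = 391) and the number of predictors in each step (three covariates in step 1, nine predictors in step 2). This is a minimal sketch with a hypothetical function name; it works from rounded inputs only.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a regression with n cases and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# R^2, sample size and number of predictors as reported in Table 5
step1 = adjusted_r2(0.04, n=391, k=3)  # covariates only
step2 = adjusted_r2(0.22, n=391, k=9)  # covariates plus the six Q4TE scales
print(round(step1, 2), round(step2, 2))
```

Both values round to the adjusted R² reported in Table 5 (0.03 and 0.20), so the penalty for the six additional predictors is small relative to the gain in explained variance.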
General discussion and conclusions

The present paper focused on the development and psychometric investigation of a
summative evaluation questionnaire. The Q4TE meets the following requirements for

Table 5: Results of multiple regression analysis in study 3

Dependent variable: Transfer quantity (TQ)

| | Step 1, β | Step 2, β |
|---|---|---|
| Covariates | | |
| Age | 0.13 | 0.13* |
| Organizational tenure | 0.06 | 0.00 |
| Course duration | 0.08 | 0.03 |
| Correlates (independent variables) | | |
| Satisfaction (SAT) | – | 0.12 |
| Utility (UT) | – | -0.05 |
| Knowledge (KNOW) | – | 0.05 |
| Application to practice (APP) | – | 0.26** |
| Individual org. results (I-OR) | – | 0.11 |
| Global org. results (G-OR) | – | 0.03 |
| R² | 0.04 | 0.22 |
| R²adj | 0.03 | 0.20 |
| F | 4.88** | 11.63** |

Note: Multiple regression analysis using the enter method (n = 391), missing listwise.
β = standardized regression coefficient. We included age, organizational tenure, and course
duration as covariates in step 1, and all Q4TE variables in step 2.
* p < 0.05, ** p < 0.01 (2-sided significance).
professional training evaluation measures: it covers short- and long-term training out-
comes (cf. Wang & Wilcox, 2006) and provides high usability in terms of time efficiency
(cf. Aguinis & Kraiger, 2009). Moreover, it shows promising psychometric properties
(cf. Aiken & Groth-Marnat, 2006).
Our analyses yielded a time-efficient measure for summative training evaluation
that is generalizable to diverse training contents and contexts. We established sound
psychometric properties and demonstrated good or at least satisfactory internal con-
sistency values for all Q4TE scales (Nunnally & Bernstein, 1994, p. 265). In study 1,
the final Q4TE form was successfully identified by means of EFA and CFA. In study
2, CFA results clearly support a model with six latent factors (satisfaction, utility,
knowledge, application to practice, individual organizational results and global
organizational results) over two competing models (following either Wang & Wilcox,
2006 or Kirkpatrick & Kirkpatrick, 2006). Addressing our first research question,
study 2 results underscore the importance of distinguishing single training
outcomes (e.g. satisfaction and utility). However, if one has to aggregate evaluation
data on a higher level in future studies (e.g. if a model is otherwise too complex),
EFA results in study 1 clearly indicate that the distinction between short- and long-
term evaluation following Wang and Wilcox (2006) is appropriate for aggregation. By
contrast, study 2 results revealed no sufficient fit for a model in line with Wang and
Wilcox (2006). Moreover, we found no sufficient fit for a model following Kirkpatrick
and Kirkpatrick (2006), except for an acceptable SRMR value for both models. Yet, the
detailed investigation of the CFA results showed slight model improvements for the
model in line with Wang and Wilcox (2006) compared with Kirkpatrick’s (1967)
framework. In sum, our analyses clearly support a six-factor solution (satisfaction,
utility, knowledge, application to practice, individual organizational results and
global organizational results) and hint at the appropriateness of distinguishing
between short- and long-term outcomes (cf. Wang & Wilcox, 2006) if aggregating
evaluation data is necessary.

Addressing our second research question, we found significant differences between
participants who could transfer training contents to practice and those who could not.
Participants who reported having managed to transfer training contents showed higher
values on all Q4TE scales than participants who reported not having done so.
These findings provide first hints at the differential validity of the Q4TE and show that
the Q4TE successfully differentiates between persons who transfer and those who do
not. Addressing our third research question, we established a relationship between the
Q4TE scale application to practice and transfer quantity. This finding lends support to
the discriminant validity of the Q4TE because only the application to practice scale was
linked to transfer quantity. This is in line with previous theorizing, as both are assumed
to measure transfer, which refers to behavior change back at work (e.g. Aguinis &
Kraiger, 2009; Kauffeld et al., 2008). In sum, our analyses hint at the construct validity
of the Q4TE.
The Q4TE heeds the call for more efficient tools for training evaluation in a fast-
moving business environment (Aguinis & Kraiger, 2009). It allows for summative
training evaluation in a time-efficient and psychometrically sound manner. Unlike
training-specific evaluation measures (for examples, see Salas & Cannon-Bowers,
2001), the Q4TE permits comparisons of training courses within and between organi-
zations. While being applicable to a wide variety of training courses, the Q4TE offers
valuable information beyond the reaction level by addressing short- and long-term
outcomes (Wang & Wilcox, 2006) and covering participants’ reactions, learning, trans-
fer and organizational outcomes (Kirkpatrick & Kirkpatrick, 2006). In line with previ-
ous theorizing (e.g. Alliger et al., 1997), our results underscore the distinction between
single scales (e.g. satisfaction and utility). Moreover, the present results provide first
hints at the differential and discriminant validity of the Q4TE.
Limitations
The present study has several limitations. First, the psychometric examination of the
Q4TE relied entirely on computer-based, cross-sectional, retrospective samples. As all
scales were measured at the same level of specificity and at the same time, higher
intercorrelations between the Q4TE scales are observed in contrast to values reported
in several meta-analyses based on Kirkpatrick’s four-level framework (see Alliger et al.,
1997). To reduce the potential bias inherent in the present research design, future
research should include a time lag between the short-term (e.g. satisfaction) and long-
term evaluation scales (e.g. application to practice; see Podsakoff et al., 2012). However,
the retrospective online samples used in the three studies offered the opportunity to
obtain three diverse data sets from different organizations and training programs,
while avoiding missing data. These design characteristics were important for our
research aim of developing an inventory that is not training-specific, but widely appli-
cable to professional training evaluation.
Second, the Q4TE consists of self-report items only, which can be a source of common
method bias (e.g. Podsakoff et al., 2012). One possibility to deal with common method
bias is to apply subsequent statistical procedures (for an overview, see Podsakoff et al.,
2012). However, to date, there is still a scientific debate on whether and how to apply
statistical procedures for dealing with common method bias (e.g. Conway & Lance,
2010). As Conway and Lance (2010) pointed out, ‘[n]o post hoc statistical correction
procedure can be recommended until additional research evaluates the relative effec-
tiveness of those that have been proposed’ (p. 332). Furthermore, assessing level 3
(behavior) and level 4 (organizational results) by means of self-reports runs contrary
to some recommendations, e.g. to use behavioral observations to measure level 3
(e.g. Wang & Wilcox, 2006). However, using self-report measures seems appropriate
because the participant himself or herself is widely regarded as a valid data source for
many psychological constructs (Spector, 2006). For example, several studies have
shown that self-report measures reflect specific learning outcomes appropriately (for
an overview, see Kraiger et al., 1993). Furthermore, using standardized self-report
questionnaires is the only possibility to get a quick overview of organization-wide
training evaluation. However, future studies should combine the Q4TE with analyses
of return on investment data, training-specific tests or peer ratings of training benefits
to allow for more in-depth evaluation.
Third, each of the Q4TE scales comprises two items only. Some researchers have
proposed that questionnaires should generally comprise scales with four to six items
each (e.g. Hinkin, 1998). Moreover, Credé et al. (2012) have pointed out disadvantages
of short questionnaires, for example, that short measures cannot reflect constructs that
are complex and cover a wide range of contents (see also Loo, 2002). However, short
measures also provide key benefits, especially for practitioners. For example, they are
easy to apply and provide a time-efficient method for collecting data in large samples
(Loo, 2002). Especially for practice but also for field research, there often is only the
choice between applying a short measure and not collecting data due to time con-
straints (as detailed for personality measures in Credé et al., 2012). Moreover, previous
empirical studies have underscored that short measures can be an appropriate tool for
measuring psychological variables (e.g. Bergkvist & Rossiter, 2007). When the
psychometric properties of single- and two-item measures were compared, the latter clearly
outperformed the former (Credé et al., 2012). Thus, our somewhat minimalistic approach
with only two items per scale served our goal of developing a time-efficient and widely
applicable questionnaire well. Furthermore, internal consistency values of the Q4TE
are comparable to the initial Q4TE form reported by Kauffeld et al. (2009), which is
noteworthy with respect to the small number of items per scale.
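
For readers who wish to retrace how internal consistency behaves for a two-item scale, the following sketch computes Cronbach’s α from raw item scores and compares it with the Spearman–Brown value based on the inter-item correlation. The ratings are hypothetical and are not taken from the Q4TE samples; the two coefficients coincide here because the invented items have equal variances.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for a respondents x items array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)


def spearman_brown_two_items(r):
    """Reliability of a two-item scale with inter-item correlation r."""
    return 2.0 * r / (1.0 + r)


# Hypothetical ratings of five respondents on the two items of one scale
ratings = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
r = np.corrcoef(ratings[:, 0], ratings[:, 1])[0, 1]
print(round(cronbach_alpha(ratings), 3), round(spearman_brown_two_items(r), 3))
```

The example shows that a two-item scale only reaches a high internal consistency value when its two items correlate strongly; in the invented data, an inter-item correlation of 0.80 yields a value of about 0.89.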
Fourth, all analyses were based on German samples. However, our psychometric
investigations were based on diverse samples from different organizations to allow for
more general conclusions. Future studies should explore whether the present findings
could be generalized to other cultural backgrounds.
Implications for practice and future research

To gather comparable results of summative evaluation, the easiest and most time-efficient
way is to implement a standard evaluation questionnaire. Grounded in a model widely used
in practice (cf. Kirkpatrick & Kirkpatrick, 2006; Wang & Wilcox, 2006), the Q4TE shows
good psychometric properties and readily addresses the needs of practitioners who face
tight time constraints. For example, if an
organization implements spaced (i.e. time intervals between training sessions) and
massed trainings (i.e. no time interval between training sessions; see, e.g. Hesketh,
1997), the Q4TE can reveal important differences between both training types. Partici-
pants of the massed training might report higher learning outcomes at first but show a
lower amount of transfer to practice. By contrast, participants of the spaced training
might report lower learning outcomes at first, but show a higher amount of transfer
because they had time to practice. In this situation, the Q4TE can provide hints at which
training shows a higher amount of transfer and is thereby more beneficial for the
organization (for a recent field study on massed vs. spaced trainings, see Kauffeld &
Lehmann-Willenbrock, 2010). In our example, training professionals can reveal the lack
of transfer in massed training using the Q4TE and can integrate transfer-enhancing
techniques (e.g. include more job-related exercises; cf. Kauffeld et al., 2008) or adapt the
training design (e.g. include intervals between training sessions) to improve training
transfer.
Although developed for practical demands, the Q4TE may also be a valuable tool for
field researchers facing time constraints. The brevity of the Q4TE facilitates larger
studies on training transfer models in which training outcomes are only a minor
part of the investigation. Models incorporating both summative evaluation and
transfer-related factors allow determining whether and why a training program works
(e.g. Holton, 2005). Combining the Q4TE with measures of transfer-related variables
(e.g. Learning Transfer System Inventory; Holton et al., 2000) will enable field research-
ers to examine more elaborated transfer models (e.g. Baldwin & Ford, 1988; Holton,
2005). The brevity of the Q4TE facilitates the investigation of possible moderators
(e.g. training content or participant features like hierarchical position) and mediators
(e.g. motivation to transfer) in order to get a more thorough understanding of determinants
of successful training transfer (e.g. Blume et al., 2010; Giangreco et al., 2010;
Grossman & Salas, 2011; Kauffeld et al., 2008). Moreover, training courses may be
compared within and between organizations using the Q4TE in future studies. In
doing so, researchers can take the multilevel structure of data into account to reveal
potential organization-level or training-level effects.
The Q4TE provides valuable information for comparing benefits of training courses
in a time-efficient manner. However, to provide information for individual assessment
(e.g. concerning the individual participant), we strongly recommend combining the
Q4TE with other evaluation measures. For example, the learning level may be com-
bined with knowledge tests (e.g. Wang & Wilcox, 2006). The assessment of training
transfer can be enhanced by behavioral ratings on the job (for examples, see Salas &
Cannon-Bowers, 2001), and the organizational level may be extended by return on
investment analysis (e.g. Phillips, 1997).
In sum, the Q4TE is a widely applicable training evaluation questionnaire with
sound psychometric properties that addresses both short- and long-term evaluation
outcomes (cf. Wang & Wilcox, 2006). Due to its brevity, the Q4TE provides valuable and
standardized information for evaluating training benefits while being easy to imple-
ment in organizational practice.
References
Aguinis, H. and Kraiger, K. (2009), ‘Benefits of training and development for individuals and
teams, organizations, and society’, Annual Review of Psychology, 60, 451–74.
Aiken, L. R. and Groth-Marnat, G. (2006), Psychological Testing and Assessment, 12th edn (Boston,
MA: Pearson Education).
Alliger, G. M. and Janak, E. A. (1989), ‘Kirkpatrick’s levels of training criteria: thirty years later’,
Personnel Psychology, 42, 331–42.
Alliger, G. M., Tannenbaum, S. I., Bennett, W. Jr, Traver, H. and Shotland, A. (1997), ‘A meta-
analysis of the relations among training criteria’, Personnel Psychology, 50, 341–58.
Alvarez, K., Salas, E. and Garofano, C. M. (2004), ‘An integrated model of training evaluation and
effectiveness’, Human Resource Development Review, 3, 385–416.
Arthur, W. Jr, Bennett, W. Jr, Edens, P. S. and Bell, S. T. (2003), ‘Effectiveness of training in
organizations: a meta-analysis of design and evaluation features’, Journal of Applied Psychology,
88, 234–45.
Baldwin, T. T. and Ford, J. K. (1988), ‘Transfer of training: a review and directions for future
research’, Personnel Psychology, 41, 63–105.
Bates, R. A. (2004), ‘A critical analysis of evaluation practice: the Kirkpatrick model and the
principle of beneficence’, Evaluation and Program Planning, 27, 341–7.
Bentler, P. M. and Chou, C.-P. (1987), ‘Practical issues in structural modeling’, Sociological Methods
& Research, 16, 78–117.
Bergkvist, L. and Rossiter, J. R. (2007), ‘The predictive validity of multiple-item versus single-item
measures of the same constructs’, Journal of Marketing Research, 44, 175–84.
Bihler, W. (2006), Weiterbildungserfolg in betrieblichen Lehrveranstaltungen: Messung und Einflussfak-
toren im Bereich Finance & Controlling [Success of Advanced Training in Operating Courses:
Measurement and Determinants in Finance and Controlling] (Wiesbaden: Dt. Univ.-Verlag).
Blau, G., Gibson, G., Bentley, M. and Chapman, S. (2012), ‘Testing the impact of job-related
variables on a utility judgment training criterion beyond background and affective reaction
variables’, International Journal of Training and Development, 16, 54–66.
Blume, B. D., Ford, J. K., Baldwin, T. T. and Huang, J. L. (2010), ‘Transfer of training: a meta-
analytic review’, Journal of Management, 36, 1065–105.
Broad, M. L. (1997), ‘Overview of transfer of training: from learning to performance’, Performance
Improvement Quarterly, 10, 2, 7–21.
Brown, K. G. (2005), ‘An examination of the structure and nomological network of trainee
reactions: a closer look at “smile sheets” ’, Journal of Applied Psychology, 90, 991–1001.
Byrne, B. M. (2005), ‘Factor analytic models: viewing the structure of an assessment instrument
from three perspectives’, Journal of Personality Assessment, 85, 17–32.
Combs, J., Liu, Y., Hall, A. and Ketchen, D. (2006), ‘How much do high-performance work
practices matter? A meta-analysis of their effects on organizational performance’, Personnel
Psychology, 59, 501–28.

Conway, J. M. and Lance, C. E. (2010), ‘What reviewers should expect from authors regarding
common method bias in organizational research’, Journal of Business and Psychology, 25, 325–34.
Credé, M., Harms, P. D., Niehorster, S. and Gaye-Valentine, A. (2012), ‘An evaluation of the
consequences of using short measures of the Big Five personality traits’, Journal of Personality
and Social Psychology, 102, 874–88.
Deschler, S., Mandl, H. and Winkler, K. (2005), Konzeption, Entwicklung und Evaluation einer
video- und textbasierten virtuellen Lernumgebung für eine Bundesbehörde [Conceptualization,
development and evaluation of a video- and text-based virtual learning environment for
a governmental organization]. Practice report no. 31 (LMU München: Institut für Pädagogische
Psychologie). Available at http://epub.ub.uni-muenchen.de/690/1/Praxisbericht31.pdf
(accessed 20 September 2012).
Devos, C., Dumay, X., Bonami, M., Bates, R. A. and Holton, E. F. III (2007), ‘The Learning Transfer
System Inventory (LTSI) translated into French: internal structure and predictive validity’,
International Journal of Training and Development, 11, 181–99.
Drolet, A. L. and Morrison, D. G. (2001), ‘Do we really need multiple-item measures in service
research?’, Journal of Service Research, 3, 196–204.
Giangreco, A., Carugati, A. and Sebastiano, A. (2010), ‘Are we doing the right thing? Food for
thought on training evaluation and its context’, Personnel Review, 39, 162–77.
Gnefkow, T. (2008), Lerntransfer in der betrieblichen Weiterbildung: Determinanten für den
Erfolg externer betrieblicher Weiterbildungen im Lern- und Funktionsfeld aus Teilnehmerperspektive
[Learning transfer in operative training: Determinants for the success of external
advanced training in the learning and function field from the participants’ perspective].
Unpublished doctoral dissertation, Bielefeld University. Available at
http://d-nb.info/989349004/34 (accessed 20 September 2012).
Green, M. and McGill, E. (2011), State of the Industry, 2011 (Alexandria, VA: American Society for
Training and Development). Available at
http://www.astd.org/Publications/Research-Reports/2011/2011-State-of-the-Industry-Report
(accessed 20 September 2012).
Grossman, R. and Salas, E. (2011), ‘The transfer of training: what really matters’, International
Journal of Training and Development, 15, 103–20.
Hesketh, B. (1997), ‘Dilemmas in training for transfer and retention’, Applied Psychology: An
International Review, 46, 317–39.
Hinkin, T. R. (1998), ‘A brief tutorial on the development of measures for use in survey
questionnaires’, Organizational Research Methods, 1, 104–21.
Holton, E. F. III (2005), ‘Holton’s evaluation model: new evidence and construct elaborations’,
Advances in Developing Human Resources, 7, 37–54.
Holton, E. F. III, Bates, R. A. and Ruona, W. E. A. (2000), ‘Development of a generalized learning
transfer system inventory’, Human Resource Development Quarterly, 11, 333–60.
Ironson, G. H., Smith, P. C., Brannick, M. T., Gibson, W. M. and Paul, K. B. (1989), ‘Construction
of a job in general scale: a comparison of global, composite, and specific measures’, Journal of
Applied Psychology, 74, 193–200.
Kauffeld, S., Bates, R. A., Holton, E. F. III and Müller, A. C. (2008), ‘Das deutsche
Lerntransfer-System-Inventar (GLTSI): Psychometrische Überprüfung der deutschsprachigen Version [The
German version of the Learning Transfer System Inventory (GLTSI): psychometric validation]’,
Zeitschrift für Personalpsychologie, 7, 50–69.
Kauffeld, S., Brennecke, J. and Strack, M. (2009), ‘Erfolge sichtbar machen: Das Maßnahmen-
Erfolgs-Inventar (MEI) zur Bewertung von Trainings [Visualizing Training Outcomes: The MEI
for Training Evaluations]’, in S. Kauffeld, S. Grote and E. Frieling (eds), Handbuch
Kompetenzentwicklung (Stuttgart: Schäffer-Poeschel), pp. 55–78.
Kauffeld, S. and Lehmann-Willenbrock, N. (2010), ‘Sales training: effects of spaced practice on
training transfer’, Journal of European Industrial Training, 34, 23–37.
Kaufman, R., Keller, J. and Watkins, R. (1996), ‘What works and what doesn’t: evaluation beyond
Kirkpatrick’, Performance and Instruction, 35, 2, 8–12.
Kirkpatrick, D. L. (1967), ‘Evaluation of Training’, in R. L. Craig and L. R. Bittel (eds), Training
and Development Handbook: A Guide to Human Resource Development (New York: McGraw-Hill),
pp. 87–112.
Kirkpatrick, D. L. and Kirkpatrick, J. D. (2006), Evaluating Training Programs: The Four Levels, 3rd
edn (San Francisco, CA: Berrett-Koehler).
Kline, P. (1986), A Handbook of Test Construction: Introduction to Psychometric Design (London:
Methuen).
Kraiger, K. (2002), ‘Decision-based Evaluation’, in K. Kraiger (ed.), Creating, Implementing, and
Managing Effective Training and Development: State-of-the-Art Lessons for Practice (San Francisco,
CA: Jossey-Bass), pp. 331–75.

Kraiger, K., Ford, J. K. and Salas, E. (1993), ‘Application of cognitive, skill-based, and affective
theories of learning outcomes to new methods of training evaluation’, Journal of Applied
Psychology, 78, 311–28.
Laker, D. R. and Powell, J. L. (2011), ‘The differences between hard and soft skills and their
relative impact on training transfer’, Human Resource Development Quarterly, 22, 111–22.
Lavelle, J. J., Rupp, D. E. and Brockner, J. (2007), ‘Taking a multifoci approach to the study
of justice, social exchange, and citizenship behavior: the target similarity model’, Journal of
Management, 33, 841–66.
Lim, D. H. and Morris, M. L. (2006), ‘Influence of trainee characteristics, instructional satisfaction,
and organizational climate on perceived learning and training transfer’, Human Resource
Development Quarterly, 17, 85–115.
Linton, M. (1982), ‘Transformations of Memory in Everyday Life’, in U. Neisser (ed.), Memory
Observed: Remembering in Natural Contexts (San Francisco, CA: Freeman), pp. 77–91.
Loo, R. (2002), ‘A caveat on using single-item versus multiple-item scales’, Journal of Managerial
Psychology, 17, 68–75.
Martin, H. J. (2010), ‘Workplace climate and peer support as determinants of training transfer’,
Human Resource Development Quarterly, 21, 87–104.
Mathieu, J. E., Tannenbaum, S. I. and Salas, E. (1992), ‘Influences of individual and situational
characteristics on measures of training effectiveness’, Academy of Management Journal, 35,
828–47.
May, G. L. and Kahnweiler, W. M. (2000), ‘The effect of a mastery practice design on learning and
transfer in behavior modeling training’, Personnel Psychology, 53, 353–73.
Morgan, R. B. and Casper, W. J. (2000), ‘Examining the factor structure of participant reactions
to training: a multidimensional approach’, Human Resource Development Quarterly, 11,
301–17.
Muthén, L. K. and Muthén, B. O. (1998–2010), Mplus User’s Guide, 6th edn (Los Angeles, CA:
Muthén & Muthén).
Nickols, F. W. (2005), ‘Why a stakeholder approach to evaluating training’, Advances in Developing
Human Resources, 7, 121–34.
Nunnally, J. C. and Bernstein, I. H. (1994), Psychometric Theory, 3rd edn (New York: McGraw-Hill).
Ong, C.-S., Lai, J.-Y. and Wang, Y.-S. (2004), ‘Factors affecting engineers’ acceptance of
asynchronous e-learning systems in high-tech companies’, Information & Management, 41, 795–804.
Pershing, J. A. and Pershing, J. L. (2001), ‘Ineffective reaction evaluation’, Human Resource
Development Quarterly, 12, 73–90.
Phillips, J. J. (1997), Return on Investment in Training and Performance Improvement Programs (Houston,
TX: Gulf Publishing).
Podsakoff, P. M., MacKenzie, S. B. and Podsakoff, N. P. (2012), ‘Sources of method bias in social
science research and recommendations on how to control it’, Annual Review of Psychology, 63,
539–69.
Rammstedt, B. (2007), ‘The 10-item Big Five Inventory (BFI-10): norm values and investigation of
socio-demographic effects based on a German population representative sample’, European
Journal of Psychological Assessment, 23, 193–201.
Rammstedt, B. and John, O. P. (2007), ‘Measuring personality in one minute or less: a 10-item
short version of the Big Five Inventory in English and German’, Journal of Research in
Personality, 41, 203–12.
Reed, A. (2001), Innovation in Human Resource Management: Tooling up for the Talent Wars (London:
CIPD).
Roche, W. K., Teague, P., Coughlan, A. and Fahy, M. (2011), Human Resources in the Recession:
Managing and Representing People at Work in Ireland (Dublin: Government Publications). Available at
http://www.ucd.ie/t4cms/Human%20Resources%20in%20the%20Recession%20Book%20Manuscript.pdf
(accessed 20 September 2012).
Saks, A. M. and Burke, L. A. (2012), ‘An investigation into the relationship between training
evaluation and the transfer of training’, International Journal of Training and Development, 16,
118–27.
Salas, E. and Cannon-Bowers, J. A. (2001), ‘The science of training: a decade of progress’, Annual
Review of Psychology, 52, 471–99.
Schermelleh-Engel, K., Moosbrugger, H. and Müller, H. (2003), ‘Evaluating the fit of structural
equation models: tests of significance and descriptive goodness-of-fit measures’, Methods
of Psychological Research Online, 8, 2, 23–74.
Schermuly, C. C., Schröder, T., Nachtwei, J., Kauffeld, S. and Gläs, K. (2012), ‘Die Zukunft der
Personalentwicklung. Eine Delphi-Studie [The future of human resource development – a
Delphi study]’, Zeitschrift für Arbeits- und Organisationspsychologie, 56, 111–22.

Schweizer, K. (2010), ‘Some guidelines concerning the modeling of traits and abilities in test
construction’, European Journal of Psychological Assessment, 26, 1–2.
Schyns, B. and von Collani, G. (2002), ‘A new occupational self-efficacy scale and its relation to
personality constructs and organizational variables’, European Journal of Work and Organizational
Psychology, 11, 219–41.
Spector, P. E. (2006), ‘Method variance in organizational research: truth or urban legend?’,
Organizational Research Methods, 9, 221–32.
Stoughton, J. W., Gissel, A., Clark, A. P. and Whelan, T. J. (2011), ‘Measurement invariance in
training evaluation: old question, new context’, Computers in Human Behavior, 27, 2005–10.
Tracey, J. B., Hinkin, T. R., Tannenbaum, S. and Mathieu, J. E. (2001), ‘The influence of individual
characteristics and the work environment on varying levels of training outcomes’, Human
Resource Development Quarterly, 12, 5–23.
Wang, G. G. and Wilcox, D. (2006), ‘Training evaluation: knowing more than is practiced’,
Advances in Developing Human Resources, 8, 528–39.
