
PERSONNEL PSYCHOLOGY

2010, 63, 999–1039

PREDICTING LONG-TERM FIREFIGHTER
PERFORMANCE FROM COGNITIVE AND PHYSICAL
ABILITY MEASURES
NORMAN D. HENDERSON
Department of Psychology
Oberlin College

Firefighters from 1 academy training class were observed for 23 years,
beginning with their selection test consisting of a g-saturated written
exam (GCA) and firefighting simulations loaded on a strength/endurance
(SE) factor. Operational validity coefficients for both GCA and SE were
high for training success and remained consistently high for job per-
formance ratings throughout the study. The operational validity for
combined GCA and SE predictors was .86 for a composite job rat-
ing measure covering 21 years of service. A structural model produced
similar results for more broadly defined GCA and SE latent variables.
Both analyses suggested approximately equal weighting for GCA and
SE for a fire service selection test. Results indicate considerable latitude
in choosing cognitive and physical predictors for firefighter screening if
the predictors are highly loaded on GCA and SE.

In the first volume of the Journal of Applied Psychology, Lewis
Terman (1917) reported “intelligence examination” scores and “peda-
gogical examination” scores of candidates for positions in the San Jose
Fire Department. Scores on the examinations were combined and rank or-
dered for use, along with a test of strength and endurance, an assessment
of “personal and moral qualities,” and a medical examination, for hiring.
Terman did not follow those hired to determine the predictive validity of
any measures but indicated the desirability of correlating the results of
mental tests with later success of the accepted candidates.
The first firefighter selection paper to do this appeared in that journal
34 years later. Wolff and North (1951) reported a correlation of .30 (N =
144) between individual supervisor rankings of firefighters and a selec-
tion test based on general knowledge, math, and mechanical principles.
No other papers had appeared in the journal when Campbell (1982), as
outgoing editor, lamented the absence of research submitted on police
and firefighter selection. The Wolff and North paper remains the only

Correspondence and requests for reprints should be addressed to Norman D. Henderson,
Department of Psychology, Severance Laboratory, Oberlin College, Oberlin, OH 44074,
USA; nhenders@oberlin.edu.

© 2010 Wiley Periodicals, Inc.


fire service validity study of cognitive variables published in a major
industrial-organizational (I-O) journal. Although this dearth may in part
be due to the general decline in publication of validation studies, it also
appears to be a result of test developers’ increasing avoidance of criterion-
related validation strategies for firefighter screening tests. Approximately
3,500 career fire departments in the USA use screening procedures to
assess one or more of the cognitive, physical, medical, and biodata do-
mains mentioned by Terman. Despite this, an aggressive attempt to locate
all criterion-related firefighter validation studies completed between 1958
and 1998 yielded only 24 reports of cognitive-based firefighter selec-
tion tests (Barrett, Polomsky, & McDaniel, 1999). All were unpublished
reports involving pre-1986 data. Available criterion-based validation re-
search on physical abilities from I-O psychology is even more limited,
although some relevant studies come from occupational health/medicine
and exercise physiology. Undeniably, the “file drawer problem” is pro-
nounced with respect to locating and obtaining criterion-based validation
studies of firefighter selection. The research is rarely published and test
developers are often reluctant or unable to share technical reports.
Avoidance of criterion-based validation strategies is understandable
because predictor range restriction, unreliable criterion variables, incom-
plete or outlier data, and uncooperative incumbents can contribute to
sizeable attenuation of true validity coefficients (e.g., Hunter, Schmidt, &
Le, 2006; Wilcox, 2001). In addition, firefighting depends heavily on two
largely uncorrelated abilities, limiting the variance that can be explained
by each. Because cognitive and physical predictors must be validated, sta-
tistical power for each predictor must be sufficient to maintain adequate
joint power. Even when technically feasible, criterion-based validation
can be costly, and test security usually prevents local validation prior to
test administration. Content-based validation avoids these problems and
job-sample tests are simpler to understand and perceived to be more job
related and fair by applicants and courts than general tests of abilities
(e.g., Ryan, Greguras, & Ployhart, 1996).
Content-based tests and their validation are not without problems in
fire service selection. The intuitive nature of job-content tests can spawn
“instant experts” within municipal administrations and in the courts. As-
sessing the fidelity of the test content to actual work behaviors can become
particularly problematic. Safety and training considerations and practical
limitations make it difficult to maintain high job fidelity in simulated
physical tasks. Some simulations may end up differing from their ac-
tual job counterparts in their underlying physical demands, and in some
task simulations, technique can account for a large portion of observed
score variance. Although concepts and operations related to firefighting
can be incorporated into cognitive selection tests, these tests primarily
emulate training activities rather than on-the-job knowledge-based decision
making under stress. Content-based validity studies are also vulnerable
to challenges to the weights assigned to various test components
because there is no explicit quantitative relationship between component
weights and job analysis importance ratings. An ability rated highly im-
portant in a job analysis will account for a trivial portion of test score
variance if most applicants possess the ability at an adequate level, for
example. For a stark picture of the vulnerabilities of using a content-based
validation strategy to demonstrate the job relatedness of g-loaded fire-
fighter screening tests, see U. S. v. The City of New York (2009). In that
situation, multiple opportunities to employ very large sample criterion-
and construct-based validation strategies to support the use of the tests
were apparently passed up.
From a broader I-O perspective, the development and use of content-
based fire service selection tests without a criterion-based validation
follow-up is troubling. The paucity of accessible criterion-related re-
search on a critical and dangerous occupation, where many local selec-
tion procedures are developed and often litigated, has significant public
policy implications. Criterion-based studies provide a quantitative frame-
work for personnel and policy decisions that cannot be derived from
content-based studies. Operational validity coefficients that have been ad-
justed for two statistical artifacts—range restriction and criterion measure
unreliability—play a central role in assessing the utility and subsequently
the costs and benefits of alternative selection strategies. Operational coef-
ficients also provide the quantitative basis for component weights in the
multidimensional selection procedures used in firefighter selection.
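To make the two artifact adjustments concrete, the sketch below applies the standard disattenuation formula for criterion unreliability followed by the Thorndike Case II range-restriction correction. The numeric inputs are hypothetical stand-ins, not values from this study.

```python
from math import sqrt

def operational_validity(r_obs, r_yy, u):
    """Correct an observed validity coefficient for criterion unreliability
    and direct (Thorndike Case II) range restriction.

    r_obs : observed predictor-criterion correlation in the selected sample
    r_yy  : criterion reliability
    u     : SD(restricted) / SD(unrestricted) for the predictor
    """
    # Step 1: disattenuate for criterion measurement error
    r_c = r_obs / sqrt(r_yy)
    # Step 2: Case II correction for range restriction on the predictor
    U = 1.0 / u
    return (U * r_c) / sqrt(1 + (U**2 - 1) * r_c**2)

# Hypothetical inputs: r_obs = .30, r_yy = .60, u = .70
print(round(operational_validity(0.30, 0.60, 0.70), 3))  # → 0.515
```

Note how a modest observed coefficient of .30 rises to roughly .51 once both artifacts are removed, which is why uncorrected coefficients understate operational validity.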
The evolution of our understanding of the attenuating effects of range
restriction, criterion unreliability, and imperfect construct validity on ob-
served validity coefficients (e.g., Hunter & Schmidt, 2004; Mendoza,
Bard, Mumford, & Ang, 2004; Mendoza & Mumford, 1987; Sackett &
Yang, 2000; Stauffer & Mendoza, 2001) and the development of optimal
procedures for correcting for these artifacts (Hunter et al., 2006) has led to
the realization that artifact-corrected operational validity coefficients for
general cognitive ability (GCA) predictors are quite high for both training
and work performance for skilled and medium complexity jobs (Hunter
et al., 2006; Schmidt, Shaffer, & Oh, 2008).
A similar situation is emerging with respect to strength-related phys-
ical abilities in job performance and reduced injuries in physically de-
manding occupations (e.g., Arvey, Nutting, & Landon, 1992; Beaton,
Murphy, Salazar, & Johnson, 2002; Blakley, Quinones, Crawford, & Jago,
1994; Harbin & Olson, 2005; Henderson, Berry, & Matic, 2007; Hoffman,
1999; Hogan, 1991a, 1991b; Jackson, 1994; Sothmann, Gebhardt, Baker,
Kastello, & Sheppard, 2004; Stevenson, Weber, Smith, Dumas, & Albert,
2001). Henderson et al. (2007) summarized this and related research,
which showed high validity coefficients for strength-related measures
in predicting performance ratings for many physically demanding jobs.
Even after corrections for unreliability, validity coefficients for predicting
performance on objectively measured work sample tasks were generally
higher than those for job performance ratings.
In large city fire service jobs, both the technical training demands and
on-the-job fire/rescue decision making requirements provide strong theo-
retical reasons to expect heavily g-loaded selection tests to have high oper-
ational validity throughout a firefighter’s career. Fire academy training and
job success depend on the ability to master a breadth of knowledge and
understand some essential engineering, science, and medical principles
that can be applied in widely diverse fire and rescue situations. Achieve-
ment scores based on a wide breadth of subject matter correlate highly
with intelligence tests (e.g., Cronbach, 1975; Lubinski, 2004). The Barrett
et al. (1999) meta-analysis of cognitive predictors of firefighting success
supports the expectation of their high validity, finding artifact-corrected
operational validity coefficients of approximately .77 for training success
and .49 for job performance. We argue further that these coefficients are
underestimates.
In a parallel manner, the physical demands required of firefighters,
along with the high correlations typically observed among various strength
and endurance measures and firefighting/rescue task simulations (e.g.,
Blakley et al., 1994; Davis, Dotson, & Santa Maria, 1982; Henderson
et al., 2007; Rhea, Alvar, & Gray, 2004; Sothmann et al., 2004), provide
abundant evidence that physical screening tests heavily loaded on
strength/endurance have high operational validity for selecting firefighters.
There also appears to be a general physical abilities job performance fac-
tor in firefighting that is predictable from other strength and endurance
measures (Henderson et al., 2007). The substantial validities of both
g-loaded cognitive and strength/endurance-loaded physical tests imply
that firefighter screening procedures that jointly assess these two largely
independent abilities should have uncommonly high operational validities.
Because selection ratios tend to be low in most municipal fire departments,
the utility of these screening procedures is thus likely to be exceptionally
high.
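The utility argument can be quantified with a Naylor–Shine style calculation: under top-down selection, the expected standardized job-performance score of those hired equals the validity coefficient times the normal ordinate at the cut score divided by the selection ratio. The sketch below uses the .86 composite validity reported later in this article together with a hypothetical selection ratio.

```python
from statistics import NormalDist

def expected_criterion_gain(validity, selection_ratio):
    """Naylor-Shine style estimate: mean standardized criterion score of
    those hired under top-down selection = r * phi(z_cut) / SR, where
    z_cut is the predictor cut score implied by the selection ratio."""
    nd = NormalDist()
    z_cut = nd.inv_cdf(1 - selection_ratio)   # cut score in z units
    ordinate = nd.pdf(z_cut)                  # normal density at the cut
    return validity * ordinate / selection_ratio

# Composite validity .86 (from this article); selection ratio .10 is hypothetical
print(round(expected_criterion_gain(0.86, 0.10), 2))  # → 1.51
```

With a 10% selection ratio, hires would average about 1.5 SDs above the applicant-pool mean on the criterion, which illustrates why low selection ratios make valid screening so valuable.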
Although the evidence for the validity of largely g-based cognitive and
strength/endurance-based physical tests is strong in both cases, the valida-
tion of these two predictors separately (or doing a meta-analysis of just one
of these predictors) is problematic when selection is based on a weighted
composite of both cognitive and physical measures. Such composite tests
require validation procedures that jointly assess the two ability domains,
accounting for their correlation and possibly different degrees of range
restriction in the validation samples. A multivariate predictor analysis is
also necessary to determine optimal component weights and unbiased
validity coefficients for the composite test and its separate components.
When hiring is based on a composite score, validation of the Cognitive or
the Physical test alone will often produce validity coefficients that are bi-
ased downward because of a compensatory variable effect—the tendency
for selected incumbents with low scores on one predictor to have relatively
high scores on the second valid predictor. The effect is strongest when
selection ratios are low and joint selection is based on two low-correlated,
high validity predictors—the prototypical situation in large city firefighter
selection. The result is a negative correlation between the predictors in the
selected study sample (e.g., Sackett, Lievens, Berry, & Landers, 2007).
Because the downward bias effect is absent in two-step selection when
pass/fail is used on the predictor not being validated, a collection of in-
dividual fire service validity coefficients will usually contain a mix of
downward-biased and nonbiased estimates.
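The compensatory variable effect is easy to demonstrate by simulation. The sketch below draws two uncorrelated predictors, hires the top fraction of an applicant pool on an equal-weight composite, and computes the predictor intercorrelation among those hired; all parameter values are hypothetical.

```python
import random

random.seed(0)

def selected_predictor_correlation(n_applicants=20000, sr=0.10, w=(0.5, 0.5)):
    """Simulate top-down hiring on a weighted composite of two uncorrelated
    standard-normal predictors and return the predictor intercorrelation
    among those hired (the compensatory variable effect)."""
    pool = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n_applicants)]
    pool.sort(key=lambda xy: -(w[0] * xy[0] + w[1] * xy[1]))
    hired = pool[: int(sr * n_applicants)]
    n = len(hired)
    mx = sum(x for x, _ in hired) / n
    my = sum(y for _, y in hired) / n
    sxy = sum((x - mx) * (y - my) for x, y in hired)
    sx = sum((x - mx) ** 2 for x, _ in hired) ** 0.5
    sy = sum((y - my) ** 2 for _, y in hired) ** 0.5
    return sxy / (sx * sy)

# Strongly negative despite zero correlation in the applicant pool
print(round(selected_predictor_correlation(), 2))
```

With a 10% selection ratio the selected-sample correlation is near −.7, showing how composite selection alone manufactures a negative predictor intercorrelation among incumbents.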
Legal challenges to a cognitive or a physical component of a fire-
fighter selection test may also extend to issues beyond the validity of
that specific test component. Disputes about the relative weighting of the
cognitive and physical components are likely, including possible claims
that the importance of these two abilities changes with age, experience,
and the evolution of equipment. It has also been argued that, because
on-the-job firefighter decisions and actions are often made under extreme
stress, they require different capabilities from those assessed in selection
and training (e.g., Luke v. City of Cleveland, 2006; U. S. v. The City of
New York, 2009). The degree to which these and related claims are true,
and that they undermine test validity, is best determined with a multivari-
ate cognitive-physical predictive validation strategy that extends well into
a firefighter’s job tenure. Such studies appear to be effectively nonexistent
for fire service.
Although adjusted operational validities are likely to be high for both
cognitive and physical predictors of firefighter performance, one must
first identify significant observed correlations or beta coefficients from
the data before proceeding to estimate operational validity. Often, how-
ever, screening test data for safety forces involves severe incumbent range
restriction and compensatory variable effects, drastically shrinking in-
dividual observed validity coefficients. In those opportunistic situations
where severe range restriction is avoided because of high selection ra-
tios or from departures from strict top-down hiring based on test scores,
criterion-related validation becomes a viable strategy without requiring
prohibitively large samples.
In 1985, we took advantage of one such hiring situation and describe here
a 23-year longitudinal validation study of a single cohort of firefighters
who took the same entrance test and completed the same training academy
class. The design allowed us to estimate joint operational validities for
cognitive and physical test components and examine the career-long sta-
bility of the estimates for our incumbent sample. We were able to test for
predictor × predictor and predictor × tenure interactions relevant to issues
discussed earlier. Using additional measures obtained during training, we
also describe a post hoc procedure to determine what modifications of
the original test might result in increased validity and compare validi-
ties of physical firefighter simulations with direct assessments of strength
and endurance. Finally, we extend the case for the construct validity of
GCA and strength/endurance as factors underlying firefighter training and
career-long job performance.

Method

Participants

The study group consisted of the entire class of 64 men and
10 women who entered the fire-training academy of a large munici-
pal fire department in the United States in 1985. The group included
45 White and 29 Black or Hispanic cadets, all of whom successfully
completed academy training. Age at hire ranged from 20 to 36 years
(M = 27.3). Because of some prior consent arrangements designed to
ensure ethnic and gender diversity in the class, variance in both cognitive
and physical selection test scores was larger than that usually encountered
in this department's academy classes in that period. For the 1985 academy
class, the cognitive selection test score had an SD of 4.51, compared
to the applicant pool SD of 7.50. The physical abilities time score
SDs were 26.0 and 37.9 for the academy sample and the applicant pool,
respectively.
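These SDs determine the range-restriction ratios (u = restricted SD / unrestricted SD) that enter the corrections reported later; the arithmetic is simply:

```python
# Range-restriction ratios from the SDs reported above
u_cognitive = 4.51 / 7.50   # written exam: academy sample vs. applicant pool
u_physical = 26.0 / 37.9    # physical timed events: academy vs. applicant pool
print(round(u_cognitive, 2), round(u_physical, 2))
```

Both ratios are well below 1, but far less severe than under strict top-down hiring, which is what made criterion-related validation feasible for this cohort.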

Measures

Original Selection Test (Spring 1983)

The cognitive abilities component of the test had 120 questions in six
sections: (a) recall of study material from fire training manuals, includ-
ing diagrams; (b) reading comprehension based on technical materials
from fire service manuals, including mechanical diagrams and a graph;
(c) following a series of commands to navigate through a 5 × 5 letter grid;
(d) performing computations using simple formulas related to firefighting;
(e) drawing conclusions from brief written statements; (f) identifying a
set of numbers, letters, or symbols that differed from the remaining sets.
Using the nomenclature of Carroll (1993), the six test sections reflected the
following first-order cognitive factors: (1) Associative Memory, Mean-
ingful Memory, and Visual Memory; (2) Reading Comprehension, Vi-
sualization, and Mechanical Knowledge; (3) Integrative Processes and
Sequential Reasoning; (4) Numerical Facility and Quantitative Rea-
soning; (5) Sequential Reasoning and Reading Comprehension; (6)
Induction.
KR-20 reliability coefficients of the test sections ranged from .65 to
.84 with a full-scale reliability of .93 (N = 2,157). Test section loadings
ranged from .72 to .81 on the first principal factor obtained from the scores.
The written portion of the firefighter entry exam was a highly g-saturated
selection test, reflecting what is commonly referred to as GCA in the
psychometric literature (e.g., Carroll, 1993) and sometimes referred to as
general mental ability (GMA) in the job selection literature (e.g., Schmidt
& Hunter, 2004). GCA is used in this report to refer to both terms and
the term intelligence, as defined by Cleary, Humphreys, Kendrick, and
Wesman (1975).
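The KR-20 coefficients reported above come from the standard Kuder–Richardson Formula 20 for dichotomously scored items; a minimal implementation, with a small made-up response matrix for illustration, is:

```python
def kr20(item_responses):
    """Kuder-Richardson Formula 20 reliability for dichotomous items.
    item_responses: list of per-examinee lists of 0/1 item scores."""
    n = len(item_responses)
    k = len(item_responses[0])
    totals = [sum(person) for person in item_responses]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # total-score variance
    pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in item_responses) / n  # item difficulty
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

# Hypothetical 4-examinee, 3-item data set
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(responses), 2))  # → 0.75
```
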
The physical abilities component of the selection test consisted of two
timed events. The first event simulated a fire scene arrival. Candidates were
fitted with the department’s self-contained breathing apparatus (SCBA)
tanks, without the mask and breathing tube. The applicant had to drag
two lengths of 6.4 cm hose a total of 55 m, 27.5 m in one direction,
drop the coupling, run to the other end of the hose, pick it up, and return; run
22 m to a pumper apparatus ladder rack, remove the 3.7 m one-person
straight ladder, weighing 15.9 kg, from the rack; carry the ladder into
the adjacent fire training tower, placing it against the back wall of the
first landing; continue up the stairwell to the fifth floor (total climb =
15.2 m); return to the ladder, retrieve it, and replace it on the pumper
rack. Candidates were briefed prior to the event and observed at least one
individual complete the event prior to testing. The applicant pool median
was 106.6 sec. After a 10 minute rest, candidates undertook Event 2, a
simulated rescue evolution, still wearing the SCBA tank. The evolution
consisted of dragging a 45.4 kg sack by a handle a total of 21.3 m, which
included 12 m of low headroom, narrow space (1 m high, 1.2 m wide).
The median completion time was 38.8 seconds. The total time in seconds required
to complete the two timed events was used as the measure of entry-level
physical abilities test performance. In most critical fire-suppression and
rescue operations, the probability of success decreases with increasing
task completion time. Time required to execute fire ground tasks is the
widely accepted performance measure in fire service (e.g., Clark, 1973;
Cortez, 2001).

Fire Academy Measures (Spring–Summer 1985, N = 74)

Examination grade average: weighted average of five written course
examinations (66.7%) and the final course examination (33.3%), all cre-
ated by fire academy instructors.
Emergency medical technician (EMT) state examination: average
score on the two EMT exams, developed by the state and administered
under high security conditions.
Critical skill deficiencies in EMT and/or SCBA: instructor warnings
about critical deficiencies in EMT technique or in the use of SCBA. Score
was 0 (no deficiencies) to 2 (deficiencies in both areas).
Mean instructor rating of physical and practical skills (PPS): compos-
ite rating of the ability to handle and use firefighting and rescue equipment,
excluding ladders (3–4 raters).
Mean instructor rating of handling ground ladders: rating of the ability
to carry, raise and properly use ground ladders of various sizes used in the
department (2–3 raters).
Academy fitness assessment. Time to complete a 1 mile (1.61 km) run
was used to assess aerobic fitness of cadets and a bench press weight lift
was used to assess upper body strength. In the bench press, each cadet
performed a single one repetition maximum (1-RM) lift to estimate maximum
weight lifted (kg), using a Universal exercise machine. A 136 kg ceiling
was set for the 1-RM measure to reduce the influence of two high weight
outlier scores. Number of continuous push-ups completed was used to
assess upper body muscular endurance relative to body weight.

Physical Abilities Retest (Summer 1985)

During fire academy training the two timed events of the original selec-
tion test were rerun in an identical manner in 1985 with the cadets. After
a 15 minute rest, the dummy drag event was repeated to obtain a within-
session test–retest reliability coefficient (r_xx = .83). After a 15 minute
rest, cadets were then required to complete as many lifts (reps) as possible
with a 15 kg barbell in 90 seconds. Lifts were done standing, using only
upper body strength. The task was repeated after another 15 minute rest to
obtain a within-session reliability coefficient (r_xx = .61). The total number
of lifts summed across both sessions, with a ceiling total of 120, was
used as the bar lift measure (Spearman–Brown corrected r_xx = .76). The
test was designed to measure upper body muscular endurance, using a
relatively light weight with a large number of reps.
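The Spearman–Brown step above reproduces exactly: doubling test length (summing the two sessions) projects the single-session reliability of .61 to .76.

```python
def spearman_brown(r_half, factor=2):
    """Spearman-Brown prophecy formula: reliability of a test lengthened
    by `factor`, given the reliability at the original length."""
    return factor * r_half / (1 + (factor - 1) * r_half)

# Summing the two bar-lift sessions doubles test length: .61 -> .76
print(round(spearman_brown(0.61), 2))  # → 0.76
```
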

Officer Mean Global Ratings: T-scores (Fall 1992, N = 74)

The decision to use global ratings in this study was based on prior
research in the department, in which 315 firefighters were evaluated on
several types of criterion measures, following a job analysis (Henderson,
1985). Most of the variance in a 16-dimension firefighter job rating scale
derived by Bownas and Heckman (1976) was accounted for by a single factor.
The reliabilities of individual rating dimensions and the full-scale average
rating were below those obtained with more global evaluations. Variance
in nine scales assessing knowledge, skills, abilities, and other traits related
to firefighting was largely accounted for by a cognitive (knowledge and
judgment) and a physical (strength and endurance) factor. These two scales
were subsequently adopted for the 1992 rating study, along with a work
output measure described below. An added benefit of the shorter global
ratings procedure was greater supervisor cooperation and higher return
rates.
All job performance ratings were designed to assess the rater's cumulative
impression of the incumbent's work behavior, not just recent behavior.
In the 1992 and all later job ratings, all 74 incumbents were
rated on each occasion, including firefighters who had left the department
prior to the time of the rating. The five firefighters in the study sample
who had left by 1992 had 2–4 (M = 3.2) years of postacademy time on
the job, each observed by multiple supervisors. By the 2006 senior officer
ratings, 14 firefighters had left the department for a variety of reasons,
including injuries. This group had a mean of 9.5 years of active job duty
before leaving. In all ratings, this study group represented only a subgroup
of a larger group of approximate age cohorts who were being evaluated at
that time.
A five-point rating scale was used for the three global 1992 ratings
described below. The 83 participating supervisors were instructed to rate
only firefighters with whom they had worked sufficiently to rate with
confidence.
Supervising officers usually rated a large number of firefighters (M = 60.2,
approximately 11% of total list). We eliminated between-rater differences
in means and SD by converting each supervisor’s raw ratings on the three
scales (physical, knowledge, work output) into within-rater T-scores, with
M = 50 and SD = 10. The standardized T-score ratings were used to
compute average ratings for each firefighter. Mean T-score ratings for the
study sample of 74 were very close to total sample T-score means of 50
for each scale (48.7 to 50.4). Firefighters in the study sample received an
average of 10.4 supervisor ratings for each scale. Mean supervisor T-score
ratings were used as criterion measures for each participant.
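The within-rater T-score conversion described above amounts to standardizing each supervisor's raw ratings to M = 50, SD = 10 before averaging across raters; a minimal sketch with one hypothetical rater's 1–5 ratings:

```python
def to_t_scores(raw):
    """Standardize one rater's raw ratings to T-scores (M = 50, SD = 10),
    removing between-rater differences in leniency (mean) and spread (SD)."""
    n = len(raw)
    m = sum(raw) / n
    sd = (sum((x - m) ** 2 for x in raw) / n) ** 0.5
    return [50 + 10 * (x - m) / sd for x in raw]

# One hypothetical rater's five ratings on the 1-5 scale
print([round(t, 1) for t in to_t_scores([2, 3, 3, 4, 5])])
# → [36.3, 46.1, 46.1, 55.9, 65.7]
```

Because every rater's transformed ratings have the same mean and SD, a lenient and a harsh rater contribute equally to each firefighter's average T-score.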
Between-rater reliability coefficients were computed for both the full
sample and the study group, using the ANOVA methods described in
Winer, Brown, and Michels (1991, Appendix E). Reliability coefficients
for mean ratings were nearly identical for full and study sample groups
for knowledge and work output measures; reliability for the strength and
endurance rating was .91 for the selected sample versus .85 for the full
sample, reflecting the slightly greater variability in the selected sample.
Correlations among the ratings suggested the presence of a substantial
halo effect. The correlation between the knowledge/judgment rating and
the strength/endurance rating was .57, significantly greater than the cor-
relation of .16 between our incumbent sample’s Cognitive and Physical
selection exam scores (z = 4.87, P < .0001). Validity results are included
for the three separate ratings, as well as the composite, to show the effect
of the halo bias on corrected validity coefficients.
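A comparison of two correlations of this kind is commonly made with Fisher's r-to-z transformation. The sketch below shows the independent-samples version of that test; the sample sizes are hypothetical and the article's exact test (given its z = 4.87) evidently used different inputs or a different variant.

```python
from math import atanh, sqrt

def fisher_z_diff(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two correlations from
    independent samples. A simplified sketch; the article's exact test
    statistic may have been computed differently."""
    z1, z2 = atanh(r1), atanh(r2)          # r-to-z transforms
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # SE of the difference
    return (z1 - z2) / se

# .57 (halo-inflated rating correlation) vs. .16 (selection-score correlation),
# with hypothetical sample sizes
print(round(fisher_z_diff(0.57, 400, 0.16, 74), 2))
```
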
Knowledge and judgment required to follow orders, communicate effectively
on the fire ground, and make correct decisions while engaged
in firefighting and rescue procedures.
Physical strength and endurance required in performing rescue operations,
forcible opening of structures and ventilation, ladder operations, salvage
and overhaul, and other physically demanding tasks.
Work output was an estimate of the productivity and effectiveness of
an incumbent using 100 as the benchmark score for an average firefighter.
A firefighter thought to do about 20% more work (or to be 20% more effective
or productive) than the average firefighter would receive a rating of 120
and a firefighter thought to do approximately two thirds of the work of
an average firefighter would receive a rating of 67. Examples were given,
with instructions to produce an average rating of approximately 100 and
to avoid extreme ratings. Individual ratings, standardized within raters,
were averaged for each study subject.
Composite global rating was calculated by summing z-scores of the
strength/endurance, knowledge/judgment, and work output ratings and
computing a T-score of the composite sum.

“Elite Firefighter Squad” Nominations (Winter 2001)

Supervisor Nominations. A total of 32 supervising officers each nom-
inated 10 out of 136 firefighters hired between 1984 and 1986 that they
would choose for an “elite firefighting and rescue unit.” Firefighters in
active service are typically observed by two to four supervising offi-
cers at each substantive fire or rescue scene they attend, and gradual
rotation in location and shift assignments rematches supervisors with
firefighters. Familiarity with individual active members of the study
group was thus assumed to be approximately constant after 15 years
of observation. The list included the names of firefighters who had left
the department, several of whom received one or more nominations.
Because this form of nomination rating produces Poisson-like score
distributions with positive skew in number of nominations received, the square
root of the total number of nominations was used in analysis to eliminate
this scaling effect. The between-rater reliability of the total nominations
measure was computed using the ANOVA methods described for dichoto-
mous variables in Winer et al. (1991, Appendix E). This same procedure
was used for calculating reliability coefficients for peer nominations and
senior officer outstanding career nominations described below.
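The rationale for the square-root transform is that it roughly stabilizes the variance of Poisson-like counts and reduces their positive skew. A small illustration with hypothetical nomination counts:

```python
from math import sqrt

def skewness(xs):
    """Skewness of a list of numbers (population moment formula)."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / n

# Hypothetical Poisson-like nomination counts (many zeros, long right tail)
noms = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 6, 9]
transformed = [sqrt(x) for x in noms]

# Skew drops substantially after the square-root transform
print(round(skewness(noms), 2), round(skewness(transformed), 2))
```
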
Peer nominations. A sample of 24 age-cohort firefighters served as
peer group raters who each nominated 10 firefighters in the same manner
as was done by supervisors. Scoring and analysis were the same as those
used for supervisor nominations.

Involvement in Fire-Suppression and Rescue Service (1987–2006)

Total months in active fire suppression and rescue. The number of
months of active fire-suppression service was obtained for each fire-
fighter from department duty assignment records for January 1, 1987
to April 1, 2006. Firefighters who had left the department or requested
assignments in nonfire-suppression positions (e.g., clerical, supply room,
public relations, etc.) accumulated fewer active fire-suppression months
over the study period. An exception was made when nonfirefighting as-
signments were the necessary result of injury (limited duty during recov-
ery period) or a promotion to a higher rank without fire-suppression job
duties.
Total number of fire and rescue runs during the study period. This
variable reflects both total months of service in fire suppression and the
activity levels of the fire companies worked in over the period—an estimate
of total fire-suppression work output of each firefighter between 1987
and 2006. Firefighters can request station house assignments, which are
filled on a seniority basis when openings occur. Average monthly fire-
suppression activity for each fire company was estimated from department
records. Based on their monthly assignments throughout this period, the
number of fire and rescue runs was calculated for each firefighter in the
study sample.

“Outstanding Career” Nominations From Senior Officers (Spring 2006)

Eight senior officers independently reviewed a list of 338 firefight-
ers, including our study sample, who had served 20–25 years and nomi-
nated those that they regarded as having “highly successful to outstanding
careers” based on demonstrated career-long ability to perform the most
demanding and important fire-suppression and rescue tasks. Each was
asked to nominate approximately the top third of the firefighters whose
performance they knew well enough to evaluate. The square root of total
nominations was used for the same reasons described earlier for “Elite
Squad” nominations.

First Principal Component, Four Ratings (1992–2006)

The first principal component (PC) factor score was extracted from
the 1992 composite global rating, the 2001 “Elite Firefighter Squad”
supervisor nominations, the 2001 “Elite Firefighter Squad” peer nominations,
and the 2006 senior officer “Outstanding Career” nominations for the study
sample. This PC factor score represents common performance variance
based on assessments obtained by different methods and rater groups
obtained over nearly the full careers of the incumbents. It also provides a
parallel measure to the performance ratings latent trait used in structural
modeling.
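Extracting a first PC score from a handful of standardized ratings can be sketched in a few lines; the version below uses power iteration on the correlation matrix and is only an illustration, not the software used in the study.

```python
def first_pc_scores(data):
    """First principal component scores for rows of `data` (each row one
    ratee, each column one rating measure). Pure-Python power iteration
    on the correlation matrix of standardized columns."""
    n, k = len(data), len(data[0])
    # standardize each column to mean 0, SD 1
    cols = list(zip(*data))
    zcols = []
    for col in cols:
        m = sum(col) / n
        sd = (sum((x - m) ** 2 for x in col) / n) ** 0.5
        zcols.append([(x - m) / sd for x in col])
    z = list(zip(*zcols))
    # correlation matrix of the standardized columns
    R = [[sum(zcols[i][t] * zcols[j][t] for t in range(n)) / n
          for j in range(k)] for i in range(k)]
    # power iteration for the leading eigenvector (PC loadings direction)
    v = [1.0] * k
    for _ in range(200):
        w = [sum(R[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # PC score = weighted sum of standardized ratings
    return [sum(row[j] * v[j] for j in range(k)) for row in z]

# Hypothetical data: two perfectly correlated rating columns
print([round(s, 2) for s in first_pc_scores([[1, 2], [2, 3], [3, 4], [4, 5]])])
```
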

Selection Examination Procedures and Applicant Pool Characteristics

The selection exam was given in two steps. Following completion of
the written component of the exam, the highest scoring 86% of test takers
were scheduled for the physical abilities component. Of the invited group,
88.3% of men and 84.4% of women took the physical test. Of the 1,300
applicants taking both components of the selection exam, 1,175 (90.4%)
were male and 125 were female. Six percent of the applicants (2.6% of
men; 62.4% of women; χ 2 (1) = 249.04, P < .0001, two-tail) failed to
complete the physical test and were thus not assigned time scores. For
adjustments for range restriction, the SD obtained from all candidates
taking the written exam, including those not invited to the physical, was
used. For the physical abilities component of the exam, the SD was based
on the time score of those who completed the exam, a slight underestimate
of applicant variability because most applicants failing to complete the
physical would have had long time scores if they had finished.

Results

Validity Coefficients of Cognitive and Physical Test Components

Table 1 shows observed (r XYi) and operational (r XPa) validity coefficients of the Cognitive and Physical test components for predicting the training and job criterion measures studied. Operational coefficients were computed using the correction for criterion unreliability and indirect range restriction described by Hunter, Schmidt, and Le (2006), hereafter designated as HSL coefficients.

TABLE 1
Correlations Between Selection Examination Scores and Subsequent Training and Job Performance Measures

                                                                     Cognitive test(a)            Physical test(b)
Criterion                                                  r YYi     r XYi(c)  r XPa(d)           r XYi(c)  r XPa(d)

Fire academy (Spring 1985)
  Examination grade average [5 tests + final exam]         .93t      .67†      .86 (.80, .90)     –         –
  EMT state examinations [2 tests, high security]          .84t      .74†      .92 (.86, .96)     .05       –
  Critical skill deficiencies in EMT &/or SCBA             .75i      .68†      .90 (.83, .96)     .00       –
  Mean instructor rating: physical and practical skills    .75i      .22       .40 (.03, .46)     .71†      .91 (.84, .96)
  Mean instructor rating: handling ground ladders          .69i      .19       .22 (.01, .44)     .58†      .84 (.70, .92)
  Six-month postacademy knowledge retest                   .81t      .39***    .65 (.41, .80)     –         –
  First PC, academy knowledge measures                     .94t      .76†      .91 (.86, .95)     .07       –
  First PC, academy physical measures                      .78i      .23*      .25 (.04, .45)     .72†      .90 (.83, .95)
Officer mean global ratings: T-scores (Sept. 1992)
  Firefighting knowledge and judgment                      .85i      .52†      .77 (.61, .87)     .13       –
  Physical strength and endurance                          .91i      .30**     .50 (.21, .70)     .61†      .80 (.68, .89)
  Work output                                              .85i      .49†      .74 (.57, .85)     .33**     .52 (.25, .70)
  Composite (K+S+W) global rating                          .94i      .47†      .69 (.51, .82)     .42†      .60 (.40, .75)
“Elite Firefighter Squad” (Jan. 2002)
  Number of officer nominations [sqrt]                     .86i      .30**     .52 (.23, .71)     .37***    .57 (.34, .73)
  Number of peer nominations [sqrt]                        .69i      .33**     .61 (.30, .79)     .23       .42 (.08, .67)
Fire suppression & rescue 1987–2006
  Total months in active S & R duty                        .99a      .34**     .53 (.27, .71)     .53†      .71 (.58, .81)
  Estimated number of fire and rescue runs                 .99a      .25*      .43 (.12, .65)     .36***    .53 (.30, .68)
“Outstanding Career” (May 2006)
  Total senior officer nominations [sqrt]                  .77i      .46†      .73 (.53, .86)     .30**     .50 (.21, .70)
First principal component 1992–2006 ratings
  1st PC: KSW, O nom., P nom., senior O nom.               .91i      .47†      .70 (.51, .82)     .38***    .57 (.35, .74)

(a) Points earned based on items correct.
(b) Time to complete hose/tower and dummy drag tasks. Signs are reversed on time measures so high scores signify better performance.
(c) Observed correlations in incumbent sample.
(d) Operational validity coefficients, corrected for indirect range restriction and criterion unreliability using the Hunter, Schmidt, and Le (2006) procedure. HSL corrected lower and upper values of the r XYi 90% CI are shown in parentheses. Corrected coefficients and CI involving negative values are omitted (see text).
Reliability coefficients (r YYi): t = test–retest; i = interrater; a = assumed to be near unity. N = 74 for all criterion measures (see text).
One-tail significance levels (N = 74): *P < .025; **P < .01; ***P < .001; †P < .0001.

The HSL subscript notation (i = incumbents; a = applicants; P = performance construct; T = predictor construct) is used throughout the paper, and r XPa is the operational validity coefficient. In parentheses are HSL corrected estimates of observed validity
coefficient (r XYi ) lower and upper 90% CI values, as suggested by Hunter
and Schmidt (1990, p. 131) for direct range restriction. In the case of indirect range restriction, corrected values are only approximate upper and
lower CI estimates of r XPa because they exclude sampling variances in
reliability estimates used in HSL corrections for indirect range restriction.
Because the HSL procedure incorporates the Thorndike Case II correction
for direct range restriction using the reliability corrected coefficients, it is
not appropriate for observed negative validity coefficients. Sign changes
in coefficients can occur with Thorndike’s Case III equation for indirect
range restriction but not for the Case II equation. Because selection is
based on two predictor variables, HSL tends to undercorrect for range
restriction on criteria where both predictors are valid.
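As a concrete sketch of the correction sequence just described, the following is a simplified rendering of the HSL steps for indirect range restriction (the input values below are illustrative, and the guard against a negative radicand is an implementation convenience, not part of the published procedure):

```python
import math

def hsl_operational_validity(r_xyi, r_yyi, r_xxa, u_x):
    """Approximate HSL (2006) operational validity under indirect range restriction.

    r_xyi : observed incumbent predictor-criterion correlation
    r_yyi : criterion reliability in the incumbent sample
    r_xxa : predictor reliability in the applicant pool
    u_x   : restricted/unrestricted predictor SD ratio (incumbent SD / applicant SD)
    """
    # 1. True-score SD ratio implied by the observed ratio and applicant reliability.
    u_t = math.sqrt(max(u_x**2 - (1.0 - r_xxa), 1e-12) / r_xxa)
    # 2. Incumbent predictor reliability implied by r_xxa and u_x.
    r_xxi = 1.0 - (1.0 - r_xxa) / u_x**2
    # 3. Fully disattenuated incumbent true-score correlation.
    r_tpi = r_xyi / math.sqrt(r_yyi * r_xxi)
    # 4. Thorndike Case II applied at the true-score level with u_t.
    r_tpa = r_tpi / math.sqrt(u_t**2 + r_tpi**2 * (1.0 - u_t**2))
    # 5. Reattenuate by applicant predictor reliability: the operational validity
    #    leaves the predictor unreliable, as it is in operational use.
    return r_tpa * math.sqrt(r_xxa)
```

Note that the correction grows as u_x shrinks (stronger restriction), and the function reduces to the observed coefficient when there is no restriction and both measures are perfectly reliable.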
Observed validity coefficients for training measures split cleanly along
cognitive and physical dimensions. Operational validities of the Cogni-
tive and Physical tests were .91 and .90 for their respective knowledge
and physical training criteria. Observed validity coefficients for the Cog-
nitive and Physical tests did not differ significantly for job performance
rating measures except for the 1992 firefighter knowledge/judgment and
the physical strength/endurance ratings, where a difference would be ex-
pected. The halo error reflected in the earlier mentioned correlation of
.57 between the 1992 knowledge and strength ratings is manifested in the
significant .30 observed correlation (r XPa = .50) between Cognitive test
scores and the 1992 strength/endurance ratings. The counterpart effect,
the observed r between Physical test score and the knowledge/judgment
rating, was .13 and not significant. Although the relative sizes of validity
coefficients for the Cognitive and Physical tests were consistent with the
criterion measure being validated, halo bias is probably slightly inflating
cross-ability correlations.
Cognitive and Physical tests each had significant validity coefficients
for all four postacademy global performance ratings from supervising
and senior officers (1992 work output, 1992 composite rating, 2002 Elite
Squad nominations, and 2006 Outstanding Career nominations). Because
there was a small positive correlation between Cognitive and Physical test
scores in the selected sample, beta weights are slightly smaller than ob-
served correlations in Table 1. As indicated earlier, raters were instructed
to consider all 74 incumbents in the study group at each rating period
even though some had already left the department. Validity coefficients in
Table 1 are based on these whole class estimates. The validity coefficients
were also calculated on survivor samples of incumbents that had been
active in the department within a year of the specific evaluation. Differences in full- and survivor-sample coefficients were minor. The largest
discrepancy among the 14 pairs of observed validity coefficients was .06
(full sample r = .46, still employed r = .52) for the Cognitive test predict-
ing 2006 outstanding career nominations from senior officers; the second
largest discrepancy was .04 (full sample r = .29, still employed r = .33)
for the Physical test predicting work output in 1992. All remaining pairs
of full- versus survivor-sample validity coefficients differed by less than
.025.
Observed correlations between the academy knowledge first PC and
supervisor job ratings in 1992, 2002, and 2006 were .31, .33, and .35,
respectively (all P < .003). Comparable observed correlations between
the academy physical first PC and supervisor job ratings were .55, .37,
and .40, respectively. Based on Thorndike’s Case III correction applied
to reliability corrected observed coefficients, the operational coefficient
between academy knowledge 1st PC and the 1st PC of all job ratings was
.62. The comparable coefficient for academy physical versus job ratings
1st PC was .60.
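The Case III adjustment used here has a standard closed form. A sketch follows (u is the unrestricted-to-restricted SD ratio of the selection variable z; the values in the checks below are illustrative, not the study's):

```python
import math

def thorndike_case3(r_xy, r_xz, r_yz, u):
    """Thorndike Case III correction for indirect range restriction.

    r_xy : observed x-y correlation in the restricted group
    r_xz, r_yz : restricted-group correlations of x and y with the
                 selection variable z
    u    : Sz / sz, unrestricted-to-restricted SD ratio of z
    """
    k = u**2 - 1.0
    num = r_xy + r_xz * r_yz * k
    den = math.sqrt((1.0 + r_xz**2 * k) * (1.0 + r_yz**2 * k))
    return num / den
```

The formula reduces to the observed correlation when there is no restriction (u = 1) or when neither variable correlates with the selection variable.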
Using the observed correlation of .11 between applicant Physical and
Cognitive test scores, we estimated the operational validity coefficient
for the equally weighted Cognitive-Physical selection examination and
the first PC of all 1992–2006 ratings to be .86 (approx. 90% CI: .79–
.90). Operational coefficients of the selection exam for the three separate
supervisor ratings were .86 (1992 global), .73 (2002 Elite Squad) and .84
(2006 Outstanding Career); and .84 for total months of active duty.
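The validity of an equally weighted two-predictor composite follows from the standard composite correlation formula. A sketch, under the assumption of a simple unit-weighted sum of standardized predictors (the numeric inputs are in the spirit of the text, not an exact reproduction of the paper's computation, which uses corrected quantities):

```python
import math

def composite_validity(r1y, r2y, r12):
    """Validity of an equally weighted sum of two standardized predictors.

    r1y, r2y : validities of the two predictors
    r12      : correlation between the predictors
    """
    return (r1y + r2y) / math.sqrt(2.0 * (1.0 + r12))

# Illustrative values: predictor validities near .70 and .57 with a
# small intercorrelation of .11 yield a composite validity greater
# than either predictor alone.
r_combined = composite_validity(0.70, 0.57, 0.11)
```

The composite exceeds either predictor alone whenever both are valid and only modestly intercorrelated, which is the situation described for the Cognitive and Physical components.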
When added to the selection exam score, instructor physical and prac-
tical skills (PPS) ratings were found to make small but significant incre-
mental predictions to all postacademy job supervisor ratings and to the
objective active duty measures (Mdn R2 increase = .064). The contribution
of PPS persists even when additional physical and cognitive measures are
added to the set of predictors. Substituting the 1985 physical exam retest
for the 1983 selection exam produced nearly identical results.
There is no evidence of nonlinear relationships between written or
physical exam scores and either fire academy performance or subsequent
job performance ratings (P > .3 in all cases). There is also no evidence of
interactive effects between the Cognitive and Physical selection test pre-
dictors for any criterion measure. Orthogonalized residual Cognitive ×
Physical interaction variables (Draper & Smith, 1981) were all nonsignif-
icant. The linearity of the relationships between cognitive and physical
abilities and training success and job performance ratings can be seen in
Figure 1. In the figure, the cognitive abilities selection test was used as the GCA assessment. For the assessment of physical strength and endurance, performance was based on the first unrotated principal component (1st UPC) score derived from six measures: performance on the 1983 physical abilities selection test, the 1985 retest, barbell reps, 1-rep maximum kg lift, push-ups, and the mile run. The 1st UPC score from all six academy measures was used as the training success criterion for validating GCA. The 1st UPC score obtained from the physical, practical, and ground ladder performance ratings was used as the physical training success criterion for validating physical strength and endurance.

Figure 1: Cognitive (GCA) and Physical (SE) Abilities as Predictors of Training Success and Career Performance for the Study Group.
[Figure omitted: four scatterplots. Top row, training criteria: Cognitive Abilities Selection Test T-score vs. Fire Academy Performance first PC score (r = .77), and Physical Measures first PC score vs. Academy Physical Task Ratings first PC score (r = .70). Bottom row, career criteria: the same two predictors vs. Job Performance Ratings 1992–2006 first PC score (r = .47 and r = .48, respectively).]
Note: Nonlinear components were nonsignificant in all cases.

Predictive Validity of Other Physical Ability Measures

The physical abilities component of the screening exam consisted of two discrete events: the fire scene arrival, which included a hose drag
and tower climb, and the rescue simulation. These two events were sep-
arately correlated with the criteria listed in Table 1 to determine the sim-
ilarity of their predictive profiles. Although the hose and tower event
had a significantly higher mean predictor-criterion correlation across all
cognitive and physical criteria than the dummy drag event (mean r XYi =
.41 vs. r XYi = .32), the profile of predictor-criterion correlations was nearly
identical (r = .97) across all criteria. The two Physical test events were
relatively short—applicant pool medians were 106.6 sec and 38.8 sec and
academy class medians were 94.0 sec and 31.0 sec. We examined how
the two events correlated with assessments of aerobic fitness (mile run
time), muscular strength (RM-1 weight lifted), and muscular endurance
(# barbell reps). Correlations were, respectively, .60, .60, and .65 for the
hose drag and tower climb event and .46, .54, and .62 for the dummy drag
rescue event. Correlations with the 1985 retest were nearly identical to
those obtained with the 1983 entry test.
We examined the possibility of obtaining incremental validity of the
physical component of the 1983 selection test by adding the measures of
upper body muscular strength (1-rep maximum kg), muscular endurance
(number of 15-kg bar lifts), and aerobic endurance (mile run) in predicting
each of the physically based criterion measures. Neither mile run nor 1-rep
maximum kg significantly increased validity for any criterion measure, but adding the 15-kg barbell lifts score did increase the validity coefficients for
supervisor rating criteria in all three rating years and for the 1st PC based
on the ratings (R2 increases from .04 to .08; P = .057 to .001). The barbell
lifts measure also increased the validity coefficient for predicting number
of fire and rescue runs (R2 increase = .07; P < .02), and firefighters with
greater muscular endurance during training tended to be located in more
active station houses (r = .30, P < .01).
Table 2 summarizes the predictive validities of measures of upper body
muscular endurance, upper body strength, aerobic fitness, and their equal
z-score weighted sum (SEA composite) for physically based training and
job criteria. Predictive validities of the SEA composite were generally
comparable to those shown in Table 1 for the physical ability screening
test across the entire 21-year assessment period. Validity coefficients of
muscular endurance scores were consistently higher than comparable co-
efficients for strength and aerobic endurance measures across the 21-year
period, although strength and aerobic endurance were also significant
predictors for most criterion measures. The observed validity coefficient
(r XYi ) between muscular endurance and the ratings 1st PC (.53) was signif-
icantly larger than the comparable strength and aerobic endurance validity
coefficients of .29 and .26, respectively.
The physical ability selection test and the postacademy readministration of the test were separated by 26 months. Test–retest reliability for the incumbent class is .85 (r XXa ≈ .92). A regression of retest time on initial test time showed a linear fit with a slope of 1.24 (P < .0001) and an intercept not significantly different from zero. The mean time increased 24% and the retest SD increased 45%, from 26.0 to 37.7. Despite the decline in physical retest scores, validity coefficients for the original and retest scores were very similar—the mean r XYi in 1983 and 1985 for the 18 physically based criterion measures was .60 and .61, respectively.

TABLE 2
Correlations Between Strength and Endurance Measures and Subsequent Physically Demanding Training and Job Performance Criteria

                                                 Muscular endurance(a)   Strength(b)         Aerobic endurance(c)   SEA composite(d)
Criterion                                        r XYi(e)  r XPa(f)      r XYi(e)  r XPa(f)  r XYi(e)  r XPa(f)     r XYi(e)  r XPa(f)

1985 fire academy evaluations
  Physical and practical skills                  .58†      .80           .48†      .64       .31**     .43          .61†      .83
  Handling ground ladders                        .46†      .69           .45†      .61       .21       .31          .50†      .73
1992 officer mean global ratings
  Physical strength and endurance                .68†      .84           .45†      .56       .30**     .38          .62†      .79
  Work output                                    .44†      .63           .22       .30       .24*      .31          .38***    .56
  Composite global rating                        .55†      .72           .30**     .38       .25*      .32          .47†      .64
“Elite Firefighter Squad” (Jan. 2002)
  Number of officer nominations [sqrt]           .50†      .69           .29**     .38       .28**     .37          .47†      .66
  Number of peer nominations [sqrt]              .32**     .53           .21       .32       .16       .24          .31**     .51
Fire suppression and rescue (1987–2006)
  Total months in active duty                    .53†      .69           .35***    .43       .36***    .44          .56†      .70
  Estimated number of fire and rescue runs       .49†      .65           .32**     .39       .41***    .49          .53†      .69
“Outstanding Career” (May 2006)
  Total senior officer nominations [sqrt]        .45†      .67           .21       .30       .19       .27          .37***    .57
First principal component 1992–2006 ratings
  1st PC: KSW, O nom., P nom., Sr. O nom.        .53†      .70           .29**     .37       .26*      .33          .47†      .65

(a) Number of repetitions with 15-kg barbell.
(b) Maximum weight lifted (1-rep maximum).
(c) Total time for mile run (sign reversed).
(d) Equally weighted z-scores (z barbell reps + z RM-1 weight − z mile sec). Observed correlations (r XYi) among predictors: r MeSt = .46; r MeAe = .36; r StAe = .21. Observed correlations with the 1983 Physical test were .68, .59, and .58, respectively. r XPa coefficients are approx. 50% larger.
(e) Observed correlations in incumbent sample.
(f) Operational validity coefficients, corrected for indirect range restriction and criterion unreliability using the Hunter, Schmidt, and Le (2006) procedure. M and 90% CI for composite score coefficients ranged from .83 (.68, .91) to .51 (.22, .70).
One-tail significance levels (N = 74): *P < .02; **P < .01; ***P < .001; †P < .0001.
Henderson et al. (2007) found that when direct measures of strength
were used as predictors of performance on firefighter simulation events,
predictor-criterion relationships were linear at the upper range of strength
and endurance but showed a drop-off in task performance at low strength
levels. We examined the current data for evidence of a similar drop-off
effect using the 1-rep maximum strength measure, the number of 15 kg
bar reps, mile run time, and the SEA composite as predictors of physically
based fire task performance, reflected in the 1983 Physical test and 1985
retest, and a fire academy PPS + ladders task rating composite score. The
performance drop-off effect was found for all three fire task performance
measures for the strength, muscular endurance, and SEA composite pre-
dictors. In each of the nine analyses, the quadratic component (squared
predictor score) produced a significant increment in prediction of task per-
formance, with P < .01 in eight of nine cases. The performance drop-off
effect was not found for any of the task measures when aerobic capacity
(mile run time) was used as the predictor.
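The drop-off screen described above is a hierarchical regression step: fit the linear model, add the squared predictor, and F-test the R² increment. A minimal sketch on simulated data (the data and variable names are hypothetical; a significant quadratic increment here simply signals curvature):

```python
import numpy as np
from scipy import stats

def quadratic_increment_test(x, y):
    """F-test of the R^2 gain from adding x^2 to a linear model of y on x."""
    n = len(y)
    ss_tot = np.sum((y - y.mean()) ** 2)
    def r2(*cols):
        X = np.column_stack((np.ones(n),) + cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1.0 - np.sum((y - X @ beta) ** 2) / ss_tot
    r2_lin, r2_quad = r2(x), r2(x, x ** 2)
    df2 = n - 3                      # parameters: intercept, x, x^2
    f = (r2_quad - r2_lin) * df2 / (1.0 - r2_quad)
    p = stats.f.sf(f, 1, df2)
    return r2_lin, r2_quad, f, p

# Simulated curvilinear predictor-criterion relation, n = 74.
rng = np.random.default_rng(2)
strength = rng.normal(size=74)
task = strength - 0.35 * strength ** 2 + rng.normal(scale=0.5, size=74)
r2_lin, r2_quad, f_stat, p_val = quadratic_increment_test(strength, task)
```

With genuinely linear data the increment is negligible; with curvature of the size simulated above the quadratic step is detected easily at n = 74.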

Construct Validity

The construct validation analyses employed full information ML estimation structural equation modeling (SEM). Adequacy of model fit was
assessed using the population root mean square error of approximation
(RMSEA), including its 90% CI (Steiger & Lind, 1980) and the PCLOSE
statistic associated with RMSEA (Jöreskog & Sörbom, 1996). With respect to the adequacy of a specific model to account for the data, Browne and Cudeck (1993) suggested that an RMSEA ≤ .08 is a reasonable error of approximation and that an RMSEA < .05 would signify a close fit of
the model. PCLOSE is the P value for testing the null hypothesis that
the population RMSEA has a value no greater than .05. Nonsignificant
(i.e., P > .05) values indicate an acceptable model; Jöreskog and Sörbom
(1996) suggest P > .50 to signify a good fit of the model.
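Both indices can be computed directly from a model's chi-square. A sketch using the standard formulas (with the N − 1 sample-size convention; SEM software differs on this, so exact PCLOSE values may vary slightly):

```python
import math
from scipy import stats

def rmsea_pclose(chi2, df, n):
    """Point RMSEA and the PCLOSE test of H0: population RMSEA <= .05."""
    rmsea = math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))
    # Noncentrality implied by a population RMSEA of exactly .05:
    lam0 = 0.05 ** 2 * df * (n - 1)
    # P(observing a chi-square at least this large under that close-fit H0):
    pclose = stats.ncx2.sf(chi2, df, lam0)
    return rmsea, pclose

# The model reported later in the text: chi-square(73) = 78.51, N = 74.
rmsea, pclose = rmsea_pclose(78.51, 73, 74)
```

For these inputs the point RMSEA is about .03, in line with the close-fit benchmarks cited above.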
Figure 2 summarizes the results of a structural equation model that
includes GCA and SE latent variables as predictors of long-term job
performance ratings and months of service. Although our sample N of
74 is comparatively low for SEM analysis, accurate recovery of factor
loadings is largely dependent on the communalities of the measured vari-
ables and not sample size (e.g., MacCallum, Widaman, Zhang, & Hong,
1999; MacCallum & Austin, 2000). In our case communalities are quite
high, as evidenced in the model. The model is also highly intuitive and
relatively parsimonious, with 46 parameters estimated from 119 distinct
sample moments.

Figure 2: Structural Equation Model Showing GCA and SE Latent Variables as Predictors of Long-Term Job Performance.
[Figure omitted: path diagram in which the Strength/Endurance and General Cognitive Ability latent variables (measured by the 1983 written and physical entry exams, the 1985 physical retest, and the Str/End/Aerobic (SEA) composite) predict academy physical and academy cognitive performance, which in turn predict the performance ratings latent variable (indicators: 1992 composite rating, 2002 Elite Squad peer and officer nominations, 2006 Outstanding Career nominations) and number of months of active service.]
Note: Standardized regression coefficients are P < .0002, except β = .19.

In the model we have separated the GCA and S/E latent ability variables from their corresponding fire academy cognitive-based and physical-based training success latent variables. The paths between
the corresponding ability-training latent traits are .87 (CI: .78–.96) and
.89 (CI: .81–.96), showing the strong ability-training associations even
for this restricted range sample. Based on ML estimation, all standardized
regression coefficients shown in the diagram are significant at P < .0003,
except the marginally significant (P = .026) path from academy cognitive
performance to months of active service. The model in Figure 2 shows
a good fit to the data (χ 2 (73) = 78.51; RMSEA = .032, CI: .000–.076;
PCLOSE = .697). An alternate model that combines the corresponding ability-training latent variables for cognitive and physical measures shows
a significant decrement in fit (χ 2 (2) = 18.55, P < .001), relative to the
model in Figure 2, although regression weights are quite similar in both
models.
Because the number of parameters fit in the model (46) is large relative to the sample N, we analyzed each of the four major measurement model components in Figure 2 separately (academy physical, academy cognitive,
strength/endurance, performance ratings), fitting only 9 to 12 parameters
in each submodel. The individual component beta weights were highly
similar to their full-model counterparts in Figure 2—the mean absolute
discrepancy was .016 and the largest discrepancy was .04. In addition,
bootstrapped standardized regressions from the job performance ratings
latent variable to the four ratings taken between 1992 and 2006 all differ
by less than .02 from the ML estimates in Figure 2.
Using the full model, we obtained bootstrap estimates of the standardized regression weights and their CI, using the bias-corrected percentile method. The mean beta weights and CI for the paths from S/E to academy physical performance and from GCA to academy cognitive performance were .88 (.78–.96) and .89 (.81–.96), respectively. The mean
beta weights and CI for the paths from academy physical performance
and from academy cognitive performance to job performance ratings
were .51 (.35–.65) and .44 (.29–.59), respectively. These two weights
do not differ significantly (P > .5). Adjusting for indirect range restriction
using Thorndike’s Case III formula, the two beta weights converge to
.61 (.50–.69) and .64 (.52–.74), respectively. Corrected beta coefficients
for cognitive measures show somewhat greater increases than the coef-
ficients for physical measures in Figure 2 because range restriction was
greater for cognitive measures. Although months of active service is pre-
dicted significantly by both cognitive (β = .19) and physical (β = .66)
performance, the physical beta weight is significantly larger (t(71) = 4.23,
P < .0001) than the cognitive weight.
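The bias-corrected percentile method referenced above can be sketched for a generic statistic. The following is a simplified bias-corrected (BC, not BCa) version, illustrated for a correlation on simulated data rather than on SEM path weights:

```python
import numpy as np
from scipy import stats

def bc_percentile_ci(data, stat, n_boot=2000, alpha=0.10, seed=0):
    """Bias-corrected percentile bootstrap CI (rows of `data` are resampled)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    theta = stat(data)
    boots = np.array([stat(data[rng.integers(0, n, n)]) for _ in range(n_boot)])
    # Bias-correction constant from the share of replicates below the estimate.
    p0 = np.clip(np.mean(boots < theta), 1e-6, 1 - 1e-6)
    z0 = stats.norm.ppf(p0)
    z_lo, z_hi = stats.norm.ppf([alpha / 2, 1 - alpha / 2])
    # Shifted percentile points, then read off the bootstrap quantiles.
    q_lo, q_hi = stats.norm.cdf([2 * z0 + z_lo, 2 * z0 + z_hi])
    return np.quantile(boots, q_lo), np.quantile(boots, q_hi)

# Simulated predictor-criterion pairs, n = 74; statistic = Pearson r.
rng = np.random.default_rng(3)
x = rng.normal(size=74)
y = 0.5 * x + rng.normal(size=74)
xy = np.column_stack([x, y])
lo, hi = bc_percentile_ci(xy, lambda d: np.corrcoef(d[:, 0], d[:, 1])[0, 1])
```

With alpha = .10 this yields 90% intervals, matching the CI convention used in the tables; when the bootstrap distribution is unbiased (z0 near zero) the method reduces to the ordinary percentile interval.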

Discussion

Validity of the Original Cognitive-Physical Selection Test

A major practical finding in this project is the high predictive validity coefficient obtained for a selection test consisting of a measure of GCA
and an assessment of physical ability using firefighting tasks requiring
upper body strength, muscular endurance, and aerobic endurance (SE).
The high validity coefficients for GCA and SE appear immediately during
fire academy training and continue throughout the 21-year career interval
studied. The g-saturated written test had operational validity coefficients
(r XPa ) from .86 to .92 for fire academy exams and critical skills perfor-
mance, and the SE-saturated physical abilities test had comparably high
predictive validities for instructor ratings of performance on job tasks that
demand high levels of physical ability. The high correlations between
both cognitive and physical screening test scores and their cognitive- and
physically based counterparts during training support a view that in fire
service there is little basis for an ability-achievement distinction in either
domain. Operational validity coefficients based on supervisor job ratings
in 1992, 2002, and 2006 and the 14-year composite ratings were .86, .72,
.84, and .86, respectively. There was also a high consistency in individual
Cognitive test and physical test validity coefficients across the evaluation
period.
Our analysis suggests that the 1983 Physical test could have been
improved slightly with the addition of a firefighting simulation task that
was highly dependent on upper body muscular endurance. Adding the
15 kg reps bar lift score to the original simulation events increases va-
lidity coefficients for supervisor ratings in all rating years. Fire academy
instructor ratings of PPS in equipment handling also appear to contain
predictive information about career-long job performance that is not re-
flected in selection test scores. The ratings produced small but significant
incremental validities to all supervisor job performance ratings and to
the objective active duty measures. Although such ratings could obvi-
ously not be used as part of the selection procedure, our finding sug-
gests that fire academy instructors are sensitive to sources of job perfor-
mance variance not being detected by the selection test, even when sup-
plemented with cognitive and physical ability measures obtained during
training.
There is no evidence of a time trend for corrected validity coefficients
for either the physical or the cognitive selection tests, which would have
signified Ability × Tenure interactions. Job experience neither attenu-
ated nor exaggerated performance differences related to ability in either
GCA or SE domains. Rating methodology differed across tenure, and
raters were judging cumulative long-term performance, not discrete time periods; hence, the data were not suitable for trend analysis. We can only
conclude that tenure-appropriate ratings produce similar validity coeffi-
cients throughout 21 years of fire service.
Performance ratings were not available for 6 years immediately fol-
lowing training, so early career Ability × Tenure interactions cannot
be ruled out. There is little evidence for early Ability × Tenure inter-
actions in other jobs, however. Schmidt, Hunter, Outerbridge, and Goff
(1988) studied four military jobs for 5 years posttraining and found high-
and low-GCA group differences in job knowledge, but work sample and
supervisor ratings were constant during this period. McDaniel (cited in
Hunter and Schmidt, 1996) examined the United States Employment Service (USES) database of civilian jobs and found a slight early rise in
GCA-job performance validity coefficients during job years 1–6 followed
by stability up to year 12, the last year examined. This study extends
the stability period to 21 years, indicating that GCA and SE predictors
are robust across age and evolving technical and procedural changes in
firefighters’ work environment.
It should be noted that validity coefficients for GCA and SE are
interdependent—the validity coefficients for GCA are based on a popula-
tion that has been screened on SE and vice versa. Because these contingent
validity coefficients are conditional on each other, it is incorrect to assume
that a screening test consisting of only one of these predictors would have
the observed validity coefficients shown in Table 1. In fire service the
linear compensatory model is often invalid at low score ranges found in
an applicant pool. Some applicants have insufficient levels of SE to per-
form common critical tasks and are not qualified for service regardless of
their GCA level. The reverse situation (very low GCA, high SE) can also
occur but is likely to involve fewer applicants because of the education
requirement. Computing operational validity coefficients (r XPa ) does not
circumvent the contingent validity issue.
Physical test performance also predicted the total months a firefighter
was involved in active fire suppression and rescue (r XPa = .71). Sub-
stantial investments in selection and training make turnover costs high
in fire service. In addition to overt turnover, which is generally low
in fire service, there is hidden turnover—firefighters who remain in a
department without being involved in active fire suppression. Fire departments often temporarily assign injured firefighters to less physically demanding, non-fire-combat jobs during their recovery period. These relatively short assignments (considered active service in our analysis) are
a necessary component in a job that entails frequent injuries. More prob-
lematic are situations in which nonfirefighting jobs within a department
become filled on a long-term basis by firefighters with low physical abil-
ity who seek out these positions, gradually displacing more appropriately
trained and usually lower paid clerical and other workers. Total num-
ber of fire and rescue runs was also predicted by physical test scores,
but the effect is largely a function of months of active duty. We did
not observe a significant tendency of generally low physical ability in-
cumbents to transfer to less active fire stations, reported previously by
Doolittle and Kaiyala (1986), although firefighters showing greater mus-
cular endurance during training tended to be located in more active station
houses.

Comparisons With Earlier Municipal Fire Service Validation Studies

Barrett et al. (1999) located 24 unpublished reports of validation studies of cognitive and cognitive/mechanical firefighter selection tests
from data collected between 1958 and 1985, apparently locating no re-
ports produced later. Based on 23 validity coefficients predicting training
success, the authors computed an operational coefficient of .77 and, based
on 47 coefficients using job performance criteria, the operational validity
of the screening tests was estimated to be .49. Operational validity coef-
ficients in the Barrett et al. meta-analysis used the Hunter and Schmidt
(1990) procedure for direct range restriction, which undercorrects r XPa
coefficients. Based on the relationship between the 1990 and 2006 method-
ologies, we estimated the HSL adjusted operational validity coefficients
reported by Barrett et al. to be approximately .87 for training and .66
for job performance criteria. These coefficients are only slightly below
our operational validities of .91 for academy training success and .70
for career job performance ratings. Some studies included in the Barrett
et al. meta-analysis used joint linear-compensatory cognitive-physical se-
lection, which can produce downward bias in the validity coefficients due
to the compensatory variable effect described in the introduction and HSL
undercorrects for this bias. Although the operational validity coefficients
reported by Barrett et al. and this study are rather close, the observed
validity coefficients, r XYi , are significantly larger in this study than the
mean observed r XYi coefficients reported by Barrett et al. (this study vs. Barrett et al.: .76 vs. .50 for training; .47 vs. .24 for job ratings). The higher coefficients in our
study reflect our higher than average criterion reliabilities resulting from
the use of many raters and many academy tests, and from using a sample
with less range restriction than often found in firefighter selection. Our
observed validity coefficients, roughly double those reported by Barrett
et al. for job performance, resulted in a power-gain equivalent of nearly
quadrupling our sample size.
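The correction sequence discussed above can be sketched in code. This is an illustrative implementation of the classic Hunter–Schmidt (1990) direct range-restriction procedure (criterion-unreliability correction followed by Thorndike's Case II formula), not the HSL (2006) indirect-restriction method actually used in the article, which yields larger corrections; the input values below are hypothetical, not taken from the study.

```python
from math import sqrt

def operational_validity(r_obs, r_yy, u):
    """Estimate operational validity from an observed incumbent coefficient.

    r_obs : observed predictor-criterion correlation in incumbents
    r_yy  : criterion reliability
    u     : SD ratio (incumbent SD / applicant SD), 0 < u <= 1

    Step 1 disattenuates for criterion unreliability; step 2 applies
    Thorndike's Case II correction for direct range restriction.
    """
    r = r_obs / sqrt(r_yy)                        # criterion-unreliability correction
    U = 1.0 / u                                   # unrestricted/restricted SD ratio
    return r * U / sqrt(1 + r * r * (U * U - 1))  # Case II range-restriction correction

# Hypothetical inputs: observed r = .24, criterion reliability .60, u = .60
print(round(operational_validity(0.24, 0.60, 0.60), 2))
```

Because HSL models restriction operating indirectly through a suitability composite, its corrected coefficients exceed those this direct-restriction sketch produces for the same inputs.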
There is evidence that operational validity coefficients for largely
g-loaded tests are somewhat lower in fire departments serving smaller
cities and towns. The International Public Management Association for
Human Resources (IPMA-HR) carried out two large-scale concurrent val-
idation studies of two of their firefighter selection tests, used widely in
smaller municipalities (IPMA-HR, 1996, 2009). Job analyses carried out
across 73 municipalities indicated that the basic abilities required of a new
firefighter differed little across jurisdictions, allowing the pooling across
cities. The validation sample for the earlier test consisted of 367 firefight-
ers from 51 communities; the later test validation used 251 firefighters
from 35 communities. Observed validity coefficients (r_XYi) for predicting
composite overall job performance rating were .25 and .24, resulting in
an estimated pooled HSL operational validity coefficient (r_XPa) of .41,
noticeably lower than the large city validity coefficients reported in Bar-
rett et al. There is reason to believe that the lower validity is due in part
to the limited supervision duties of first-line supervisors in these small
departments and their close working and living relationships with their
ratees, making objective ratings difficult. Observed validity coefficients
for the earlier test were found to be .20 for ratings from first-line officers
(N = 301) and .46 from higher level supervisor ratings (N = 56), which
produce estimated HSL operational coefficients of .35 and approximately
.71, respectively.
Our results indicate that the relationships between GCA and training
and job performance are linear across the full range of incumbent test
scores, a finding consistent with the substantial body of validity research
involving GCA (e.g., Coward & Sackett, 1990). Furthermore, having a
substantive physical ability job requirement does not appear to diminish
the role of GCA in job success of firefighters. Despite the importance
of physical strength and endurance in firefighting, the validity coefficient
for GCA predicting job performance is in the .67–.70 range. Schmidt
et al. (2008) estimated HSL operational validity coefficients to be .62 for
medium- and .68 for high-complexity jobs in the US.
A small number of validation studies for physical screening tests for
fire service were located, but only two provided sufficient information to
obtain HSL operational validity coefficients for training and job perfor-
mance. In a predictive design using a sample of 314 male firefighters,
Henderson (1985), using Bobko’s (1983) procedure, reported operational
validity coefficients for global ratings of job performance obtained by six
different methods. Validities ranged from .29 to .50 with a mean of .39
(HSL r_XPa ≈ .55). Operational validities reported for ratings specifically
assessing task related strength and endurance were .69 (est. HSL r_XPa ≈
.80) on the job and .29 (est. HSL r_XPa ≈ .43) during training. In a concur-
rent design, Landy, Jacobs, and Associates (1987), using approximately
130 firefighters, reported an operational validity coefficient of .29 (est.
HSL r_XPa ≈ .43) for a battery of fire simulation tasks predicting an overall
performance criterion based on the mean of 15 job dimension ratings and
an operational coefficient of .44 (est. HSL r_XPa ≈ .61) for the mean of
eight physically based job dimensions. Note that in both of these studies,
HSL r_XPa coefficients underestimate operational validity because of the
compensatory variable problem described in the introduction.
In this study, upper body strength and muscular endurance mea-
sures were usually better predictors of physical job criteria than aero-
bic fitness, although the three measures are correlated. Among simula-
tion tasks used by Landy et al. (1987), those depending heavily on up-
per body strength and muscular endurance were better predictors of job
performance ratings of physically demanding firefighting activities than
simulation tasks with lower strength loadings. Their hose hoist task, a
prototypical upper body muscular endurance exercise, had an HSL oper-
ational validity of approximately .58 for overall job performance and .75
for specific physically based ratings. For predicting performance on ob-
jectively measured firefighting simulations, the results of Sothmann et al.
(2004) and Henderson et al. (2007) indicate that upper body SE is the
primary predictor of success.

General Cognitive Ability and Firefighter Performance—Construct Validity

The exceptionally high validity of GCA for predicting training per-
formance is to be expected if the various cognitively based achievement
measures of fire training success are viewed as alternative assessments of
intelligence. The position that there are no differences in kind between
general intelligence and achievement was made explicit by Cleary et al.
(1975): “Intelligence is defined as the entire repertoire of acquired skills,
knowledge, learning sets, and generalization tendencies considered in-
tellectual in nature that are available at any one period in time.” (p. 19)
and: “When a number of achievement tests in different subject matters are
administered, thus achieving a greater breadth on the achievement side,
the total score obtained from the test battery is very highly correlated
with measured intelligence. As a matter of fact, this correlation is about
as high as the intercorrelations among recognized tests of intelligence.”
(p. 21). This view has support from the psychometric community (Cron-
bach, 1975) and from educational (Lubinski, 2004) and I-O psychology
(Schmidt et al., 2008).
An examination of fire academy training requirements attests to the
breadth of cognitive demands required in academic achievement, includ-
ing knowledge and understanding of general principles and procedures for
fire-suppression, rescue, and emergency medical situations. Recruits must
attain an understanding of science- and engineering-based principles that
can be applied in new situations and environments, including the chem-
istry and physics of fire, heat transfer, building construction, ventilation
principles, toxic materials, and human safety. It is hardly surprising that a
GCA factor accounts for most of the variance in the cognitive portion of
the entry test and the cognitively based academy achievement measures used
in the study. Hunter and Schmidt (1996) demonstrated that, in both civil-
ian and military jobs, job knowledge is the major mediating link between
GCA and job performance. Our results are consistent with that conclusion
but further demonstrate that for fire service, knowledge acquisition, as
reflected in achievement tests in many different subject areas, is itself an
indicator of GCA.
The high operational validity of the g-loaded cognitive test and the
strong relationship between cognitively based performance during training
and career-long job performance ratings observed in this study call into
question the veracity of assertions that cognitive assessments obtained in
nonstress environments are poor predictors of decision-making behavior
of firefighters in high stress work situations. The effects of stress and
time pressure on decision making (increased errors, neglect of peripheral
cues, reduced working memory capacity, and speed/accuracy trade-offs)
are reasonably consistent in the literature (e.g., Orasanu, 1997), but a
deep and extensive base of organized knowledge appears to mitigate these
stress effects (Klein, 1996). The knowledge base obtained by training
and experience is not only available for decision making, thus reducing
the capacity demand of the task; it can also lead to greater confidence
in an individual’s ability to deal with stressful situations by reducing the
stress of threat (Orasanu, 1997). Given this knowledge-base contribution
to enhanced performance under stress, finding the substantive path from
GCA to training success to career-long job success is not surprising.
The 1983 selection test examined here and most of the fire service tests
reviewed by Barrett et al. (1999) were developed using a content-based
validity model, with test item development based on knowledge, skills,
and abilities identified as important for job success. The criterion-based
validity evidence for these g-saturated tests provides a convergent line of
evidence for the job relatedness of these tests for firefighter selection. The
regression coefficients of the structural model in Figure 2 provide further
validity evidence at the construct level. The consistency of relationships
among the four job ratings obtained using different methods over many
years supports a general job performance rating construct, which is not
time or method dependent, and is correlated with an academy cognitive
performance latent variable, which in turn is highly correlated with GCA,
as reflected by the 1983 selection test.

Strength and Endurance and Firefighter Performance—Construct Validity

Correlated strength and aerobic endurance factors were previously
observed in a large sample of fire academy trainees assessed on several
strength measures, maximal oxygen uptake (VO2max), and body fat on
multiple occasions during training (Henderson et al., 2007). In this range
restricted incumbent sample, measures of strength, muscular endurance,
and aerobic capacity are correlated .27 to .46 and strength-endurance cor-
relations are found even in highly range restricted firefighter samples (e.g.,
Cady, Bischoff, O’Connell, Thomas, & Allan, 1979; Davis et al., 1982).
Our study demonstrates that there is considerable latitude in the meth-
ods used to assess strength and endurance. Simulated firefighter evolutions
using a continuous series of tasks, a set of discrete individual task simu-
lations, direct measurements of strength and endurance, or a mix of these
three approaches all appear to produce comparable validity coefficients.
The finding corresponds closely to the results of Henderson et al. (2007)
involving a different set of firefighting simulations used in another mu-
nicipal fire department. Although SEM analysis was not used in early
firefighter studies, our data are consistent with these results (Davis et al.,
1982; Rosenfeld & Thornton, 1976). Studies of firefighter injury rates have
also demonstrated that higher SE levels are associated with fewer back
injuries (Cady et al., 1979) and fewer general musculoskeletal injuries,
despite the tendency for stronger firefighters to be in the more active fire
companies (Doolittle & Kaiyala, 1986). A review of the consequences
of low aerobic and muscular fitness in physically demanding jobs can be
found in Sharkey and Davis (2008).
The presence of a single physical task performance factor that ac-
counts for most of the variance among many different fire simulation ex-
ercises implies that a close matching of physical job tasks across different
fire departments may not be necessary to demonstrate transportability of
screening devices. Physical task samples also meet Fiske’s (1971) empir-
ical interchangeability criterion required for extrinsic convergent validity,
indicating that the task samples measure the same construct. In this study
the test-criterion correlation profiles for the two simulation exercises in the
1983 physical selection test were nearly identical. In addition, substituting
the strength, muscular endurance, and aerobic endurance (SEA) compos-
ite score for the original content-based physical abilities test component
resulted in only slightly lower validity coefficients than the firefighting
simulations. These results are consistent with the view that performance
variance in most physical firefighter evolutions is caused by variance in
upper body strength and muscular endurance and, to a lesser degree,
cardiovascular endurance. Because these three components are highly
correlated in heterogeneous populations, variation in the weights of these
abilities across different tasks should have minor effects on validity.
Isometric (e.g., Blakley et al., 1994) and isotonic 1-rep maximum
strength tests (Brown & Weir, 2001) provide reliable measurements in
units that are directly comparable across studies and several field measures
of aerobic endurance, such as the step test (e.g., Cotton, 1971; Davis &
Wilmore, 1990) or 1.5-mile run (e.g., Larsen et al., 2002), are readily
converted into estimates of maximal oxygen uptake (VO2max) in liters
per min or relative VO2max in ml/[min · kg]. Adding some of these direct
measures of physical capacities to screening tests that use content-based
simulations would help build a database of physical measurements on
communicable scales reflecting underlying biological processes.
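As a concrete instance of such a conversion, the 1.5-mile run mentioned above is commonly mapped to relative VO2max with a simple field equation. The constants below come from a widely cited field approximation (often attributed to ACSM testing guidelines), not from this article, so treat them as an assumption:

```python
def vo2max_from_run(minutes):
    """Estimate relative VO2max in ml/[min · kg] from 1.5-mile run time.

    Uses the common field approximation VO2max = 483/t + 3.5, where t is
    run time in minutes. The constants are an assumed convention here,
    not values reported in the article.
    """
    return 483.0 / minutes + 3.5

# A 12-minute 1.5-mile run maps to roughly 43.8 ml/[min · kg]
print(round(vo2max_from_run(12.0), 1))
```

Reporting screening results on a scale like this, rather than raw completion times, is what makes scores comparable across departments and studies.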
The 1983 Physical test consisted of two distinct events, the first re-
quiring less than 2 minutes and the second requiring less than 1 minute
for most applicants. Although physical tests with discrete events were
common at that time, this test format began to come under criticism dur-
ing challenges to the validity of firefighter physical selection tests. Some
critics argued firefighting is primarily an aerobic activity, whereas short
duration (<5 min) events are primary indicators of strength with a minor
aerobic component. Physical selection tests began to evolve into exercises
in which applicants completed several tasks in one continuous event re-
quiring several minutes, with scores based on total completion time. Such
tests diminish the short duration criticism but at a cost of increased psy-
chometric complexity and less control over the relative weights of various
task components on the test score.
All relevant evidence in our study suggests that criticisms of shorter
discrete test events are unfounded. The short individual simulation events
both correlated highly with the mile run assessment of aerobic endurance
(hose/tower r = .62, dummy rescue r = .47), and adding mile run times
to the physical component of the 1983 selection exam did not increase
predictive validity for any criterion measure in Table 1. Our results con-
tradict the joint claims that critical firefighting and rescue physical tasks
are primarily aerobic and that short duration test events do not reflect
aerobic fitness. The first assertion appears to be based on expert witness
opinion in litigation unsubstantiated by empirical research. The second
assertion was based on the results of Åstrand and Rodahl (1986), which
indicated that a 2-minute exercise duration is required for equal contri-
butions from aerobic and anaerobic processes. The Åstrand-Rodahl time
course was later found to have been determined incorrectly: Medbø and
Tabata (1989) showed that aerobic energy release accounts for 40% of total
energy during the first 30 sec and 50% during the first 60 sec of exhausting
exercise.
Our firefighter sample showed a substantial decline in retest scores
on the Physical test readministered 26 months later. Presumably at peak
fitness at the time of the 1983 Physical test, firefighters across the full
range of 1983 scores showed an average 24% increase in time to complete
the test simulations in 1985. Because time increases were proportional
to 1983 time scores, the absolute task completion time differences be-
tween high and low performing firefighters increased over this period. If
a physical ability cut point had been used in selection, many hires near
that cut point would have entered training already below the designated
minimum acceptable physical ability level. Results from our test-retest
design for the physical test directly contradict the claim that, because physical
capabilities can change over time, long delays between testing and hire
can compromise the validity of physical screening tests. Two-year test–
retest reliability for the incumbents was .85 (applicant pool r_XXa ≈ .92),
and validity coefficients for the test and retest were nearly identical for
both training and job criteria, despite the drop-off in retest performance.
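The step from the incumbent retest reliability of .85 to the applicant-pool estimate of about .92 can be sketched with the standard projection for reliability under range restriction. The formula assumes selection shrinks true-score variance while leaving error variance unchanged; the SD ratio of .73 used below is an assumed value chosen to reproduce the reported figures, not one stated in the article:

```python
def applicant_reliability(r_incumbent, u):
    """Project an incumbent-sample reliability onto the applicant pool.

    Assumes error variance is unaffected by selection, so
    1 - r_a = u**2 * (1 - r_i), where u = incumbent SD / applicant SD.
    """
    return 1 - u**2 * (1 - r_incumbent)

# An assumed incumbent/applicant SD ratio of .73 maps the observed
# two-year retest reliability of .85 onto roughly .92 in the pool
print(round(applicant_reliability(0.85, 0.73), 2))
```

With no restriction (u = 1) the formula returns the incumbent value unchanged, which is a quick sanity check on any implementation.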
The results from this study and others cited suggest that variation
in firefighting knowledge and decision making is primarily GCA-based
for individuals with comparable experience. In a parallel manner, the SE
construct depicted in Figure 2 and found in numerous studies in the psy-
chological (e.g., Arnold, Rauschenberger, Soubel, & Guion, 1982; Hen-
derson et al., 2007; Hogan, 1991a, 1991b; Meyers, Gebhardt, Crump, &
Fleishman, 1993) and exercise literature (e.g., Berger, 1963; Marsh, 1993;
Mead & Legg, 1994; Shaver, 1971) appears to be a physical counterpart
of GCA in fire service. SE encompasses both the full range of physically
demanding fire-suppression and rescue activities, and direct measures of
physical strength, upper body muscular endurance, and aerobic fitness.
Direct measures of muscular strength and endurance do, however, appear
to overpredict firefighter task performance for firefighters at the low end
of strength and endurance distributions.

Reducing the Weight of GCA and SE in the Firefighter Selection Process

Alternate Cognitive and Physical Measures With Lower GCA and SE Loadings

The assertion that content-based firefighter-like screening tasks with
low GCA and SE loadings can be devised that capture cognitive and phys-
ical factors important for job success continues to receive little empirical
support. Following the Berkman ruling (1982) involving selection of New
York City firefighters, written and physical components of selection tests
tended to change to more resembling firefighting activities. Some of those
tests were challenged based on the contention that there is a high degree of
specificity in firefighting and that even moderate departures from precisely
defined job activities constitute a threat to validity. Our results suggest the
opposite, demonstrating that there is considerable latitude in the choice
of a set of specific measures that adequately assess GCA and SE, which
in turn predict firefighter success across a wide range of training and job
tasks. These results are consistent with the Barrett et al. (1999) validity
generalization study for GCA and with the Henderson et al. (2007) analy-
sis of SE. Claims that firefighter abilities are highly specific are based on
the assumption of a static work environment in which equipment and pro-
cedures change little across different fire and rescue situations or over an
incumbent’s career. Because fire service differs greatly from this charac-
terization, broadly based assessments of GCA and SE would be expected
to be the most robust long-term predictors of job performance.
There is a widely held belief that personality and related traits influ-
ence job performance, although the validation of selection test self-report
personality measures is difficult because the observed coefficients are usu-
ally low. Using our operational validity coefficient of .86 for the GCA-SE
composite and assuming an operational validity of .20 for an added pre-
dictor, we approximated the effects of adding an uncorrelated personality
measure to the current selection test. Validity would increase from .86 to
.88, and changes in standardized test score mean differences, d, would be
less than 5% for both White-minority (increased d) and male-female (de-
creased d) comparisons. These small and not always d-reducing effects
of adding a personality measure to our test are consistent with results
of Schmitt, Rogers, Chan, Sheppard, and Jennings (1997), Ryan, Ploy-
hart, and Friedel (1998), and Potosky, Bobko, and Roth (2005). “Faking
good” on self-report measures (e.g., Rosse, Stecher, Miller, & Levin,
1998; Winkelspecht, Lewis, & Thomas, 2006) is also problematic in
safety forces testing, where test preparation services provide coaching
and sample personality scales to applicants.
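The .86-to-.88 approximation above follows directly from the two-predictor multiple-correlation formula. This sketch uses only the two coefficients stated in the text (.86 for the GCA-SE composite, .20 for the added uncorrelated measure); the function itself is a generic textbook formula, not the article's computation:

```python
from math import sqrt

def composite_validity(r1, r2, r12=0.0):
    """Multiple correlation of an optimally weighted two-predictor composite.

    r1, r2 : validities of the two predictors
    r12    : correlation between the predictors (0 for uncorrelated)
    Standard formula: R^2 = (r1^2 + r2^2 - 2*r1*r2*r12) / (1 - r12^2).
    """
    return sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))

# Adding an uncorrelated r = .20 personality measure to the r = .86
# GCA-SE composite raises validity only from .86 to about .88
print(round(composite_validity(0.86, 0.20), 3))
```

The marginal gain is small precisely because validities combine as a root of summed squares: a .20 predictor contributes .04 to R², against the .74 already contributed by the composite.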
Municipalities and courts seeking solutions to the disparate impact
dilemma in hiring have been swayed by assertions that including or
substituting personality and related measures in selection tests can pro-
vide a partial solution to the problem of adverse impact related to gen-
der and ethnicity without validity loss. This prospect may be illusory
with respect to fire service selection. An overview of difficulties posed
by the use of broad personality dimensions in job selection was pro-
vided by Murphy and Dzieweczynski (2005) and Morgeson et al. (2007a,
2007b).

Predictor Weighting in Multivariate Screening Procedures

An issue of considerable importance and thus likely to be contested in
litigation is the relative weighting of cognitive and physical job predictors
in selection. Given the closeness and lack of significance between GCA
and SE validities in Table 1 and the similar results at the construct level
shown in Figure 2, our results suggest that equal weighting of cognitive
and physical test scores is likely to be the most defensible and robust
option (e.g., Dawes & Corrigan, 1974) for an applicant pool similar to
ours, containing approximately 10% women. Because they are incumbent
referenced rather than applicant referenced, two commonly used weight-
ing procedures can produce substantial discrepancies from the weights
obtained with range-adjusted coefficients for any set of predictor vari-
ables. The first procedure—using regression weights based on observed
correlations from incumbents—will be distorted when there is differential
range restriction on the predictor variables. These effects can be large
when applicant-to-incumbent SD ratios differ noticeably among predic-
tors. The second procedure—deriving subtest weights from task analysis
questionnaires in content validation studies—uses no information about
incumbent and applicant pool ability distributions; hence, appropriate
weightings for the applicant screening test cannot be determined from the
available data.
Our results indicate that maximum test utility would be achieved by us-
ing an approximately equally weighted linear composite of a GCA-based
and an SE-based ability test for selection in a top-down manner. Deviations
from this procedure result in lower test utility and may produce weights
of the two components that deviate substantially from those determined
optimal for the job. For example, compared to optimal top-down hiring
using equally weighted GCA and SE test scores, selection using a cut-
point at the 10th percentile on SE and top-down hiring on GCA results
in approximately a 23% lower test utility, a 58% increase in the range in
composite ability of selected applicants, a 1.15 SD difference in GCA and
SE standardized test means, and a 7:1 GCA:SE weighting in selection.
Primary reliance on cognitive tests for fire service selection when physical
ability and other characteristics are considered important for the job has
been rejected on both validity and adverse impact grounds (Bradley et al.
v. City of Lynn et al., 2006). The technical difficulties posed by multiple
hurdle models are addressed by Hunter et al. (2006), and Mendoza et al.
(2004) provide an understanding of the biasing effects of multiple hurdle
designs and insight regarding the framing of range restriction as a missing
data problem.
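The utility loss from replacing top-down composite selection with a low SE cut point can be illustrated with a small Monte Carlo sketch. Everything here is hypothetical: GCA and SE are drawn as uncorrelated standard normals, performance is taken as the equally weighted sum the study found optimal, and the magnitudes produced will not match the specific 23% figure reported in the text:

```python
import random

random.seed(0)

# Hypothetical applicant pool of (GCA, SE) scores
N, hires = 20000, 2000
pool = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]

def perf(gca, se):
    # Expected criterion under the equal weighting assumed here
    return 0.5 * gca + 0.5 * se

# Strategy A: top-down hiring on the equally weighted composite
top_down = sorted(pool, key=lambda p: p[0] + p[1], reverse=True)[:hires]

# Strategy B: pass/fail SE cut at the 10th percentile, then top-down on GCA
se_cut = sorted(p[1] for p in pool)[int(0.10 * N)]
passed = [p for p in pool if p[1] >= se_cut]
cut_point = sorted(passed, key=lambda p: p[0], reverse=True)[:hires]

def mean(xs):
    return sum(xs) / len(xs)

a = mean([perf(g, s) for g, s in top_down])
b = mean([perf(g, s) for g, s in cut_point])
print(f"top-down utility {a:.2f}, cut-point utility {b:.2f}")
```

The cut-point strategy loses utility because, beyond the 10th percentile screen, SE carries essentially no weight in who is hired, so the selected group's SE mean sits far below its GCA mean, mirroring the weighting imbalance described above.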
One argument for relaxed hiring standards for cognitive or physical
ability levels is that supervisors can “work around” firefighters with low
ability in either area by judicious task assignments. Typically, the first
responder apparatus arrives with a supervising officer and two to four
firefighters, depending largely on the city’s resources. Supervisor flexi-
bility is thus limited most severely in less affluent municipalities where
fire risks also tend to be higher and response times longer. Even in sit-
uations where flexibility in assignments is possible, ethical and practical
labor equity issues arise when workers with highest ability levels are
disproportionately assigned to the most demanding and highest risk job
activities. Increasing personnel on the fire ground to compensate for low
physical ability is successful only to a limited degree. Increasing team
size reduces completion time for some firefighting team exercises, but the
effect is confined to team increases from N = 2 to N = 4 (Cortez, 2001).
The consequences of selection procedures that use lottery-based rules
with minimum thresholds of acceptance are rarely attractive (Grofman &
Merrill, 2004), and Berk (1996) has described the pitfalls in setting certi-
fication standards.
Limitations of the Current Study

The design as implemented in this study has several limitations. First,
observed validity coefficients are highly significant, but CIs of operational
validity coefficients are large due to our relatively small sample size. To
some extent this problem is ameliorated by the large observed validity
coefficients, the consistency of the operational validities across multiple
criteria, and by unambiguous associations among latent variables in SEM.
Second, using a single training class cohort reduced potential sources of
extraneous performance variance in the incumbent sample but at a cost to
generalizing to other age cohorts or municipalities. Because our sample
could be idiosyncratic in some respects, replication is in order for effects
found that have not previously been reported in other fire departments.
Where prior data are available, our results are consistent with the earlier
findings, but we found no other test-retest studies of firefighter physical
entry exams. Our results—a high test-retest reliability, nearly identical
validity coefficients from the two test administrations, a rapid drop-off in
performance from test to retest, and greater absolute performance losses
for low-scoring applicants—all have implications for firefighter selection
that are sufficiently important to call for replication. The paucity of acces-
sible studies relating physical ability to job effectiveness of firefighters is
difficult to justify, given their importance in selection decisions for one of
the most physically demanding of all occupations (e.g., Sharkey & Davis,
2008).
The third limitation in this study involves the use of the HSL correction
for indirect range restriction and criterion unreliability to estimate opera-
tional validity coefficients. The model on which this correction is based
assumes that the suitability composite being used in selection does not as-
sess any characteristic that would affect the criterion variable for reasons
other than those measured by a single predictor variable, X, limiting range
restriction effects to those mediated by X. The assumption is frequently
violated in our study because the suitability composite includes two valid
predictors (tests of GCA and SE) for many of the criterion measures.
The consequence of violating the HSL model is usually varying degrees
of under correction in estimating operational coefficients (Hunter et al.,
2006; Le, 2003).
The fourth limitation pertains to the job demands in this large city
department with substantial heavy industry and a large percentage of
older residential buildings. These factors lead to a high percentage of
calls involving structural fires, which is often not the case in primarily
residential municipalities containing largely post-1960 structures. Our
results may not pertain in many of these large newer communities or to
small municipalities.
Practical Implications

Developers of firefighter entry-level screening tests must demonstrate
validity for the separate cognitive, physical, and other screening compo-
nents and then justify the relative weights assigned to each component in
the selection process. The validation of the GCA and SE is likely to be
less challenging than establishing defensible test component weights and
maintaining them through the screening process. Because adverse impact
operates in opposite directions for gender and for ethnic differences as
GCA:SE weights shift, any weighting decision can be contentious.
Stating that general intellectual ability and muscular strength/
endurance are important in firefighting is likely to elicit a “we already
know that” response from most I-O psychologists. The courts, however,
appear less convinced with respect to both of these broadly defined abil-
ities, often dismissing validity generalization evidence in the absence of
local validation and discernible test specificity (e.g., Biddle & Nooren,
2006; EEOC v. Atlas Paper, 1989).
Because test security issues rule out most preexamination criterion-
based validation efforts, developing cognitive and physical tests that are
strongly content based, derived from a local job analysis, is a prudent
strategy even when a criterion-based validation study is planned. Because
direct measures of strength and endurance have some distinct advantages
as predictors and can be used to support construct validity of firefighter
simulation tasks, we suggest adding them to the criterion validation study.
Our work suggests that fire training academy written materials and fire-
suppression task training exercises are valuable for creating test content
because training performance in both cognitive and physical domains
correlates highly with later job performance ratings.
Most cognitive test sections designed to assess judgment and the abil-
ity to meet training demands of fire service will be heavily g-saturated,
and most realistic physical firefighter task simulations are likely to be
heavily SE-saturated. Both components are thus vulnerable to the misuse
of exploratory factor analysis to challenge test content because analysis
of a heterogeneous applicant pool is likely to produce single factors in
each case (see, e.g., U.S. v. The City of New York, 2009). Using appro-
priate psychometric methods and structural models to separately examine
the internal structures and subtest relationships of the cognitive and the
physical tests is now a judicious component of content validation.
Content validation provides no quantitative link between task impor-
tance/difficulty assessments obtained in job analyses and the weights of
test components used to select from an applicant pool with a multivariate
distribution of abilities different from that of incumbents. When techni-
cally feasible, a criterion-based GCA+SE validation follow-up is the best
source of weighting information. Based on our observed validity coeffi-
cients and the ability distributions of both our selected incumbents and
the 10% female applicant pool, we determined that approximately equal
GCA:SE weighting maximized test validity. Large gender differences in
physical test scores necessitate that this weighting be modified for an
applicant pool with different male–female proportions.
Test component weights determined to be optimal in validation re-
search can become substantially altered in the selection process if di-
chotomous scoring with a low cut point is adopted for one of the predic-
tors. This procedure is frequently used for physical ability screening, even
though in critical fire-suppression and rescue operations, property and life
losses increase, often at an accelerating rate, with longer task completion
time. The direct links between speed of task completion and firefighting
success, the methodological problems encountered in establishing a min-
imum acceptable proficiency cutoff for physical ability (Henderson et al.,
2007), and the large discrepancy in cognitive and physical weightings in
selection created by low cutoffs should all be considered before adopting
a cutoff procedure. The Bradley et al. v. City of Lynn et al. (2006) deci-
sion has particular relevance to this issue. The situation involving cutoff
scores became further complicated by the 3rd Circuit’s ruling in Lanning
v. SEPTA (1999, 2002), stating that “a discriminatory cutoff score must
be shown to measure the minimum qualifications necessary to perform
successfully the job in question” and indicating that this standard should
be evaluated separately from test validity and utility. Apart from the legal
issues raised by this Court’s interpretation (Gutman, 2009; Sarno, 2003),
strict adherence to this requirement may not be technically possible for
most firefighting jobs.
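How a low pass/fail cutoff alters effective component weights can be illustrated with a hypothetical simulation (all distributions and percentages invented for the sketch): applicants are hired either from an equal-weight composite or by a lenient physical cutoff followed by cognitive ranking, and the physical score's association with the hiring decision nearly vanishes under the second scheme:

```python
import random
import statistics

random.seed(1)

# Hypothetical applicant pool with independent standardized GCA and SE scores.
N = 20000
gca = [random.gauss(0, 1) for _ in range(N)]
se = [random.gauss(0, 1) for _ in range(N)]

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

# Scheme A: equal-weight composite, top 10% of the pool hired.
comp = [0.5 * g + 0.5 * s for g, s in zip(gca, se)]
cut_a = sorted(comp, reverse=True)[N // 10]
hired_a = [1.0 if c > cut_a else 0.0 for c in comp]

# Scheme B: lenient pass/fail physical cutoff (10th percentile of SE),
# then survivors ranked on GCA alone until 10% of the pool is hired.
se_cut = sorted(se)[N // 10]
survivors_gca = sorted((g for g, s in zip(gca, se) if s > se_cut), reverse=True)
gca_cut = survivors_gca[N // 10]
hired_b = [1.0 if s > se_cut and g > gca_cut else 0.0 for g, s in zip(gca, se)]

# Under scheme B, SE barely influences who is hired.
print(f"corr(SE, hired) with equal-weight composite: {corr(se, hired_a):.2f}")
print(f"corr(SE, hired) with low pass/fail cutoff:   {corr(se, hired_b):.2f}")
```

In this sketch the point-biserial correlation between SE and the hiring decision is roughly 0.4 under the composite but falls below 0.1 under the low cutoff: the nominal weights chosen in validation research no longer describe the selection actually being made.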
We believe that an accumulation of criterion-based validity informa-
tion jointly involving GCA- and SE-based measures is essential to assess
the utility, fairness, and transportability of current fire selection practices.
In fire service, replications employing longitudinal designs with multi-
variate predictors and criteria are likely to be more informative than larger
N concurrent validation studies that examine a single predictor. Although
test developers are currently providing relatively little of this information,
fire departments with detailed personnel records might be a potential re-
source for retrospective criterion validation research when selection test
scores and training data are available.

REFERENCES

Arnold JD, Rauschenberger JM, Soubel WG, Guion RM. (1982). Validation and utility of
a strength test for steelworkers. Journal of Applied Psychology, 67, 588–604.
Arvey RD, Nutting SM, Landon TE. (1992). Validation strategies for physical ability testing
in police and fire settings. Public Personnel Management, 21, 301–312.
Åstrand P, Rodahl K. (1986). Textbook of work physiology. New York: McGraw-Hill.
Barrett GV, Polomsky MD, McDaniel MA. (1999). Selection tests for firefighters:
A comprehensive review and meta-analysis. Journal of Business and Psychology,
13, 507–514.
Beaton RU, Murphy S, Salazar M, Johnson LC. (2002). Neck, back, and shoulder pain
complaints in urban firefighters: The benefits of aerobic exercise. Journal of Mus-
culoskeletal Pain, 10, 57–67.
Berger RA. (1963). Classification of students on the basis of strength. Research Quarterly,
34, 514–515.
Berk RA. (1996). Standard setting: The next generation (where few psychometricians have
gone before!). Applied Measurement in Education, 9, 215–235.
Berkman v. City of New York, 536 F. Supp. 177 (E. D. N. Y. 1982).
Biddle DA, Nooren PM. (2006). Validity generalization vs. Title VII: Can employers
successfully defend tests without conducting local validation studies? Labor Law
Journal, 57, 216–237.
Blakley BR, Quinones MA, Crawford MS, Jago IA. (1994). The validity of isometric
strength tests. Personnel Psychology, 47, 247–274.
Bobko P. (1983). An analysis of correlations corrected for attenuation and range restriction.
Journal of Applied Psychology, 68, 584–589.
Bownas DA, Heckman RW. (1976). Job analysis of the entry level firefighter position.
(Report No. TS 77–8). Washington, DC: U. S. Office of Personnel Management.
Bradley et al. v. City of Lynn et al., 443 F. Supp. 2d 145 (D. Mass. 2006).
Brown LE, Weir JP. (2001). ASEP procedures recommendation I: Accurate assessment of
muscular strength and power. Journal of Exercise Physiology online, 4, 1–21.
Browne MW, Cudeck R. (1993). Alternative ways of assessing model fit. In Bollen KA,
Long JS (Eds.), Testing structural equation models (pp. 136–162). Newbury Park,
CA: Sage.
Cady LD, Bischoff DP, O’Connell ER, Thomas BA, Allan JH. (1979). Strength and fitness
and subsequent back injuries in firefighters. Journal of Occupational Medicine, 21,
269–272.
Campbell JP. (1982). Some remarks from an outgoing editor. Journal of Applied Psychol-
ogy, 67, 691–700.
Carroll JB. (1993). Human cognitive abilities. Cambridge: Cambridge University Press.
Clark W. (1973). Firefighter principles and practices. New York: Dunn-Donnelley.
Cleary TA, Humphreys LG, Kendrick SA, Wesman A. (1975). Educational uses of tests
with disadvantaged students. American Psychologist, 30, 15–41.
Cortez L. (2001). Fire company staffing requirements: An analytical approach. Fire Tech-
nology, 37, 199–218.
Cotton DJ. (1971). A modified step test for group cardiovascular testing. Research Quar-
terly, 42, 91–95.
Coward WM, Sackett PR. (1990). Linearity of ability-performance relationships: A recon-
firmation. Journal of Applied Psychology, 75, 297–300.
Cronbach LJ. (1975). Five decades of public controversy over mental testing. American
Psychologist, 30, 1–14.
Cucina JM, Vasilopoulos NL. (2005). Non-linear personality-performance relationships and
the spurious moderating effects of traitedness. Journal of Personality, 73, 227–259.
Davis JA, Wilmore JH. (1990). Validation of a step test for cardiorespiratory fitness clas-
sification of emergency service personnel. Journal of Occupational Medicine, 21,
671–673.
Davis PO, Dotson CO, Santa Maria DL. (1982). Relationship between simulated fire
fighting tasks and physical performance measures. Medicine and Science in Sports
and Exercise, 14, 65–71.
Dawes RM, Corrigan B. (1974). Linear models in decision making. Psychological Bulletin,
81, 95–106.
Doolittle TL, Kaiyala K. (1986). Strength and musclo-skeletal injuries of firefighters.
Proceedings of the Annual Conference of the Human Factors Association of Canada,
49–52. Vancouver, BC.
Draper N, Smith H. (1981). Applied regression analysis. (Revised ed.). New York: Wiley.
EEOC v. Atlas Paper, 868 F.2d 1487 (6th Cir. 1989), cert. denied, 493 U.S. 814.
Fiske DW. (1971). Measuring the concepts of personality. Chicago: Aldine-Atherton.
Grofman B, Merrill S. (2004). Anticipating likely consequences of lottery-based affirmative
action. Social Science Quarterly, 85, 1447–1468.
Gutman A. (2009). Major EEO issues relating to personnel selection decisions. Human
Resource Management Review, 19, 232–250.
Harbin G, Olson J. (2005). Post-offer, pre-placement testing in industry. American Journal
of Industrial Medicine, 47, 296–307.
Henderson ND. (1985). Validity and utility of a cognitive-physical screening test for fire-
fighters. (Technical report). Oberlin, OH: Henderson & Associates.
Henderson ND, Berry MW, Matic T. (2007). Field measures of strength and fitness predict
firefighter performance on physically demanding tasks. Personnel Psychology,
60, 431–473.
Hoffman CC. (1999). Generalizing physical ability test validity: A case study using test
transportability, validity generalization, and construct related validation evidence.
Personnel Psychology, 52, 1019–1042.
Hogan J. (1991a). Physical abilities. In Dunnette MD, Hough LM (Eds.), Handbook of
industrial-organizational psychology (2nd ed., pp. 753–831). Palo Alto, CA: Con-
sulting Psychologists Press.
Hogan J. (1991b). The structure of physical performance in occupational tasks. Journal of
Applied Psychology, 76, 495–507.
Hunter JE, Schmidt FL. (1990). Methods of meta-analysis: Correcting error and bias in
research findings. Beverly Hills, CA: Sage.
Hunter JE, Schmidt FL. (1996). Intelligence and job performance: Economic and social
implications. Psychology, Public Policy and Law, 2, 447–472.
Hunter JE, Schmidt FL. (2004). Methods of meta-analysis: Correcting error and bias in
research findings. Thousand Oaks, CA: Sage.
Hunter JE, Schmidt FL, Le H. (2006). Implications of direct and indirect range restriction
for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594–
612.
International Public Management Association for Human Resources. (1996). B3R and B4R
Technical Report. Alexandria, VA: Author.
International Public Management Association for Human Resources. (2009). B-5/B-5a and
300 Series Technical Report. Alexandria, VA: Author.
Jackson AS. (1994). Preemployment physical evaluation. Exercise and Sport Sciences
Reviews, 22, 53–90.
Jöreskog KG, Sörbom D. (1996). Structural equation modeling. Workshop presented for
the NORC Social Science Research Professional Development Training Sessions.
Chicago, IL.
Klein GA. (1996). The effect of acute stressors on decision making. In Driskell JE, Salas E
(Eds.), Stress and human performance (pp. 49–88). Mahwah, NJ: Erlbaum.
Landy, Jacobs, Associates. (1987). Report on the criterion-related validity of the physical
capabilities test (PCT) for entry level firefighter positions in Columbus, Ohio. State
College, PA: Author.
Lanning v. Southeastern Pa. Transp. Auth. (Lanning I), 181 F.3d 478 (3d Cir. 1999).
Lanning v. Southeastern Pa. Transp. Auth. (Lanning II), 308 F.3d 286, 290 (3d Cir. 2002).
Larsen GE, George JD, Alexander JL, Fellingham GW, Aldana SG, Parcell AC. (2002).
Prediction of maximum oxygen consumption from walking, jogging, or running.
Research Quarterly for Exercise and Sport, 73, 66–72.
Le HA. (2003). Correcting for indirect range restriction in meta-analysis: Testing a new
meta-analytic method. Unpublished doctoral dissertation, University of Iowa.
Lubinski D. (2004). Introduction to the special section on cognitive abilities: 100 years after
Spearman’s (1904) “general intelligence,” objectively determined and measured.
Journal of Personality and Social Psychology, 86, 96–111.
Luke v. City of Cleveland, 2006 WL 43759 (N.D. Ohio).
MacCallum RC, Austin JT. (2000). Applications of structural equation modeling in psycho-
logical research. In Fiske ST, Schacter DL, Zahn-Waxler C (Eds.), Annual review
of psychology (Vol. 51, pp. 201–226). Palo Alto, CA: Annual Reviews, Inc.
MacCallum RC, Widaman KF, Zhang S, Hong S. (1999). Sample size in factor analysis.
Psychological Methods, 4, 84–99.
Marsh HW. (1993). The multidimensional structure of physical fitness: Invariance
over gender and age. Research Quarterly for Exercise and Sport, 64, 256–
273.
Mead TP, Legg DL. (1994). Exploratory versus confirmatory factor analysis of collegiate
physical fitness. Education Resources Information Center (ED 379336), 1–10.
Medbo JI, Tabata I. (1989). Relative importance of aerobic and anaerobic energy release
during short-lasting exhausting bicycle exercise. Journal of Applied Physiology, 67,
1881–1886.
Mendoza JL, Bard DE, Mumford MD, Ang SC. (2004). Criterion-related validity in multiple
hurdle designs: Estimation and bias. Organizational Research Methods, 7, 418–
441.
Mendoza JL, Mumford M. (1987). Correction for attenuation and range restriction on the
predictor. Journal of Educational Statistics, 12, 282–293.
Meyers DC, Gebhardt DL, Crump CE, Fleishman EA. (1993). The dimensions of human
physical performance: Factor analyses of strength, stamina, flexibility and body
composition measures. Human Performance, 6, 309–344.
Morgeson FP, Campion MA, Dipboye RL, Hollenbeck JR, Murphy K, Schmitt N. (2007a).
Reconsidering the use of personality tests in personnel selection contexts. Personnel
Psychology, 60, 683–729.
Morgeson FP, Campion MA, Dipboye RL, Hollenbeck JR, Murphy K, Schmitt N. (2007b).
Are we getting fooled again? Coming to terms with limitations in the use of person-
ality tests for personnel selection. Personnel Psychology, 60, 1029–1049.
Murphy KR, Dzieweczynski JL. (2005). Why don’t measures of broad dimensions of
personality perform better as predictors of job performance? Human Performance,
18, 343–357.
Orasanu J. (1997). Stress and naturalistic decision making: Strengthening the weak links. In
Flin R, Salas E, Strub M, Martin L (Eds.), Decision making under stress: Emerging
themes and applications (pp. 43–66). Aldershot: Ashgate.
Potosky D, Bobko P, Roth PL. (2005). Forming composites of cognitive ability and alter-
native measures to predict job performance and reduce adverse impact: Corrected
estimates and realistic expectations. International Journal of Selection and Assess-
ment, 13, 304–315.
Rhea MR, Alvar BA, Gray R. (2004). Physical fitness and job performance of firefighters.
Journal of Strength and Conditioning Research, 18, 348–352.
Rosenfeld M, Thornton RF. (1976). The development and validation of a firefighter phys-
ical selection test for the City of Philadelphia. Princeton, NJ: Educational Testing
Service, Center for Occupational and Professional Assessment.
Rosse JG, Stecher MD, Miller JL, Levin RA. (1998). The impact of response distortion
on preemployment personality testing and hiring decisions. Journal of Applied
Psychology, 83, 634–644.
Ryan AM, Greguras GJ, Ployhart RE. (1996). Perceived job relatedness of physical ability
testing for firefighters: Exploring variations in reactions. Human Performance, 9,
219–240.
Ryan AM, Ployhart RE, Friedel LA. (1998). Using personality testing to reduce adverse
impact: A cautionary note. Journal of Applied Psychology, 83, 298–307.
Sackett PR, Lievens F, Berry CM, Landers RN. (2007). A cautionary note on the effects of
range restriction on predictor intercorrelations. Journal of Applied Psychology, 92,
538–544.
Sackett PR, Yang H. (2000). Correction for range restriction: An expanded typology.
Journal of Applied Psychology, 85, 112–118.
Sarno MR. (2003). Issues in the third circuit: Employers who implement pre-employment
tests to screen their applicants, beware (or not?): An analysis of Lanning v. South-
eastern Pennsylvania Transportation Authority and the business necessity defense
as applied in Third Circuit employment discrimination cases. Villanova Law Review,
48, 1403–1428.
Schmidt FL, Hunter J. (2004). General mental ability in the world of work: Occupational
attainment and job performance. Journal of Personality and Social Psychology, 86,
162–173.
Schmidt FL, Hunter J, Outerbridge AN, Goff S. (1988). Joint relation of experience and
ability with job performance: Test of three hypotheses. Journal of Applied Psychol-
ogy, 73, 46–57.
Schmidt FL, Shaffer JA, Oh IS. (2008). Increased accuracy for range restriction corrections:
Implications for the role of personality and general mental ability in job and training
performance. Personnel Psychology, 61, 827–868.
Schmitt N, Rogers W, Chan D, Sheppard L, Jennings D. (1997). Adverse impact and pre-
dictive efficiency of various predictor combinations. Journal of Applied Psychology,
82, 719–730.
Sharkey BJ, Davis PO. (2008). Hard work: Defining physical work performance require-
ments. Champaign, IL: Human Kinetics.
Shaver LG. (1971). Maximum dynamic strength, relative dynamic endurance, and their
relationships. Research Quarterly, 42, 460–465.
Sothmann MS, Gebhardt DL, Baker TA, Kastello GM, Sheppard VA. (2004). Performance
requirements of physically strenuous occupations: Validating minimum standards
for muscular strength and endurance. Ergonomics, 47, 864–875.
Stauffer JM, Mendoza JL. (2001). The proper sequence for correcting correlation coeffi-
cients for range restriction and unreliability. Psychometrika, 66, 63–68.
Steiger JH, Lind JM. (1980). Statistically based tests for the number of common factors.
Paper presented at the annual meeting of the Psychometric Society, Iowa City, Iowa.
Stevenson JM, Weber CL, Smith JT, Dumas GA, Albert WJ. (2001). A longitudinal study
of the development of low back pain in an industrial population. Spine, 26, 1370–
1377.
Terman LM. (1917). A trial of mental and pedagogical tests in a civil service examination
for policemen and fireman. Journal of Applied Psychology, 1, 17–29.
United States v. City of New York, 637 F. Supp. 2d 77 (E.D.N.Y. 2009).
Wilcox RR. (2001). Fundamentals of modern statistical methods: Substantially improving
power and accuracy. New York: Springer-Verlag.
Winer BJ, Brown DR, Michels KM. (1991). Statistical principles in experimental design.
New York: McGraw-Hill.
Winkelspecht C, Lewis P, Thomas A. (2006). Potential effects of faking on the NEO-PI-R:
Willingness and ability to fake changes who gets hired in simulated selection
decisions. Journal of Business and Psychology, 21, 243–259.
Wolff WM, North AJ. (1951). Selection of municipal firemen. Journal of Applied Psychol-
ogy, 35, 25–29.