
VIENNA TEST SYSTEM

MANUAL
HALPERN CRITICAL
THINKING ASSESSMENT
Test Label HCTA

Version 51 – Revision 1

Mödling, February 2016


Copyright © 2010 by SCHUHFRIED GmbH
Author of the test D. F. Halpern

SCHUHFRIED GmbH, Hyrtlstraße 45, 2340 Mödling, Austria


Tel. +43/2236/42315-0, Fax: +43/2236/46597
info@schuhfried.at www.schuhfried.at
Sitz: Mödling, FN 104661p
Landesgericht Wr. Neustadt, UID Nr. ATU 19273809
HCTA

CONTENTS
1 OVERVIEW ............................................................................................ 3

2 SUMMARY ............................................................................................ 4

3 DESCRIPTION OF THE TEST .............................................................. 6
3.1 Theoretical background ...................................................................... 6
3.2 Test structure .................................................................................... 10
3.3 Description of Variables .................................................................... 11

4 EVALUATION ....................................................................................... 14
4.1 Objectivity ......................................................................................... 14
4.2 Reliability .......................................................................................... 14
4.3 Validity .............................................................................................. 16
4.4 Economy ........................................................................................... 28
4.5 Usefulness ........................................................................................ 28
4.6 Reasonableness ............................................................................... 28
4.7 Resistance to faking ......................................................................... 28
4.8 Fairness ............................................................................................ 28

5 NORMS ................................................................................................ 29
5.1 Description of the norm samples ...................................................... 29

6 TEST ADMINISTRATION ..................................................................... 32
6.1 Instruction and practice phase .......................................................... 32
6.2 Test phase ........................................................................................ 33
6.3 Scoring module in test form S1 and S3 ............................................ 33

7 INTERPRETATION OF TEST RESULTS ............................................. 37
7.1 General notes on interpretation ........................................................ 37
7.2 Interpretation of the variables of HCTA ............................................ 37
7.3 Additional output of results ............................................................... 37

8 GENERAL CONCLUSION ................................................................... 39

9 REFERENCES ..................................................................................... 40


1 OVERVIEW
Employers, educators, and the general public all agree that critical thinking is an essential
skill for citizens of the 21st century. It is the primary objective of education and a core ability
that employers want for their prospective and current employees. A study by the
Association of American Colleges and Universities (Hart Research Associates, 2013, p. 1)
found that “nearly all employers surveyed (93 percent) say that ‘a demonstrated capacity to
think critically, communicate clearly, and solve complex problems is more important than [a
candidate’s] major.’” Similar results have been reported in a wide range of employer and
academic surveys. But what is critical thinking, and how can we assess it?
The Halpern Critical Thinking Assessment (HCTA) was designed to help educators and
employers assess critical thinking skills in their students and employees.

Unique characteristics of the HCTA

The following manual explains the unique properties of the HCTA. The central question is:
what makes a measure of critical thinking good? A good measure can predict how people
act in the real world. The HCTA is the only measure of critical thinking that has real-world
validity—it can predict what adults do (more precisely, what they say they do) in their daily
lives. These studies are described in the validity section below.

The HCTA provides a measure of how people think when they contemplate information that
relates to real-world experiences. The HCTA uses two response formats: constructed
response (how people first respond to a situation, in their own words) and forced choice (a
measure of how well they can recognize a good response). It is the only measure of critical
thinking to use two types of response formats.

Even though most experts agree that constructed responses are usually the best measure
of what people actually think and do, test users often avoid them because of the time
needed to grade the responses. Another common problem with constructed responses is
that it is difficult to achieve good interrater reliabilities. The HCTA addresses both problems
with a computerized grading system: it prompts the grader with simple questions about
each constructed response, which makes grading easy and relatively fast for anyone, and
the numerical grade is then computed automatically. As a result, interrater reliabilities are
very high.

Finally, the HCTA is easy to use. There are several options for users. The HCTA can be
administered online or offline. If test administrators opt to administer the test online,
respondents are sent a link (via e-mail or another online method, such as a link in a
spreadsheet) which opens a screen where the respondent provides basic demographic
information and then takes the HCTA online. The grading of the constructed responses
can be done by the test administrator; this is accomplished with the use of grading
prompts. Test administrators receive HCTA test scores (along with subscores and norms)
in various output formats (e.g., SPSS, CSV, etc.). The easiest alternative involves very little
work on the part of the administrator: provide test takers with a link and then receive results
fully scored with norms. Alternatively, test administrators who want more control over the
testing environment can opt to install the Vienna Test System on their own computers and
use the system to administer and grade the tests. The choice is yours. The HCTA is
culturally fair. It is
currently being used in many countries and languages around the world with comparable
norms.


2 SUMMARY
Author
Diane F. Halpern, Ph.D.

Application
The HCTA is designed to assess critical thinking skills for respondents aged 18 years and
older. Main areas of application are education and personnel selection (e.g.,
preemployment testing and promotion and retention decisions).

Theoretical background
The Halpern Critical Thinking Assessment was designed to include constructs that are most
commonly listed in definitions of critical thinking. The test focuses on five dimensions of
critical thinking: verbal reasoning, argument analysis, thinking as hypothesis testing,
likelihood and uncertainty, and decision making and problem solving. Taken together, these
five dimensions constitute the skills of critical thinking.

Administration
Respondents are presented with 20 everyday scenarios. For each scenario, they first
provide brief constructed responses and then select answers from a list of possible
alternatives (forced choice options), thereby providing separate measures of recall and
recognition memory.

Test forms
HCTA offers four test forms, which are intended to be used as standard or screening
versions of the test, respectively. There are two versions of the HCTA—Version A (S1 and
S2) and Version B (S3 and S4). Scenarios used in Versions A and B are analogues,
meaning that they cover the same skills (e.g., not confusing correlation with cause or
recognizing when a sample is too small) in different contexts. By having two versions,
respondents can take the HCTA twice without the possible contamination of memory for
test items. For Forms S1 and S3, respondents answer questions about everyday scenarios
using both constructed responses and forced choice alternatives. Forms S2 and S4 consist
entirely of forced choice items.

Scoring
The following variables are scored: (a) Total Critical Thinking Score, which combines
constructed and forced choice items; (b) Critical Thinking Score—Constructed Responses;
(c) Critical Thinking Score—Forced Choice Responses. There are also three separate
scores for each of the five dimensions of critical thinking: verbal reasoning, argument
analysis, thinking as hypothesis testing, likelihood and uncertainty, and decision making
and problem solving. Each dimension has a total score based on both constructed
responses and forced choice responses, a forced choice score, and a constructed
response score.

Reliability
Internal consistency (Cronbach’s Alpha) lies between α=0.68 and α=0.88. A unique scoring
system allows for high reliability in scoring the constructed response items.


Validity
Numerous validation studies were conducted with a wide variety of samples. As reviewed in
this document, scores increase with level of education, selectivity of the sample, formal
course work designed to enhance critical thinking, college-level grades, and scores on
standardized examinations; scores decrease with the assessed likelihood that people base
their responses on preconceptions. The HCTA is the only critical thinking assessment that
has been validated by predicting what people do in their everyday lives.

Norms
Norm samples of adults are available for all test forms. The norm data of test forms S1 and
S2 were gathered between 2009 and 2014 under the guidance of the test author. The 2015
norm sample comprised 482 respondents aged between 18 and 72. The mean age was
27.21 years with a standard deviation of 10.2 years. The median age was 23 years. It
consisted of 178 (36.6%) males and 228 (46.9%) females; there were no data on the
gender of the remaining 80 (16.5%) respondents.
The norm data of test forms S3 and S4 were gathered between 2014 and 2015 under the
guidance of the test author. The 2015 norm sample comprised 313 respondents aged
between 19 and 66. The mean age was 25.23 years with a standard deviation of 8.242
years. The median age was 21 years. It consisted of 78 (24.9%) males and 132 (42.2%)
females. There were no data on the gender of the remaining 103 (32.9%) respondents.

Time required for the test


Approx. 15 – 50 minutes, depending on test form.


3 DESCRIPTION OF THE TEST


3.1 Theoretical background
Although the ability to think critically has always been important, it is a vital necessity for the
citizens of the 21st century. Every generation needs more education and higher level
thinking skills than the generation that came before because the world is becoming
increasingly technical and complex. As many experts in assessment have noted, IQ tests
measure only a subset of the thinking skills that people need to be successful in life. In his
book on what intelligence tests miss, Stanovich (2009, p. 3) wrote “IQ tests are good
measures of how well a person can hold beliefs in short-term memory and manipulate
those beliefs, but they do not assess at all whether a person has the tendency to form
beliefs rationally when presented with evidence.” What we really want to gauge for our
politicians, lawyers, doctors, citizens who vote, employees at all levels, and everyone else
is their ability to think critically, which intelligence tests largely fail to measure. In a series of
experiments, Stanovich (2009, p. 39) has found that “Rational thinking can be surprisingly
dissociated from intelligence.”

Employers and educators agree that critical thinking is among the most important skills for
their prospective and current employees at all levels of employment. Hart Research
Associates (2009) polled a wide variety of employers about the intellectual and practical
skills needed for today’s jobs. The skills employers want their employees to have are (1)
the ability to communicate effectively, orally and in writing (89%); (2) critical thinking and
analytical reasoning skills (81%); (3) the ability to analyze and solve complex problems
(75%); (4) the ability to innovate and be creative (70%); (5) the ability to locate, organize,
and evaluate information from multiple sources (68%); and (6) the ability to work with
numbers and understand statistics (63%). The HCTA assesses all of these skills. Eighty-
eight percent of employers agreed with the statement, “To succeed in our company,
employees need higher levels of learning and knowledge today than they did in the past.”
Other employer surveys report similar results.

Multiple sources have recognized the primacy of critical thinking skills in contemporary
society. In his award-winning book, Earl Hunt (1995) examined the skills that will be needed
by our workforce in the early decades of this century and asked, “Will we be smart
enough?” Our quality of life, perhaps even the future of our planet, depends on how we
answer this question. The workforce is one critical place where we can witness the dizzying
pace of change. There is an increased demand for a new type of worker—the “knowledge
worker” or the “symbol analyst,” a phrase that is used by the United States Secretary of
Labor to describe someone who can carry out multi-step operations, manipulate abstract
and complex symbols and ideas, acquire new information efficiently, and remain flexible
enough to recognize the need for continuing change and for new paradigms for lifelong
learning. Workers in almost every job category can expect to face novel problems in a
continually changing workplace. Familiar responses no longer work, and even newly
acquired ones will not work for long. Thus, critical thinking is essential for complex work.

The role of critical thinking in education has been emphasized by numerous recent authors
(cf. e.g. Ananiadou & Claro, 2009; Trilling & Fadel, 2009). A good example of the inclusion
of critical thinking in recent frameworks for learning is the model presented by the
“Partnership for 21st century learning” (www.p21.org). In their framework for “21st century
learning”, critical thinking is named as a learning and innovation skill that helps students
prepare for an increasingly complex life and work environment.

The HCTA was developed to assess common constructs that define critical thinking; to
improve educational assessment, so that learning institutions can actually measure
whether they are enhancing the critical thinking skills of their students; and to refine
personnel selection and promotion processes in business and industry.

The term “critical thinking” is sometimes interpreted as a negative trait by people who are
unfamiliar with the concept because of the negative connotation of the word "critical," which
suggests criticism, negativity, opposition, and/or argumentativeness. This type of
interpretation is unfortunate, because the use of the term critical is meant to imply critique
or evaluation, which are positive traits associated with good thinking. It is intended to
connote effortful, careful, consciously controlled processing that maximizes the use of all
available evidence and cognitive strategies, and purposefully strives to overcome individual
biases (Riggio & Halpern, 2006). The term “critical” may be compared (etymologically as
well) to the word “skeptical,” which means thoughtful. Given the widespread availability of
misinformation on the internet and other popular outlets, people should always be
encouraged to be skeptical, or thoughtful, in their evaluation of incoming information and
development of judgments and decisions. Skepticism (habitual thoughtfulness) and its
application in the real world (critical thinking) is not cynicism; skepticism is a positive trait
that contributes to individual and societal well-being through its carefulness, awareness,
and evidentiary basis.

In an extensive review of the critical thinking literature, Fischer and Spiker (2000) found that
most definitions for the term “critical thinking” include reasoning/logic, judgment,
metacognition, reflection, questioning, and mental processes. Jones and his colleagues
(Jones, Dougherty, Fantaske, & Hoffman, 1995; Jones, Hoffman, Moore, Ratcliff, Tibbetts,
& Click, 1995) obtained consensus from among 500 policy makers, employers, and
educators, who agreed that critical thinking is a broad term that describes reasoning in an
open-ended manner, with an unlimited number of solutions. It involves constructing a
situation and supporting the reasoning that went into a conclusion.

Critical thinking is a multidimensional construct; accordingly, the assessment of critical
thinking is necessarily multidimensional. There are five category headings used for
organizing the HCTA (Halpern, 1994; 1998; 2003):

• Verbal Reasoning Skills:
The skills listed under this rubric include those skills that are needed to comprehend
and defend against the persuasive techniques that are embedded in everyday
language (also known as natural language). Thinking and language are closely tied
constructs, and the skills included in this category recognize the reciprocal
relationship between language and thought in which an individual's thoughts
determine the language used to express them, and the language that is used
shapes the thoughts. Two examples of verbal reasoning skills are recognizing when
a pejorative label is being used to sway thinking (e.g., "the conservative idea
that . . ." or “the liberal-leaning politician”) and recognizing that an issue has been
framed by the use of a nonstandard definition of a term that is critical in the context
(e.g., "honest people are people who pay their taxes").
• Argument Analysis Skills:
An argument is a set of statements with at least one conclusion and one reason that
supports the conclusion. In real life settings, arguments are complex with reasons
that run counter to the conclusion, stated and unstated assumptions, irrelevant
information, and intermediate steps between the conclusions and the evidence that
supports them. Arguments are found in commercials, political speeches, textbooks,
and anywhere else where reasons are presented in an attempt to get the reader or
listener to believe that the conclusion is true. The skills of identifying conclusions,
rating the quality of reasons, and determining the overall strength of an argument
are essential in understanding complex and extended arguments.


• Skills in Thinking as Hypothesis Testing:
The rationale for this category is that much of our day-to-day thinking is like the
The rationale for this category is that much of our day-to-day thinking is like the
scientific method of hypothesis testing. In many of our everyday interactions, people
function like intuitive scientists in order to explain, predict, and control the events in
their life. The skills used in thinking as hypothesis testing are the same ones that are
used in scientific reasoning - the accumulation of observations, formulation of beliefs
or hypotheses, and the use of the information collected to decide if it confirms or
disconfirms the hypotheses. Critical thinkers recognize when a critical comparison is
missing or when generalizations are made from small or biased samples.
• Using Likelihood and Uncertainty:
Because very few events in life can be known with certainty, the correct use of
probability and likelihood plays a critical role in almost every decision. The critical
thinking skills that are subsumed under this heading are an important dimension of
higher order thinking. An example of a likelihood and uncertainty skill is the
recognition that base rates are critical in determining the probability of outcomes.
• Decision Making and Problem Solving Skills:
In some sense, all of the critical thinking skills are used to make decisions and solve
problems, but the skills that are included in this category involve the use of multiple
problem statements to define the problem and identify possible goals, the
generation and selection of alternatives, and the use of explicit criteria to judge
among alternatives. Many of these skills are especially useful in quantitative
reasoning problems. An example of decision making and problem solving skills is
recognizing that alternatives need to be weighed for both positive and negative
outcomes.

Taken together, these five categories define an organizational rubric for a skills approach to
critical thinking. They have face validity and can be easily communicated to the general
public, and they offer one possible answer to the question of what students need to know
and be able to do when they enter the workforce or what employees need to know and be
able to do to advance to a stage in their career that requires higher order thinking skills.
These categories of critical thinking skills are commonly used on measures designed to
assess educational outcomes (e.g., American Psychological Association’s Outcomes
Assessment Taskforce, 2009). A skills approach to critical thinking has the benefit of
focusing on skills that are teachable, testable, and generalizable. These are skills that are
needed for success in the workplace, in the home, and in handling the other complexities of
modern life. These are not, of course, independent skills; many real world tasks require
the use of several of these skills and often the selection of the best thinking skill for a given
task. It is useful to consider these categories as test specifications, just as one would
include some multiplication, division, addition, and subtraction problems in a test of
computational mathematics.

3.1.1 Concept and item development


Halpern’s (1998, 2003; 2014) model for teaching critical thinking describes the dispositions
that people must have to become critical thinkers, lists a set of critical thinking skills to be
acquired, emphasizes the importance of learning the structure of an argument or problem,
and promotes the development of metacognition, an awareness of the outcomes of one’s
cognitive processes. Critical thinking is a generalized skill that can be exhibited in a variety
of contexts and content domains. The HCTA has been in development for over two
decades, with numerous refinements and improvements in its psychometric properties. It
has been administered to multiple and diverse samples and is being tested in several
countries in multiple languages.


The HCTA consists of 20 everyday scenarios, each of which is briefly described and
presented using common language. For each scenario, respondents are first asked an
open ended (i.e., constructed response) question, which is followed by a forced choice
question (e.g., multiple choice, ranking, or rating of alternatives), such as: select the best
alternative, rate each of the alternatives in terms of its relevance, or indicate which two of
the following alternatives indicate a good response. Cognitive psychologists differentiate
between free recall and recognition processes in memory and these two types of questions
are designed to take advantage of the different cognitive processes. The total score is
(approximately) equally weighted between constructed response and forced choice
questions.

There are 4 scenarios for each critical thinking category - decision making and problem
solving, thinking as hypothesis testing, argument analysis, likelihood and uncertainty, and
verbal reasoning. Although there are an equal number of scenarios for each critical thinking
category, some categories were worth more total points than other categories in their
contribution to the total critical thinking score. Scenarios were written to reflect common
experiences across cultures in industrialized societies—for example understanding
information provided in a news program or considering the design of an intervention to
improve employee morale. They have been tested in many countries (e.g., United States,
Canada, Mexico, Spain, Portugal, Belgium, China, Poland, and Vietnam) where native
translators made few changes to the scenarios to reflect local cultures. The categories
were weighted as follows; the rationale for each weighting reflects the category's relative
importance and contribution to critical thinking:

• Decision making and problem solving (approximately 31%):
In some sense all of the subtypes of critical thinking skills are involved in decision
making (generating and selecting from alternatives based on relevant criteria) and
problem solving (finding solutions to a situation, or more colloquially, moving from a
start space to a goal). Because this category relies on subsets of the other critical
thinking skills (e.g., recognizing that an unlikely event is not an optimal choice when
making decisions, or examining the reasons for a course of action) and, at least
potentially, involves an almost unlimited number of options, it was weighted with
more total points than the other categories.
• Thinking as hypothesis testing (approximately 22%):
The skills of hypothesis testing are not restricted to evaluating formal research; they
are (or should be) used in multiple everyday situations. Faulty thinking often involves
hasty generalizations from small samples of behavior (e.g., a new friend is late and
the respondent generalizes that the new friend must be habitually late) or failure to
consider control conditions (e.g., a cold gets better after taking a vitamin
supplement, but there is no consideration that it might have gotten better without the
supplement).
• Argument analysis (approximately 23%):
Too often people reach conclusions without consideration of the reasons that
support or fail to support the conclusion. The ability to seek and provide reasons and
to recognize the differences between conclusions and assumptions is critical for
good thinking. It is the difference between uninformed opinions and reasoned
thinking.
• Likelihood and uncertainty (approximately 13%):
A basic understanding of probabilities, how they affect the likelihood of an outcome,
and how to use them in uncertain situations is an essential component of critical
thinking, but these skills are unlikely to develop beyond a rudimentary level without
formal instruction. Many concepts relating to likelihood and uncertainty, such as
regression to the mean (an extreme event is likely to be followed by a less extreme
event) and the gambler's fallacy (if a fair coin comes up heads in 3 flips, a tail is not
more likely on the 4th flip; see the short simulation after this list), are
counterintuitive. Thus, although these are important concepts, the likelihood and
uncertainty category was given a lower weight than some of the other categories so
as not to penalize test takers who have not had any formal education in
understanding likelihood and uncertainty.
• Verbal reasoning (approximately 11%):
The ability to understand how natural language influences thinking is also an
essential component of critical thinking, but it was given a lower weighting, in part,
because the connotation of words varies among languages (e.g., a word like 'tramp'
is a difficult concept to convey in many languages). The relative weighting was
designed so as not to penalize test takers whose native language is not English, or
for other language versions where connotations of a single word can slant a
question.
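
As a concrete illustration of the gambler's fallacy mentioned above, the following short
simulation (an illustrative Python sketch, not part of the HCTA materials) estimates the
probability of a tail on the 4th flip of a fair coin given that the first three flips were heads;
the estimate stays near 0.5.

    import random

    def tail_after_three_heads(trials=1_000_000):
        """Estimate P(tail on flip 4 | flips 1-3 were heads) for a fair coin."""
        runs_with_three_heads = 0
        tails_on_fourth = 0
        for _ in range(trials):
            flips = [random.random() < 0.5 for _ in range(4)]  # True = heads
            if all(flips[:3]):                 # condition on three heads
                runs_with_three_heads += 1
                if not flips[3]:               # fourth flip came up tails
                    tails_on_fourth += 1
        return tails_on_fourth / runs_with_three_heads

    print(tail_after_three_heads())  # ~0.5: a tail is no more likely on flip 4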

3.2 Test structure

HCTA offers two standard versions (S1 for HCTA Form A and S3 for HCTA Form B) and
two short versions (S2 for HCTA Form A and S4 for HCTA Form B).

Standard Test Forms (S1 and S3):
HCTA presents test-takers with 20 everyday scenarios that are common situations in the
lives of individuals. Topic selection for the scenarios was made by consensus with input
from test-takers and faculty in a variety of disciplines over a period of several years early in
the test development. In general, the scenarios were drawn from multiple disciplines, such
as medical research and social policy analysis. These
scenarios are examples of situations that might be found in newspapers and in everyday
conversations. Here is a hypothetical scenario similar to those presented in the HCTA:

Adult alcoholics often suffer from depression. A therapist suggested that one way
of helping these individuals become "clean and sober" is to relieve their
depression with the use of anti-depression drugs.

Test-takers are asked to comment on the quality of this possible treatment. A good
response will recognize that the correlation between depression and alcoholism does not
mean that depression causes alcoholism or that relieving depression will necessarily cure
the alcoholism.

Each short scenario is followed by specific questions that probe for the thinking that is
involved when confronted with the situation that is described. The questions represent five
categories of critical thinking skills: verbal reasoning (e.g., recognizing the use of
persuasive or misleading language), argument analysis (e.g., recognizing reasons,
assumptions, and
conclusions in arguments), thinking as hypothesis testing (e.g., understanding sample size,
generalizations), using likelihood and uncertainty (e.g., applying relevant principles of
probability such as base rates), as well as decision making and problem solving (e.g.,
identifying the problem goal, generating and selecting solutions among alternatives).

The choice of the item format of these questions differs between the two test forms. Unlike
other tests of critical thinking, test forms S1 and S3 use both open ended/constructed
response and forced choice questions. Both response formats have advantages and
limitations. Furthermore, there is evidence that multiple-choice and open-ended responses
are measuring separable cognitive abilities (Bridgeman & Moran, 1997). Open-ended (also
known as constructed response) measures can show what respondents actually think when
confronted with a scenario. It would be ironic to attempt to assess “reasoning in an open-
ended manner,” which is one definition of critical thinking, using only forced choice
alternatives. Forced choice responses can show whether respondents can recognize good
answers when they are presented to them. Thus, the cognitive skills measured by means of
the constructed response format and the multiple choice item format can be characterized
as follows:

Constructed response:
Constructed response questions attempt to reveal more of the dispositional component of
thinking, as they allow test-takers to demonstrate whether they are inclined to apply the
appropriate skills (Ku, 2009). Essentially, the constructed response format measures “free
recall” as there are few constraints on the type of response that the test-taker may
generate. It requires test-takers to consciously search and select appropriate knowledge
and skills from their own memory in constructing an answer. Thus, the constructed
response items require higher-level cognitive processing. The disadvantage of constructed
response questions is that they could benefit people with good writing skills, and thus may
underestimate the critical thinking skills of mediocre writers. However, recall that employers
and educators want respondents who can communicate clearly; because the constructed
response format also assesses written communication skills, this disadvantage is often
considered an advantage, depending on the goals of the test administrator.

Forced choice:
After responding to a constructed response prompt, test-takers are then asked "to select
the best alternative" from a short list of alternatives. Thus, they are presented with forced
choice questions pertaining to the same scenario. The forced choice items measure
recognition memory and require test-takers to identify the appropriate response from a
given list of alternatives (Ku, 2009). The multiple choice or multiple rating items thus
demonstrate whether the respondent was able to recognize the appropriate answer when it
is presented in a list of alternatives. Therefore recognition constitutes a lower-level cognitive
skill. Forced choice items are also less ecologically valid since there are few instances in
real life where people are presented with an array of answers from which to select.

The time required for these test forms is approx. 50 minutes.

Short Test Forms (S2 and S4)


These test forms consist of the same 20 everyday scenarios used in the standard test
forms. However, in contrast to the standard test forms, they administer only the multiple
choice questions, providing a shorter and easy-to-use screening tool for assessing critical
thinking. Although these test forms are less informative than test forms S1 and S3, they
take considerably less time to complete.

The time required for these test forms is approx. 15 minutes.

3.3 Description of Variables
The information provided about the test-taker depends on the form of the HCTA that is
used. The short forms comprise only the recognition variables (i.e., forced choice
responses); the full (standard) forms allow the calculation of all of the variables listed
below.


Main variables

CT: Critical Thinking
Sum of ‘Critical Thinking – recognition’ and ‘Critical Thinking – free recall’ (158 points
possible)

CTR: Critical Thinking – recognition
Sum of ‘Verbal Reasoning – recognition’, ‘Argument Analysis – recognition’, ‘Thinking as
Hypothesis Testing – recognition’, ‘Likelihood and Uncertainty – recognition’ and ‘Decision
Making and Problem Solving – recognition’ (80 points possible)

CTF: Critical Thinking – free recall
Sum of ‘Verbal Reasoning – free recall’, ‘Argument Analysis – free recall’, ‘Thinking as
Hypothesis Testing – free recall’, ‘Likelihood and Uncertainty – free recall’ and ‘Decision
Making and Problem Solving – free recall’ (78 points possible)

Differentiated (component) variables

VR: Verbal Reasoning
Sum of ‘Verbal Reasoning – recognition’ and ‘Verbal Reasoning – free recall’ (17 points
possible)

VRR: Verbal Reasoning – recognition
Raw score of the forced choice verbal reasoning questions (5 points possible).

VRF: Verbal Reasoning – free recall
Raw score of the constructed response verbal reasoning questions (12 points possible).

AA: Argument Analysis
Sum of ‘Argument Analysis – recognition’ and ‘Argument Analysis – free recall’ (37 points
possible)

AAR: Argument Analysis – recognition
Raw score of the forced choice argument analysis questions (18 points possible).

AAF: Argument Analysis – free recall
Raw score of the constructed response argument analysis questions (19 points possible).

HT: Thinking as Hypothesis Testing
Sum of ‘Thinking as Hypothesis Testing – recognition’ and ‘Thinking as Hypothesis Testing
– free recall’ (34 points possible)

HTR: Thinking as Hypothesis Testing – recognition
Raw score of the forced choice thinking as hypothesis testing questions (20 points
possible).

HTF: Thinking as Hypothesis Testing – free recall
Raw score of the constructed response thinking as hypothesis testing questions (14 points
possible).

LU: Likelihood and Uncertainty
Sum of ‘Likelihood and Uncertainty – recognition’ and ‘Likelihood and Uncertainty – free
recall’ (21 points possible)

LUR: Likelihood and Uncertainty – recognition
Raw score of the forced choice likelihood and uncertainty questions (6 points possible).

LUF: Likelihood and Uncertainty – free recall
Raw score of the constructed response likelihood and uncertainty questions (15 points
possible).

PS: Decision Making and Problem Solving
Sum of ‘Decision Making and Problem Solving – recognition’ and ‘Decision Making and
Problem Solving – free recall’ (49 points possible)

PSR: Decision Making and Problem Solving – recognition
Raw score of the forced choice decision making and problem solving questions (31 points
possible).

PSF: Decision Making and Problem Solving – free recall
Raw score of the constructed response decision making and problem solving questions (18
points possible).

Additional variable

BT: Working time
This is the time taken to complete the test battery in minutes and seconds.
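
The additive structure of these variables can be made explicit in a few lines of code. The
following Python sketch is purely illustrative; the variable abbreviations and point maxima
are taken from the descriptions above. It verifies that the sub-scale maxima aggregate to
the main-variable maxima and derives each category's approximate weight in the total
score, matching the percentages given in section 3.1.1.

    # Maximum raw points per category: (recognition, free recall), as listed above.
    MAX_POINTS = {
        "VR": (5, 12),   # Verbal Reasoning
        "AA": (18, 19),  # Argument Analysis
        "HT": (20, 14),  # Thinking as Hypothesis Testing
        "LU": (6, 15),   # Likelihood and Uncertainty
        "PS": (31, 18),  # Decision Making and Problem Solving
    }

    ctr = sum(rec for rec, _ in MAX_POINTS.values())  # 80 points (CTR)
    ctf = sum(fr for _, fr in MAX_POINTS.values())    # 78 points (CTF)
    ct = ctr + ctf                                    # 158 points (CT)

    for name, (rec, fr) in MAX_POINTS.items():
        print(f"{name}: {rec + fr} points, {(rec + fr) / ct:.0%} of the total score")
    # VR: 11%, AA: 23%, HT: 22%, LU: 13%, PS: 31%, as weighted in section 3.1.1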


4 EVALUATION
4.1 Objectivity

Administration objectivity
Objectivity in test administration and scoring exists when the respondents’ test behavior,
and thus their test score, is independent of variations (either random or systematic) in the
behavior of the test administrator (Kubinger, 2003). Because administration of the HCTA is
computerized, all respondents receive the same information, presented in the same way,
regarding the test. These instructions are independent of the test administrator. Similarly,
test presentation is identical for all respondents.

Scoring objectivity
For the multiple choice portion, the data recording and analysis are computerized and norm
comparisons are also carried out automatically. Thus, computational errors are excluded. In
order to enhance the scoring objectivity of the open ended/constructed response portion, a
standardized scoring procedure has been implemented. This scoring procedure requires
the rater to answer simple questions concerning the answers provided by the respondents.
Corresponding scores are automatically assigned by the scoring module. There is always
the potential problem of unintended bias when grading constructed response items (i.e.,
when the grader is not blind as to the respondent’s identity). This concern is minimal given
the computerized grading prompts. Empirical studies on the inter-rater reliability of the
HCTA indicated that the scoring module results in a high level of inter-rater reliability of the
open ended/constructed response portion of HCTA (See Table 1). Taken together, the
available evidence indicates that scoring objectivity can be assumed for both the multiple
choice portion as well as the open ended/constructed response portion of HCTA.

Interpretation objectivity
The interpretation of the results with the HCTA is objective because it is based on test
norms (Lienert & Raatz, 1998). Interpretation objectivity does, however, also depend on the
care with which the guidelines on interpretation given in the chapter “Interpretation of Test
Results” are followed.

4.2 Reliability

Inter-Rater Reliability
The inter-rater reliability was calculated with a sample comprising 200 respondents aged
between 18 and 72 (mean=32.14; SD=15.22). This was a subset of the standardization
sample of HCTA Form A, who worked on a 25 item version of this test. There were no data
on the gender of 25% of the sample. The gender composition for the rest of the sample was
38% male (62% female). A total of 50 respondents attended an (open admissions)
community college, 50 respondents attended a state university, 50 respondents attended a
private liberal arts college, and 50 respondents were recruited from the local community,
with a wide range of educational levels.

Table 1 shows the inter-rater reliabilities for the constructed response portion (free recall)
and the total scores of test form S1.


Table 1: Inter-rater reliabilities of the free recall and total scores of the main variables and differentiated results
of HCTA/S1

Test variable                                        Inter-rater reliability (total sample)
Critical Thinking                                    0.93
Critical Thinking – free recall                      0.83
Verbal Reasoning                                     0.74
Verbal Reasoning – free recall                       0.60
Argument Analysis                                    0.88
Argument Analysis – free recall                      0.70
Thinking as Hypothesis Testing                       0.91
Thinking as Hypothesis Testing – free recall         0.75
Likelihood and Uncertainty                           0.88
Likelihood and Uncertainty – free recall             0.82
Decision Making and Problem Solving                  0.74
Decision Making and Problem Solving – free recall    0.53

The results indicate acceptable inter-rater reliabilities for the constructed response portion,
with the exception of the scales ‘verbal reasoning’ and ‘decision making and problem
solving’. However, the inter-rater reliability of the main variable ‘critical thinking – free
recall’ is sufficiently high (r = 0.83). Furthermore, the inter-rater reliability of the total score,
which combines free recall and recognition, turned out to be rather high at r = 0.93. When
the inter-rater reliability of the main variable ‘Critical Thinking’ was calculated separately for
the four subsamples, it varied from 0.83 to 0.96.

In a next step we calculated paired sample t-tests to evaluate the effect of the rater on
mean scores. The results are summarized in Table 2.

Table 2: Mean differences between the two raters in the main variables and differentiated results of HCTA/S1

Test variable                                        t        df    p       Cohen’s d
Critical Thinking                                     0.375   199   0.708   <0.01
Critical Thinking – free recall                      -0.227   199   0.821   <0.01
Verbal Reasoning                                     -0.626   199   0.532    0.03
Verbal Reasoning – free recall                       -1.015   199   0.311    0.07
Argument Analysis                                     1.917   199   0.057    0.07
Argument Analysis – free recall                       1.278   199   0.203    0.08
Thinking as Hypothesis Testing                        2.188   199   0.030    0.07
Thinking as Hypothesis Testing – free recall          1.942   199   0.054    0.09
Likelihood and Uncertainty                           -1.694   199   0.092    0.07
Likelihood and Uncertainty – free recall             -1.760   199   0.080    0.08
Decision Making and Problem Solving                  -0.999   199   0.319    0.05
Decision Making and Problem Solving – free recall    -1.429   199   0.155    0.10

The results indicated that the means of the two main variables ‘Critical Thinking’ and
‘Critical Thinking – free recall’ were not significantly affected by the two raters. This argues
for the stability of the results obtained with HCTA/S1 across raters and the scoring
objectivity of the test. Furthermore, there were no significant mean differences in the total
scores and the free recall scores of the five sub-scales of HCTA between the two raters.
Only the scores of the differentiated results for the scale ‘Thinking as Hypothesis Testing’
exhibited a significant effect of the rater. However, the effect size indicated that the
influence of the rater on the scale score was negligible (Cohen’s d = 0.07).

Taken together, the results indicate that the main variables ‘critical thinking’ and ‘critical
thinking – free recall’ were highly correlated across raters, and there were no reliable mean
differences in the scores granted by the two raters on the basis of the same test protocols.
A similar argument can be made for the total scores and free recall scores of the individual
sub-scales, although the correlations across raters were substantially lower. This is
particularly true for the free recall portion of the subscales ‘verbal reasoning’ and ‘decision
making and problem solving’. Nevertheless, the total scores of these subscales turned out
to be sufficiently stable across raters as indicated by the correlation between the two raters
and the lack of significant mean differences in the scales’ total scores.
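
The statistics reported in Tables 1 and 2 can be computed with standard tools. The
following Python sketch is a minimal illustration using simulated rater data (not the actual
standardization sample): it computes the inter-rater correlation, the paired sample t-test,
and a paired-samples Cohen's d. Note that several variants of Cohen's d for paired data
exist; the one shown divides the mean difference by the standard deviation of the
differences, which may differ from the manual's computation.

    import numpy as np
    from scipy import stats

    # Simulated total scores assigned by two raters to the same 200 protocols.
    rng = np.random.default_rng(42)
    rater_a = rng.normal(100, 15, size=200)
    rater_b = rater_a + rng.normal(0, 5, size=200)   # rater B largely agrees

    # Inter-rater reliability as a Pearson correlation (as in Table 1).
    r, _ = stats.pearsonr(rater_a, rater_b)

    # Paired sample t-test for mean differences between raters (as in Table 2).
    t, p = stats.ttest_rel(rater_a, rater_b)

    # Cohen's d for paired data: mean difference / SD of the differences.
    diff = rater_a - rater_b
    d = diff.mean() / diff.std(ddof=1)

    print(f"r = {r:.2f}, t({len(diff) - 1}) = {t:.3f}, p = {p:.3f}, d = {d:.2f}")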

No separate studies on the inter-rater reliability of Form B (S3) have been carried out yet.
However, since Form A and Form B use comparable item formats, it can be assumed that
the results of Form A can be generalized to Form B.

4.2.1 Measurement Reliability


Reliability is reported here in terms of internal consistency (how closely related a set of
items is as a group). The internal consistency of the main variables of both test forms was
evaluated in terms of Cronbach α coefficients, calculated on the basis of the
standardization samples. The results obtained are summarized in Table 3.

Table 3: Internal consistencies (Cronbach α) of the main variables of HCTA

Test variable                       HCTA Form A (S1)   HCTA Form B (S3)
Critical Thinking                   0.88               0.82
Critical Thinking – recognition     0.77               0.68
Critical Thinking – free recall     0.83               0.78

Test variable                       HCTA Form A (S2)   HCTA Form B (S4)
Critical Thinking                   0.77               0.68

The results indicate a sufficient measurement precision of both test forms of the HCTA.
Furthermore, the measurement precision of the main variables ‘Critical Thinking –
recognition’ and ‘Critical Thinking – free recall’ is sufficient to enable a differentiated
interpretation of recognition and free recall competencies at the level of individual
respondents when using the standard test forms.
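
For reference, Cronbach's α for a scale of k items is the standard internal consistency
coefficient (this is the general definition, not anything specific to the HCTA):

    \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

where σ²_{Y_i} is the variance of item i and σ²_X is the variance of the total score.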

4.3 Validity

Modern concepts of validity acknowledge the need to provide multiple validity evidence to
support conclusions drawn from test scores (cf. Embretson, 2005; Kane, 1992, 2001;
Messick, 1995; Wainer & Braun, 1988).


Research on three facets of validity, namely content validity, construct validity, and criterion
validity, is summarized in this section. Because most of these studies have been published
in peer-reviewed journals and other academic outlets, only the main findings and necessary
background information are provided. For additional information about any of these studies,
please see the related publications.

4.3.1 Content Validity


The HCTA was designed to match constructs that are most commonly listed in definitions of
critical thinking - the ability to reason effectively with (natural) language, analyze arguments
including the ability to identify and provide reasons that support conclusions and
assumptions, use assessment of likelihood and uncertainty appropriately, use the principles
of hypothesis testing in everyday situations, and solve problems and make decisions. Thus
content validity can be assumed due to the item design process.

4.3.2 Construct and Criterion Validity
In general, construct validity determines what test scores mean. It is concerned with the
cognitive processes respondents utilize to solve the test items and the theoretically
expected correlations between construct-related and construct-unrelated test scores (cf.
Embretson, 2005; Messick, 1995).
This section reports the results of various studies on the validity of HCTA. Unless stated
otherwise, these studies used the 25 item version of HCTA.

Internal structure of the Halpern Critical Thinking Assessment:

The Halpern Critical Thinking Assessment has been developed to measure critical thinking
skills using two item formats: (1) constructed responses and (2) multiple choice.

The main idea behind using these two item formats was that constructed response
measures how well respondents can actively utilize critical thinking skills in the five domains
covered by the HCTA. By contrast, the multiple choice item formats assess whether
respondents are able to recognize good usage of critical thinking skills in these five
domains. Thus, constructed response and multiple choice scores are hypothesized to
measure correlated, albeit distinguishable critical thinking skills.

Initial evidence in favor of this hypothesis was obtained in four small scale studies
conducted in the course of the test development process. All four studies evaluated the
correlation between the free recall and forced choice critical thinking scores.

The first study was conducted at a selective Catholic University in Southern California,
USA. A total of 98 students (number of males = 60) took the HCTA as part of a course
requirement. The mean for this sample was 109 (SD = 10.3, range = 77 to 135). The
correlation between constructed response and forced choice scores was .39. This suggests
that the free response and multiple choice scores are correlated, but also separable
aspects of critical thinking skills.

A similar study used the Spanish language translation of the HCTA with a community
sample of 355 working adults in Spain (number of males = 224). The mean for this sample
was 106 (SD = 16.1, range = 60 to 149). The correlation between constructed response
and forced choice scores was .49, again showing a substantial relationship between these
two measures, while also supporting the idea that they are separable. Cronbach alpha for
this Spanish language version with community adults was .78.

In the third study a Dutch language translation of the HCTA was administered to all first
year university students majoring in education at a university in Belgium (Francois,
Verburgh, & Elen, manuscript under review). One hundred seventy-three students (number
of males = 11) took the Dutch version of the HCTA. The mean total score for this sample
was 113.15 (SD = 11.5, range = 50 to 142). The correlation between constructed response
items and forced choice items was .42, again suggesting that although these two answer
formats are related, two different constructs are being evaluated with each question type.

The fourth and final small scale study was conducted with a U.S. community college
sample. In the United States, community colleges are typically 2-year colleges with open
admissions policies (i.e., nonselective admissions). Fifty students took the HCTA online.
They ranged in age from 19 to 65 years, with a mean age of 25.78 years (SD = 8.27
years). The mean total HCTA score was 101 (SD = 17.39). The correlation between their
constructed response and forced choice scores was .51.

Further evidence on the separability of critical thinking skills measured by means of a
constructed response format and a multiple choice format, respectively, has been obtained
in two studies on the factorial structure of the HCTA.

The first study evaluated the factor structure of the HCTA using the current U.S. norm
sample by means of confirmatory factor analyses. A detailed description of the sample
characteristics can be found in chapter 5.1.

Three competing measurement models were specified and compared to each other in
terms of their global goodness of fit to the norm data. In the first measurement model (M1)
scale scores in the five domains covered in the HCTA that used a constructed response
item format were hypothesized to load on a latent “critical thinking – free recall”-factor. The
five sub-scale scores obtained by means of a multiple choice item format marked a latent
“critical thinking – recognition”-factor. Both latent factors were allowed to be correlated to
reflect the hypothesis that the constructed response and multiple choice portion of the
HCTA measure two correlated, albeit separable facets of critical thinking. Furthermore,
correlated uniquenesses between corresponding domain scores were allowed to be freely
estimated, because the corresponding constructed response and multiple choice items share the
same item stem. The second measurement model (M2) constitutes a more restrictive
version of model M1. In this model the standardized latent correlation between the “critical
thinking – free recall”-factor and the “critical thinking – recognition”-factor was set to 1. This
was done to test the hypothesis that the “critical thinking – free recall”-factor and the “critical
thinking – recognition”-factor are indistinguishable at the level of the latent traits. The third
measurement model (M3) also imposed a restriction on the standardized latent correlation
between the “critical thinking – free recall”-factor and the “critical thinking – recognition”-
factor. However, this time the standardized latent factor correlation was set to 0 to test the
hypothesis that “critical thinking – free recall” and “critical thinking – recognition” constitute
entirely separate latent traits.
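
A measurement model of this kind can be specified in standard SEM software. The sketch
below is illustrative only: it encodes model M1 using the open-source Python package
semopy (the study itself used AMOS, as noted below); models M2 and M3 would
additionally constrain the latent correlation between CTF and CTR to 1 or 0, respectively.

    import pandas as pd
    from semopy import Model, calc_stats

    # Model M1: two correlated latent factors (CTF, CTR), each marked by the
    # five domain sub-scale scores, plus correlated uniquenesses between
    # corresponding sub-scales that share an item stem.
    M1_DESC = """
    CTF =~ VRF + AAF + HTF + LUF + PSF
    CTR =~ VRR + AAR + HTR + LUR + PSR
    CTF ~~ CTR
    VRF ~~ VRR
    AAF ~~ AAR
    HTF ~~ HTR
    LUF ~~ LUR
    PSF ~~ PSR
    """

    def fit_m1(data: pd.DataFrame):
        """Fit model M1 to a DataFrame holding the ten sub-scale scores."""
        model = Model(M1_DESC)
        model.fit(data)            # maximum likelihood estimation
        return calc_stats(model)   # chi-square, df, CFI, RMSEA, etc.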

Since all measures met the criteria for univariate normality (all skewness≤2; all kurtosis≤2)
maximum likelihood was used to estimate the model parameters. The calculations were
carried out with AMOS 5.0 (Arbuckle, 2003). The global goodness of fit of the three
competing measurement models was evaluated using the following cut-off values: non-
significant χ²-test, CFI≥.95, and RMSEA≤.06 (cf. Hu & Bentler, 1999; Jackson, Gillaspy &
Purc-Stephenson, 2009; Marsh, Hau, & Wen, 2004). Furthermore, since models M2 and
M3 were both nested within model M1, the fit of both measurement models was also
compared to that of their less restrictive precursor M1 using the following criteria: a non-
significant Δχ² statistic and ΔCFI≤.01 (Cheung & Rensvold, 2002). The model fit statistics
of the three competing measurement models are summarized in Table 4.


Table 4: Goodness of fit of three competing measurement models of HCTA

Measurement models in the HCTA norm sample

Model  χ²       df  p       CFI    RMSEA [90% CI]         Δχ²      Δdf  p       ΔCFI
M1     52.127   29  0.005   0.978  0.042 [0.023; 0.060]   --       --   --      --
M2     73.032   30  <0.001  0.958  0.057 [0.040; 0.073]   20.905   1    <0.001  0.020
M3     294.246  30  <0.001  0.744  0.140 [0.126; 0.155]   242.119  1    <0.001  0.234

The results indicated that model M1, which assumed two correlated but separable latent
traits, fit the data well. Furthermore, this model also fit the data significantly better than
model M2, which reflected the hypothesis that the two latent factors “critical thinking – free
recall” and “critical thinking – recognition” are indistinguishable. Model M3 not only fit the
data less well than model M1 but also failed to fit the data by conventional criteria. The
standardized factor loadings and structural relations can be seen in Figure 1.

Figure 1: Standardized factor loadings of the HCTA measurement model. CTF= latent factor critical
thinking – free recall, CTR=latent factor critical thinking – recognition, VRF= sub-scale score verbal
reasoning – free recall, AAF= sub-scale score argument analysis – free recall, HTF= sub-scale score
hypotheses testing – free recall, LUF= sub-scale score likelihood and uncertainty – free recall, PSF=
sub-scale score decision making and problem solving – free recall, VRR= sub-scale score verbal
reasoning – recognition, AAR= sub-scale score argument analysis – recognition, HTR= sub-scale
score hypotheses testing – recognition, LUR= sub-scale score likelihood and uncertainty –
recognition, PSR= sub-scale score decision making and problem solving – recognition.

The factor loadings of all constructed response and multiple choice sub-scales on their
corresponding latent factors were significant and can be regarded as medium to high in
magnitude. The correlations between the uniqueness variances of corresponding sub-
scales measured by means of the two item formats also reached significance with the
exception of the correlation between the uniqueness variances of the scales “Likelihood
and uncertainty – free response” and “Likelihood and uncertainty – recognition”. However,
the size of the correlated uniqueness variances turned out to be rather small. Most
interestingly, the standardized latent correlation between the “critical thinking – free recall”-
factor and the “critical thinking – recognition”-factor was rather large (r = 0.871), albeit
significantly different from 1 (cf. model M1 versus model M2).

Taken together, the results support the hypothesis that critical thinking skills measured by means of either a constructed response or a multiple choice item format represent related, yet separable, facets of critical thinking. Furthermore, the size of the latent correlation between the “critical thinking – free recall” factor and the “critical thinking – recognition” factor suggests that the calculation of a global critical thinking score, which subsumes critical thinking skills demonstrated in both item formats, is warranted.

The second study investigating the factor structure of the HCTA was conducted in the course of assessing the cross-cultural invariance of the HCTA across the United States of America and Hong Kong (Hau et al., 2006; Ku et al., 2006). One hundred forty-two undergraduates from a highly selective Hong Kong university (43 male, 99 female; 17% Arts, 23% Science, 40% Social Science, 20% other) and 153 U.S. undergraduates from a nonselective public university in Southern California (30 male, 121 female, 2 missing gender information; 5% Arts, 17% Science, 77% Social Science) participated in this study. In the Chinese sample students also completed the NEO-FFI (a five-factor personality inventory) and provided information on their gender, GPA (M=2.89, SD=0.56) and university major. In the U.S. sample SAT-Verbal (M=418, SD=93) and SAT-Math (M=441, SD=92) scores were obtained from the students’ college records. Results on the convergent, divergent and criterion validity of the HCTA are reported elsewhere in this manual (cf. section on convergent and discriminant validity, study 4). In this section, results on the factor structure of the HCTA and its invariance across cultures are reported.

The authors evaluated the fit of several factor analytic models using standardized item
scores in the Chinese sample. The first model tested assumed ten separable latent factors,
which represent the five domains of critical thinking skills assessed by means of
constructed response and multiple choice item formats, respectively. The five constructed
response domain factors and the five multiple choice domain factors were allowed to be
fully intercorrelated. This model will be referred to as model M1. The second model (model
M2) replaced the correlations between the latent factors with two higher-order factors
reflecting “critical thinking – free recall” and “critical thinking – recognition”, respectively.
These two latent higher-order factors were allowed to be correlated. Model M3 is based on
model M2 but specified additional correlations between corresponding domain factors in the
multiple choice and constructed response item format over and above already existing
relations between the two higher-order factors to reflect the hypothesis that the five
domains represent separable, albeit correlated, facets of critical thinking skills. The fourth
model (M4) represented a more general version of model M3, which also allowed for
correlated uniqueness between corresponding items in both item formats to reflect the use
of an identical item stem in multiple choice and constructed response format items. The
model parameters were estimated by means of maximum likelihood estimation. The global
goodness of fit of the four competing measurement models was evaluated using the
following cut-off values: non-significant χ²-test, CFI≥.95, and RMSEA≤.06 (cf. Hu & Bentler,
1999; Jackson, Gillaspy & Purc-Stephenson, 2009; Marsh, Hau, & Wen, 2004). The results
are summarized in Table 5.

Table 5: Goodness of fit of four competing measurement models at the item level in the Chinese sample

Model χ² df p CFI RMSEA
M1 2031.87 1155 <0.001 0.681 0.035
M2 1825.85 1164 <0.001 0.762 0.034
M3 1697.61 1159 <0.001 0.807 0.029
M4 1360.15 1134 <0.001 0.919 0.019

The results indicated that model M4, which assumes ten separable first-order factors
representing the five critical thinking domains in both item formats (including correlations
between matching first-order factors and correlated uniqueness between matching items in
both item response formats) and two correlated higher-order factors fit the data best. The authors reported large higher-order to domain facet factor loadings (average loading =
0.863) and a strong relation between the higher-order “critical thinking – free response” and
“critical thinking – recognition” factors (r = 0.93).

In general, the results obtained at the item level in the Chinese sample showed that, despite evidence of sufficient differentiation between the five critical thinking domains and between the two higher-order factors “critical thinking – free response” and “critical thinking – recognition”, calculating a global critical thinking score is well justified, given the high, albeit not perfect, correlation between these two higher-order factors even after accounting for correlated uniqueness and other domain- and item-format-specific relations.

The authors also evaluated the factor structure of the HCTA at the level of the ten sub-
scales in the American and Chinese sample and tested the measurement invariance of this
model across the two samples (Hau et al., 2006). The model tested in both samples
(American versus Chinese sample) corresponded to the measurement model M1 described
in the study evaluating the factor structure of the HCTA using the current American norm
sample. The model parameters were estimated by means of maximum likelihood
estimation. The global goodness of fit of the four competing measurement models was
evaluated using the classic criteria outlined above. In a first step, the authors tested for
configural invariance by restricting the general factor structure to be equal in the American
and Chinese samples. This model will be referred to as configural invariance model (CI).
Next, equality constraints were imposed on the factor loadings to test the hypotheses that
all domain scales contribute to a comparable extent to the higher order “critical thinking –
free recall” and “critical thinking – recognition” factors in both samples. This measurement
invariance model is commonly called weak measurement invariance model (WI). The third
model also imposed equality constraints on the correlated uniqueness terms and the
correlation between the latent “critical thinking – free recall” factor and the “critical thinking –
recognition” factor. This model will be called weak invariance + structural invariance model
(WI+S). In the fourth and final model, the authors also restricted the uniqueness variances of the 10 HCTA domain scales to be equal across both samples to examine whether measurement precision can be assumed to be equal across both samples at the level of the 10 HCTA domain sub-scales; this model is referred to as model SU in Table 6. The model parameters were estimated by means of a maximum likelihood method and the fit of the four successively more restricted models was
evaluated using the following cut-off values: non-significant χ²-test, CFI≥.95, RMSEA≤.06
(cf. Hu & Bentler, 1999; Jackson, Gillaspy & Purc-Stephenson, 2009; Marsh, Hau, & Wen,
2004) and non-significant Δχ² statistic and ΔCFI≤.01 compared to the less restrictive
precursor of each model (Cheung & Rensvold, 2002). The results are summarized in Table
6.

Table 6: Goodness of fit of four successively more restricted measurement invariance models across the American and Chinese samples

Model χ² df p CFI RMSEA Δχ² Δdf p ΔCFI
CI 64.31 58 0.265 0.991 0.021 -- -- -- --
WI 64.31 68 0.604 0.964 0.048 0.00 10 0.999 0.027
WI+S 92.60 69 0.031 0.965 0.047 28.29 11 0.003 0.026
SU 140.74 84 <0.001 0.916 0.064 76.43 16 <0.001 0.048

The results indicated that the configural invariance model fit the data well. Thus, the general factor structure of the HCTA can be assumed to be the same in the American and the Chinese sample. Restricting the factor loadings to be equal across the American and Chinese samples did not substantially deteriorate the model fit. Most interestingly, further restricting the structural parameters (correlated uniqueness and factor correlation) to be equal only slightly reduced the model fit compared to
the weak invariance model. However, imposing additional equality constraints on the uniqueness variances led to a significant decrease in model fit. Thus, the results indicated that the general factor structure and the factor loadings can be assumed to be invariant across both samples. However, they also indicate the need to provide culture-specific norms and to evaluate the psychometric characteristics of adaptations of the HCTA to other cultures.

In sum, results on the factor structure of HCTA indicated that the proposed hierarchical
model of critical thinking skills with item format-specific domain facets (k=10) at the lowest
level and two separable, albeit highly correlated, second-order factors representing “critical
thinking – free recall” and “critical thinking – recognition” provides an adequate account of
the internal structure of HCTA. Due to the high correlation between the two latent factors
“critical thinking – free recall” and “critical thinking – recognition” the calculation of a global
critical thinking score also seems to be justified.

Convergent and discriminant validity of the Halpern Critical Thinking Assessment:

In addition to research conducted on the factor structure of the HCTA numerous studies
also provide evidence on the convergent and discriminant validity of the HCTA.

A good test of critical thinking would be expected to correlate positively with measures of
academic success (standardized tests and advancement to higher levels of education).
Because critical thinking is hypothesized to account for a portion of academic success, the size of the correlations should vary depending on the reliability of the criterion scores and their conceptual overlap with the critical thinking construct. Grades in school, for example,
may be weakly related to critical thinking because school grades include noncognitive
variables such as attendance, turning in work on time, and personality traits such as being
friendly. School grades may also depend on rote memorization (e.g., the multiplication
tables or rote recall of information provided in class) and to the extent that this is true, low to
moderate correlations with critical thinking would be expected.

Convergent validity was assessed in a study in which 80 high school juniors and seniors in California were compared with 80 college students at a nonselective state university (details
are provided in Halpern, 2007). The high school students were all of the students enrolled
in high school classes that were randomly selected for inclusion in this study. The college
students were in freshman level (first year) college classes. They participated in this study
to fulfill a course requirement. There were an equal number of males and females in each
group. In addition to the HCTA, all students took the Arlin Test of Formal Reasoning, which
assesses the level of cognitive development using the framework developed by the
developmental psychologist Jean Piaget. As expected, the sample of college students
scored higher than the high school students (all p’s < .001) on both the HCTA and Arlin
Test. (Because of small changes in grading between studies, mean values are not
reported.) Correlations between the HCTA and Arlin Test were .32 for both groups. There
were no significant differences among college students based on major, but the number in
each major was small in this study (18 social science, 16 natural science, 17 humanities, 20
business, with the remainder undecided about college major). There were no gender
differences for either group. Cronbach alphas for the high school and college samples were .79 and .76, respectively.

A second study of convergent validity examined the relationship between the HCTA and
standardized tests for admissions to undergraduate and graduate programs. Researchers
also assessed the relationship of the HCTA to two personality variables—need for cognition
and conscientiousness. The HCTA is a cognitively demanding test in that respondents need
to think about their responses to various scenarios and maintain attention for 45 to 80
minutes (using the longer 25-question version). One possibility is that the HCTA is also
assessing conscientiousness, which is important to success in many areas. One hundred forty-five college students from a non-selective state university (55 males and 90 females)
and 32 second year Master's Level students (15 males and 17 females) from the same
state university took the HCTA. All students participated to fulfill a course requirement.
Additional data that were collected included information about age, major, grade point average, and scores on the SAT and GRE. In addition, all participants took the Need for
Cognition Scale, an 18-item self-rating scale (Cacioppo, Petty, & Kao, 1984). It is designed
to assess an individual’s tendency to “engage in and enjoy effortful cognitive endeavors”
(Cacioppo, Petty, Feinstein, & Jarvis, 1996, p. 17). Participants rate items using a 7-point
scale to indicate the extent to which the items in this assessment are an accurate
description of themselves. Finally, the 20-item “Conscientiousness Scale” (Costa & McCrae, 1992) was administered. It is one of the scales associated with the “Big Five” factor theory of personality.

The Halpern model of critical thinking (2003) posits a disposition to think critically as
important for distinguishing the ability to think critically from the likelihood that someone will
actually use their critical thinking skills in situations where the skills are needed. Being
conscientious was hypothesized to be a personality trait that would be positively related to
scores on the HCTA. Participants rated items using a 7-point scale to indicate the extent to
which the item is an accurate description of them. A Cronbach alpha reliability coefficient
was computed using 10 scale scores as the items (constructed response and multiple-
choice for argument analysis, hypothesis testing, probability and likelihood, verbal
reasoning, and creative thinking), alpha = .81. Grading of the HCTA has been modified
slightly for use with the Vienna Test System, so mean values from this study are not directly
comparable to those that would be obtained using the Vienna Test System. The graduate
student sample scored significantly higher than the undergraduate student sample (Mean
for graduate students on the constructed response = 35 (SD = 1.5) and on the forced
choice 64 (SD = .81); Mean for undergraduate students on the constructed response = 29
(SD= .69) and on the forced choice 57 (SD = 9.5), which is a statistically significant
difference, p < .05). Correlations with total scores on the HCTA were computed for the Need for Cognition Scale (r = .34), the Conscientiousness Scale (r = .02), Grade Point Average (GPA, r
=.35; undergraduates; grades for graduate students showed too little variability to compute
a correlation since most graduate students received a grade of A in their courses), SAT-
Verbal (r = .58), SAT-Mathematics (r = .50). GPA is a weighted average of the grades in all
academic courses. It is commonly used in the United States, most often (but not always)
with scales running from 0 (failed course) to 4 (Top grade of A in course). Correlations with
GRE are for graduate students only: GRE-Analytic, which is no longer administered (r =
.59), GRE-Verbal (r = .12), and GRE-Quantitative (r = .20). There were no gender
differences for either group. Thus, the HCTA was positively correlated with Need for Cognition and undergraduate grades, with a close-to-zero correlation with conscientiousness.
Moderate correlations were obtained with standardized college and graduate school
admissions tests and graduate students obtained significantly higher scores on the HCTA
than undergraduates.
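As an illustration of this type of reliability estimate, the following minimal Python sketch computes Cronbach’s alpha from a respondents-by-subscales score matrix, with the 10 sub-scale scores treated as items as in the analysis above; the data used here are simulated, not the study data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix.

    Here the "items" are the 10 HCTA sub-scale scores, as in the
    analysis reported above.
    """
    k = scores.shape[1]                         # number of sub-scales
    item_vars = scores.var(axis=0, ddof=1)      # variance of each sub-scale
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated data: 145 respondents x 10 sub-scale scores sharing a
# common factor, so the sub-scales are positively intercorrelated.
rng = np.random.default_rng(0)
scores = rng.normal(size=(145, 1)) + rng.normal(scale=0.8, size=(145, 10))
print(f"alpha = {cronbach_alpha(scores):.2f}")
```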

A third study compared convergent validity for a Chinese sample and a sample of U.S.
college students (Hau et al., 2006; Ku et al., 2006). The sample and general study design
characteristics of this study were already presented in the section describing the second study evaluating the factor structure of the HCTA and its invariance across cultures. In addition to assessing the factor structure, HCTA scores were compared with grade point
averages and scores on standardized college admissions tests. These researchers found
that SAT-verbal scores predicted Argument Analysis (open-ended, beta = .58) and Likelihood/Uncertainty skills (forced-choice, beta = .85), while SAT-math predicted Hypothesis Testing (open-ended, beta = .40), Probability/Uncertainty (open-ended, beta = .52), and Argument Analysis skills (forced-choice, beta = .70). Correlations with grade point average
(GPA) were not statistically significant in this study. Thus, scores on the HCTA had higher
correlations with SAT scores than with GPA, which supports the general belief that SAT targets curriculum-free aptitudes while GPA reflects diligence and persistence more than
critical thinking abilities.

A fourth study, also conducted with a Chinese sample (using the Chinese-language version of the HCTA), used the HCTA along with a test of cognitive ability (the Verbal Comprehension Index of the WAIS-III, Chinese version), the Need for Cognition Scale, the Openness and Conscientiousness subscales of the NEO (Big Five personality inventory), and the Concern for Truth Scale, whose items reflect the tendency to look for answers or solutions from preconceptions, authorities or other people, rather than exercising independent judgment based on truth, evidence, and reasoning (Ku & Ho, 2010).
researchers found the following correlations with the HCTA: r = .45 with cognitive ability; r =
.10 for Need for Cognition; r = .06 for Openness to Experience; r = -.08 for
Conscientiousness; and r = .34 for Concern for Truth. Thus, low (close to zero) correlations
were replicated with Conscientiousness with this Chinese sample, and moderate positive
correlations were found with cognitive ability and Concern for Truth.

(Quasi-)Experimental evidence on the construct validity:

Most schools and universities claim that they produce students who can think critically. But
how can they assess gains in critical thinking and determine if their educational practices
have resulted in improvements in critical thinking that transfer to novel situations and across
disciplines? Critical thinking instruction is based on two fundamental premises: (1) there are
clearly identifiable and definable critical thinking skills that students can be taught to
recognize and apply, and (2) when these skills are recognized and applied students
become more effective thinkers. There is a very large research literature showing that when
critical thinking skills are explicitly taught and when they are taught for transfer, students do
become better thinkers. (For reviews of the research literature on teaching for critical thinking see Butler & Halpern, 2011; Halpern & Butler, 2013; Moseley, Baumfield, & Elliott, 2005; and Sternberg, Roediger, & Halpern, 2007.)

Two studies with the goal of enhancing critical thinking skills in low performing high school
students in an inland area of California are described in Marin and Halpern (2011). Two groups of high school seniors from low performing schools in California, USA, took the HCTA as a pretest and posttest. The experimental groups received one of two types of critical thinking intervention (critical thinking taught either embedded in the regular curriculum or as a separate topic), and wait-list control groups served as comparison (Marin & Halpern, 2011).
students were assigned at random to either control (no critical thinking intervention) or one
of two experimental groups. All participants in the first study were volunteers who were paid
for their participation. For the second study, entire classes were randomly selected to serve
as control or experimental groups. The participants in the second study were not paid.
Participation was required in their classes. (Additional information about the samples is
provided in the published report.) The treatment consisted of either explicit instruction in
critical thinking skills (with materials designed for high school students) or implicit instruction
in which the critical thinking skills were embedded into the course content. For this study,
the 5 scenarios that concern likelihood and uncertainty were not included, because these topics had been omitted from the intervention designed for these high school students owing to their generally low scores on tests of mathematics. Cronbach’s alpha reliability coefficients were calculated for the post-test scores on the HCTA, which had 20 everyday scenarios on a variety of topics, each followed by questions that first require a constructed (open-ended) response and then forced choice items (i.e., multiple choice or multiple rating); for constructed response and forced choice questions combined, α = .82. For constructed response questions only, α = 0.81; for forced choice questions alone, α = 0.79, indicating satisfactory internal consistency for the critical thinking measure. Other academic-related assessments were correlated with the HCTA. Pretest scores on the
HCTA were significantly correlated with academic measures, although there were large differences in the size of the correlations: Grade Point Average (r = 0.34, p = .006),
California Standards Test in mathematics (r = .39, p = 0.004), California Standards Test in
English/Language Arts (r = 0.60, p < 0.001), California High School Exit Exam in
Mathematics (r = 0.56, p < 0.001), and California High School Exit Exam in English
Language Arts (r = 0.57, p < 0.001). Additional analyses indicated that students who were enrolled in below grade level math courses scored significantly lower on the pre-test of critical thinking skills than those at or above grade level (t(61) = 4.46, p < .001; d = 1.33). A similar analysis using grade level of English courses failed to reach significance.

The high school students who were enrolled in a science course scored significantly higher on the HCTA than those who were not enrolled in a science course (t(60) = 3.11, p < .001; d = 0.80). The two dispositional assessments, “Conscientiousness” and “Need for Cognition,” had low correlations with scores on the HCTA (r = 0.20, p = .051 and r = 0.20, p = .051, respectively). These low positive correlations are higher than those found with other populations, suggesting that these personality variables may be more predictive of success for high school students than for other groups. Participants also completed the Comprehensive Ability Battery, consisting of short timed tests which measure Verbal, Numerical, and Spatial abilities (CAB V, N, S; Hakstian & Cattell, 1975). Significant correlations with performance on the assessment of critical thinking were found for all three cognitive ability tests: CAB-V and HCTA (r = 0.54, p < 0.001); CAB-N and HCTA (r = 0.26, p = 0.044); CAB-S and HCTA (r = 0.28, p = 0.029).

Correlation between the Halpern Critical Thinking Assessment and everyday decision making:

Most critical thinking assessments predominantly use scholastic achievement measures to validate their critical thinking measures. Butler (2012; Butler, Dwyer, Hogan, Franco, Rivas,
Saiz & Almeida, 2012, study 1) argued that critical thinking assessments intended to
measure critical thinking at a broader level should also correlate with the quality of everyday
life decisions made by the respondents.

Based on theoretical considerations, the authors (Butler, 2012; Butler et al., 2012, study 1)
expected critical thinkers to make more informed decisions in everyday life situations, which
should help them to avoid negative life events that may arise from sub-optimal decisions.

The authors used items from the Decision Outcome Inventory (DOI; de Bruin, Parker & Fischhoff, 2007) to measure the quality of respondents’ decisions in various everyday life situations (e.g. interpersonal, business, financial). The inventory consists of k=28 item sets and k=9 individual items. Each item set comprises a negative life event and a decision which preceded it and made the negative life event possible. Respondents were asked to indicate whether they experienced the negative life event and the decision(s) preceding it within the last six months using a dichotomous rating scale (yes/no). The total score obtained in this measure is divided by the number of possible negative life events. Thus, high scores in this self-report inventory are indicative of less than optimal decision making in real life settings.
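Read literally, this scoring rule can be sketched as follows (a simplified illustration of the description above; the exact DOI scoring procedure in de Bruin et al., 2007, is more elaborate, and the function and variable names here are hypothetical):

```python
from typing import List, Tuple

def doi_score(item_sets: List[Tuple[bool, bool]]) -> float:
    """Simplified DOI score following the description above.

    Each item set is a pair (made_decision, experienced_negative_event).
    A negative life event is only "possible" if the respondent made the
    decision that precedes it; the score is the number of experienced
    negative events divided by the number of possible ones, so higher
    scores indicate poorer everyday decision making.
    """
    possible = [experienced for made, experienced in item_sets if made]
    if not possible:
        return 0.0
    return sum(possible) / len(possible)

# Hypothetical respondent: 4 preceding decisions made,
# 2 of the possible negative life events experienced.
answers = [(True, True), (True, False), (False, False),
           (True, True), (True, False)]
print(f"DOI score = {doi_score(answers):.2f}")  # -> 0.50
```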

Butler (2012) administered the Decision Outcome Inventory (DOI; de Bruin et al., 2007) together with the HCTA to a sample of n=131 respondents. Respondents were sampled from three qualitatively different populations (n=35 community college students, n=46 state university students and n=50 community adults). The total sample consisted of 87 (66.40%) females and 44 (33.60%) males aged 18-71 years (Mean=27.15; SD=13.16) with a diverse ethnic background (43.80% Caucasian, 21.90% Hispanic or Latino/a, 15.50% Asian, 7.80% African American, 5.50% multi-racial, and 5.40% who either refused to provide information on their ethnicity or reported another ethnic background). Further details on the characteristics of the sample (e.g. annual income) are available in Butler (2012).

In a first step, the Pearson correlation coefficients between respondents’ scores in the Decision Outcome Inventory (DOI) and the Halpern Critical Thinking Assessment (HCTA) were calculated for the total sample and for each of the three sub-samples separately.
summarized in Table 7.

Table 7: Pearson correlation between HCTA and DOI

Total sample Community college students State university students Community adults
-0.38 (p=0.001) -0.23 (p>0.05) -0.29 (p=0.047) -0.59 (p=0.001)

Within the sample of community adults the zero-order correlation was quite strong, while it was modest in size in the state university sample and statistically not significant – albeit small to moderate in size – in the community college student sample. The authors believe that the lower correlations in the student samples resulted from the nature of the items on the DOI, which were heavily weighted with actions that would be associated with older adults, such as declaring bankruptcy or missing a mortgage payment.

Next, a hierarchical multiple regression analysis was calculated to examine whether the size of the relation between the Decision Outcome Inventory (DOI) and the Halpern Critical Thinking Assessment (HCTA) differs across the three sub-samples, which would indicate a predictive bias. In a first step, respondents’ HCTA score was entered into the regression model together with two dummy coded variables, which represented respondents’ group membership. In calculating these dummy variables the community adult sample served as the reference sample. Next, two interaction terms between the dummy coded group membership variables and respondents’ HCTA scores were entered into the equation. Respondents’ Decision Outcome Inventory (DOI) scores served as the dependent variable.
If the dummy coded group membership variables reach significance, uniform predictive bias exists and is reflected in different regression intercepts for the three sub-samples. Non-uniform predictive bias exists if the slopes of the regression equation differ across the three sub-samples. In the present analysis, this would be reflected in a significant interaction between the group membership dummy variables and respondents’ HCTA test performance, which should also result in a significant change in R², because these interaction effects were entered into the equation at the second step. The results of these analyses are summarized in Table 8.

Table 8: Standardized β weights and global fit statistics of the hierarchical multiple regression analysis

Predictor Model 1 β Model 2 β
Constant -- --
Community college students -0.112 (p>0.05) -0.103 (p>0.05)
State university students 0.054 (p>0.05) 0.031 (p>0.05)
HCTA score -0.416 (p<0.001) -0.610 (p>0.05)
HCTA × community college students -- 0.184 (p>0.05)
HCTA × state university students -- 0.168 (p>0.05)

Global fit statistics Model 1 Model 2
R² 0.163 0.185
F 8.254 (p<0.001) 5.679 (p<0.001)
ΔR² -- 0.022
ΔF -- 1.684 (p>0.05)
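A minimal Python sketch of this two-step moderated regression test of predictive bias, using statsmodels on simulated data (the data frame and variable names are hypothetical; the published analysis used the original Butler, 2012, data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 131

# Simulated data mimicking the design: HCTA scores, DOI scores and
# dummy-coded group membership (community adults = reference group).
group = rng.integers(0, 3, n)
data = pd.DataFrame({
    "hcta": rng.normal(90, 15, n),
    "cc": (group == 0).astype(float),  # community college dummy
    "su": (group == 1).astype(float),  # state university dummy
})
data["doi"] = 0.5 - 0.002 * data["hcta"] + rng.normal(0, 0.1, n)

# Step 1: main effects only (uniform bias would show in the dummies).
x1 = sm.add_constant(data[["hcta", "cc", "su"]])
m1 = sm.OLS(data["doi"], x1).fit()

# Step 2: add HCTA-by-group interactions (non-uniform bias).
data["hcta_cc"] = data["hcta"] * data["cc"]
data["hcta_su"] = data["hcta"] * data["su"]
x2 = sm.add_constant(data[["hcta", "cc", "su", "hcta_cc", "hcta_su"]])
m2 = sm.OLS(data["doi"], x2).fit()

# F test on the change in R² when the interaction terms are added.
f_val, p_val, df_diff = m2.compare_f_test(m1)
print(f"R2 step 1 = {m1.rsquared:.3f}, R2 step 2 = {m2.rsquared:.3f}")
print(f"Delta-F({int(df_diff)}, {int(m2.df_resid)}) = {f_val:.3f}, p = {p_val:.3f}")
```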

The results indicated that neither uniform nor non-uniform predictive bias existed. This conclusion is based on the finding that adding the interaction effects in the second step of the hierarchical multiple regression analysis did not yield a significant change in R², and both interaction effects were non-significant in the final model. The relation between critical thinking and the quality of everyday life decisions turned out to be invariant across the three sub-samples investigated (lack of non-uniform predictive bias).

Furthermore, the main effects of group membership also failed to reach significance (lack of
uniform predictive bias). The only predictor variable that turned out to contribute
significantly to the prediction of respondents’ quality of decisions in everyday life situations
was the total score obtained in the HCTA. This implies that critical thinking is related to the quality of decisions made in everyday life situations and that this relation is identical across community college students, state university students and community adults.

In a re-analysis of these data the authors also examined the contribution of the HCTA-free
recall and HCTA-recognition scores to the predictive validity of the test. In a first step the
Pearson correlation coefficients between respondents’ scores in the Decision Outcome
Inventory (DOI) and the HCTA-free recall and recognition scores were calculated. The
results are summarized in Table 9.

Table 9: Pearson correlation between HCTA scores and DOI

HCTA total score HCTA-recognition HCTA-free recall
-0.358 (p<0.001) -0.275 (p=0.002) -0.380 (p<0.001)

The results indicated that the predictive validity of the HCTA-free recall score exceeds the
predictive validity of the HCTA-recognition score and the total score. Nevertheless, the
HCTA-recognition score turned out to be significantly related to the criterion measure, which argues for the predictive validity of test form S2, which is intended to serve as a screening measure of critical thinking skills.

Next, a hierarchical multiple regression analysis was calculated to examine whether the HCTA-free recall score contributes incrementally to the prediction of everyday decision making beyond the HCTA-recognition score. The results are summarized in Table 10.

Table 10: Standardized β weights and global fit statistics of the hierarchical multiple regression analysis

Predictor Model 1 β Model 2 β
Constant -- --
HCTA-recognition -0.275 (p=0.002) -0.004 (p=0.972)
HCTA-free recall -- -0.377 (p=0.002)

Global fit statistics Model 1 Model 2
R 0.275 0.380
R² 0.075 0.144
F 10.116 (p=0.002) 10.356 (p<0.001)
ΔR² -- 0.069
ΔF -- 9.873 (p=0.002)
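The significance of such an increment in R² corresponds to the standard F test for nested regression models. A minimal sketch, assuming the full sample size of n = 131 (the effective n underlying Table 10 may be slightly smaller due to missing data, which would explain small deviations from the reported ΔF):

```python
from scipy.stats import f as f_dist

def delta_r2_f_test(r2_full, r2_restricted, n, k_full, k_added):
    """F test for the increment in R² when k_added predictors are added.

    n is the sample size and k_full the total number of predictors in
    the full model (excluding the intercept).
    """
    df2 = n - k_full - 1
    f_stat = ((r2_full - r2_restricted) / k_added) / ((1 - r2_full) / df2)
    p = f_dist.sf(f_stat, k_added, df2)
    return f_stat, p

# Values from Table 10; n = 131 assumed (see above).
f_stat, p = delta_r2_f_test(r2_full=0.144, r2_restricted=0.075,
                            n=131, k_full=2, k_added=1)
print(f"Delta-F = {f_stat:.2f}, p = {p:.4f}")  # ~10.3, close to the 9.873 in Table 10
```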

The results indicated that the HCTA-free recall score contributed incrementally to the prediction of everyday decision making. In fact, when both scores (free recall and recognition) were entered into the predictive model, the free recall score turned out to be considerably more relevant than the recognition score. This indicates that using the constructed response format improves the predictive validity of the Halpern Critical Thinking Assessment considerably, although the multiple choice short version is significantly related to relevant real-world criteria when used on its own.

Even though the evidence on criterion validity would have been stronger if respondents’ decision-making quality had been observed in vivo, the results generally support the criterion validity of the HCTA with regard to an important real-life criterion that has often been neglected in evaluating critical thinking assessments. Subsequent studies may also wish to adopt an experimental training approach to evaluate the hypothesis that critical thinking is causally related to the quality of decisions in everyday life situations in order to further strengthen the empirical evidence on the criterion validity and practical relevance of
the HCTA.

4.4 Economy
Being a computerized test, HCTA is very economical to administer and score. Because the
pre-test instruction is standardized and raw and norm scores are calculated automatically,
the administrator does not need to spend time giving verbal instructions or calculating raw
and norm scores.

4.5 Usefulness
"A test is useful if it measures or predicts a personality trait or mode of behaviour for the
assessment of which there is a practical need. A test therefore has a high degree of
usefulness if it cannot be replaced by any other test.” (Lienert & Raatz, 1998, p. 13, translated). HCTA can be regarded as useful, since critical thinking is often identified as a
key educational outcome, and it is valued in most employment settings.

4.6 Reasonableness
Reasonableness describes the extent to which a test is free of stress for the test subject;
the respondent should not find the experience emotionally taxing and the time spent on the
test should be proportional to the expected usefulness of the information gained (Kubinger,
2003). Since respondents are neither put under mental or physiological stress nor under
time pressure, HCTA fulfills the criterion of reasonableness.

4.7 Resistance to faking


A test that meets the quality criterion of resistance to faking is one that can prevent a respondent from answering questions in a manner deliberately intended to influence or control his or her test score (Kubinger, 2003).
order to give socially acceptable answers or to portray themselves in a good light. One
advantage of the Halpern Critical Thinking Assessment is that the best response is often
not the most socially appropriate one, and the quality of a response depends on the criteria
used in making the response.

4.8 Fairness
If tests are to meet the quality criterion of fairness, they must not systematically discriminate
against particular groups of respondents on the grounds of their sociocultural background
(Kubinger, 2003).
The structural invariance of the HCTA across cultures and its use in different cultural settings demonstrate its fairness to people from different cultural groups. Furthermore, across all studies, there have been no statistically significant sex differences, thus supporting the fairness of scores for males and females.

5 NORMS
The norms were obtained by calculating the mean percentile rank PR(x) for each raw score
X according to the formula (from Lienert & Raatz, 1998):

PR(x) = 100 · (cum fx − fx/2) / N

cum fx corresponds to the number of respondents who have achieved the raw score X or a
lower score, fx is the number of respondents with the raw score X, and N is the size of the
sample.
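A minimal Python sketch of this norming formula, applied to a small hypothetical set of raw scores:

```python
import numpy as np

def percentile_ranks(raw_scores):
    """Mean percentile rank PR(x) = 100 * (cum fx - fx / 2) / N.

    cum fx counts respondents scoring x or lower, fx counts respondents
    scoring exactly x, and N is the sample size.
    """
    n = len(raw_scores)
    values, counts = np.unique(raw_scores, return_counts=True)
    cum_counts = np.cumsum(counts)
    return {int(x): 100.0 * (cum - f / 2.0) / n
            for x, f, cum in zip(values, counts, cum_counts)}

# Hypothetical norm sample of raw scores.
sample = np.array([88, 90, 90, 95, 100, 100, 100, 110])
for score, pr in percentile_ranks(sample).items():
    print(f"raw score {score}: PR = {pr:.1f}")  # e.g. raw score 90: PR = 25.0
```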

All the norms are provided as percentile ranks and T scores and are automatically calculated by the software. Test administrators may inspect the norm table by means of the norm table explorer implemented in the Vienna Test System.

5.1 Description of the norm samples

5.1.1 Test forms S1 and S2
The norm data of test forms S1 and S2 were gathered between 2009 and 2014 under the
guidance of the test author. The 2015 norm sample comprised 482 respondents aged
between 18 and 72. The mean age was 27.21 years with a standard deviation of 10.2
years. The median age was 23 years. It consisted of 178 (36.6%) males and 228 (46.9%)
females. There were no data on the gender of the remaining 80 (16.5%) respondents. A
total of 156 (32.1%) respondents have had a school leaving qualification at university
entrance level (=EU educational level 4, USA highschool or secondary school graduation)
and 42 (8.6%) respondents have a university degree or are currently enrolled in post-
secondary education (=EU educational level 5). There were no data on the educational
level of the remaining 288 (59.3%) respondents.

Descriptive statistics of the norm data


Although norms are automatically calculated and can be easily inspected using the norm table explorer implemented in the Vienna Test System, Table 11 provides the descriptive statistics of the main variables and differentiated (component) variables obtained in the norm sample described in section 5.1.1.

Table 11: Descriptive statistics for the main variables and differentiated (component) variables in the norm
sample

Main variable Mean SD Min Max


Critical Thinking 90.53 18.88 28 130
Critical Thinking – recognition 51.73 9.31 19 69
Critical Thinking – free recall 38.8 11.25 2 65
Differentiated variable Mean SD Min Max
Verbal Reasoning 7.91 2.66 0 16
Verbal Reasoning – recognition 2.83 0.88 0 5
Verbal Reasoning – free recall 5.08 2.23 0 12
Argument Analysis 23.49 6.13 1 36
Argument Analysis – recognition 12.09 3.48 1 18
Argument Analysis – free recall 11.39 3.65 0 19
Thinking as Hypothesis Testing 19.43 5.18 4 31
Thinking as Hypothesis Testing – recognition 7.24 2.89 0 13
Thinking as Hypothesis Testing – free recall 12.19 3.06 2 18
Likelihood and Uncertainty 9.7 3.76 0 17
Likelihood and Uncertainty – recognition 5.59 3.03 0 11
Likelihood and Uncertainty – free recall 4.11 1.27 0 6
Decision Making and Problem Solving 30.02 6.21 7 42
Decision Making and Problem Solving – recognition 9.5 3.52 0 17
Decision Making and Problem Solving – free recall 20.52 3.91 7 28

Effect of Gender on Test Performance

There were no statistically significant differences between women and men on education level or on test scores (overall or for constructed response and forced choice test items). There were no significant gender differences in ‘Critical Thinking – recognition’ (t[404]=.614, p=0.54), ‘Critical Thinking – free recall’ (t[404]=1.559, p=0.12), or the global score of ‘Critical Thinking’ (t[404]=1.229, p=0.22).
Based on these results there was no need to calculate separate norms for the different
subsamples.
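Such gender comparisons correspond to standard independent-samples t-tests. A minimal sketch with simulated scores (two groups of the reported sizes, drawn here from one common distribution, i.e. with no true gender difference):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Simulated 'Critical Thinking' global scores for 178 men and 228 women.
scores_male = rng.normal(90.5, 18.9, 178)
scores_female = rng.normal(90.5, 18.9, 228)

t_stat, p_value = ttest_ind(scores_male, scores_female)
df = len(scores_male) + len(scores_female) - 2  # = 404, as reported above
print(f"t({df}) = {t_stat:.3f}, p = {p_value:.2f}")
```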

5.1.2 Test forms S3 and S4
The norm data of test forms S3 and S4 were gathered between 2014 and 2015 under the
guidance of the test author. The 2015 norm sample comprised 313 respondents aged
between 19 and 66. The mean age was 25.23 years with a standard deviation of 8.242
years. The median age was 21 years. It consisted of 78 (24.9%) males and 132 (42.2%)
females. There were no data on the gender of the remaining 103 (32.9%) respondents.

Descriptive statistics of the norm data


Although norms are automatically calculated and can be easily inspected using the norm table explorer implemented in the Vienna Test System, Table 12 provides the descriptive statistics of the main variables and differentiated (component) variables obtained in the norm sample described in section 5.1.2.

Table 12: Descriptive statistics for the main variables and differentiated (component) variables in the norm
sample

Main variable Mean SD Min Max


Critical Thinking 87 15.05 38 125
Critical Thinking – recognition 49.68 7.27 21 64
Critical Thinking – free recall 37.32 10.43 10 64
Differentiated variable Mean SD Min Max
Verbal Reasoning 8.11 2.45 2 16
Verbal Reasoning – recognition 5.12 2.04 1 11
Verbal Reasoning – free recall 2.99 1 0 5
Argument Analysis 22.77 4.93 3 34
Argument Analysis – recognition 10.72 3.22 0 19
Argument Analysis – free recall 12.04 2.88 2 18
Thinking as Hypothesis Testing 16.81 4.72 4 27
Thinking as Hypothesis Testing – recognition 6.7 2.88 0 13
Thinking as Hypothesis Testing – free recall 10.11 2.75 2 16
Likelihood and Uncertainty 9.54 3.14 1 16
Likelihood and Uncertainty – recognition 5.52 2.68 0 11
Likelihood and Uncertainty – free recall 4.02 1.08 0 6
Decision Making and Problem Solving 29.78 5.32 8 41
Decision Making and Problem Solving – recognition 9.26 3.5 0 17
Decision Making and Problem Solving – free recall 20.52 3.36 6 27

Effect of Gender on Test Performance


There were no statistically significant differences between women and men on any of the variables (including age, education level, and test scores, overall or for constructed response and forced choice test items). There were no significant gender differences in ‘Critical Thinking –
recognition’ (t[208]=1.221, p=0.223), ‘Critical Thinking – free recall’ (t[208]=0.041, p=0.967),
or the global score of ‘Critical Thinking’ (t[208]=0.614, p=0.540).
Based on these results there was no need to calculate separate norms for the different
subsamples.

6 TEST ADMINISTRATION
There are two main options for administering the HCTA. It can be administered on-line or
on individual computers. Respondents can receive a link to the test site in their email or via
any other electronic means (e.g., click on a link next to their name on a spread sheet). The
link takes the participant to the HCTA. Test administrators can opt to have normed scores
returned for each test taker (i.e., all grading done by the publisher) or to grade the
constructed response questions using the question prompts.

Alternatively, respondents can take the HCTA on individual computers located at a test
center, office, laboratory, or any other place that is quiet and where the respondent will not
be disturbed. The Vienna Test System currently runs on recent Windows operating systems. If the individual computer option is selected, the Vienna Test System needs to be installed and loaded with the HCTA. Forced choice questions are automatically scored, and constructed response questions are scored by the administrator using a computer-guided grading system.

Regardless of the method selected, the program begins by asking for basic demographic
information (e.g., name, age, gender, educational level), which is followed by instructions
then the assessment. HCTA consists of a combined instruction and practice phase and the
test phase.

6.1 Instruction and practice phase


An explanation of the 20 scenarios is presented at the start of the assessment. The
respondent is told that each scenario is accompanied by a series of open ended and
multiple choice questions that probe the respondent’s reasoning about the scenario.
Following the instructions, a sample scenario is presented and the respondent is asked to
answer an open ended question about the sample scenario. An example of a correct
solution is provided in order to demonstrate this task to the respondent (see Figure 2).

Figure 2: Sample scenario for the HCTA


Respondents are asked to study the example of the correct solution and compare it to their
own answer. This portion of the instruction phase is specific to test forms S1 and S3.
Immediately after the sample open-ended question is answered, a sample of a multiple
choice question is posed. Respondents are asked to work through the sample on their own.
After pressing the continue button, feedback is provided. Respondents are asked to modify
their own answer in line with the feedback provided. This portion of the instruction and
exercise phase is common to all test forms.

6.2 Test phase


In the test phase, the respondent responds to 20 scenarios, which are evenly divided
among the five facets of critical thinking. In test forms S1 and S3, the respondents work on
open ended and multiple choice questions for each of the scenarios. In test forms S2 and
S4, the respondent only needs to answer the multiple choice portion. There are no time
limits to complete the test and no feedback is provided during the test phase. Once an
answer has been given, the respondent presses the “continue” button to either move to the
next set of questions, or to read the next scenario. The items are presented in a fixed linear
order.

6.3 Scoring module in test forms S1 and S3

Since test forms S1 and S3 use an open-ended / constructed response format, the answers given by the respondents need to be scored by a human rater. The scoring can be done either immediately after the respondent has completed the test or at a later time.
In the Vienna Test System, HCTA results which have already been scored are marked with a green checkmark. Results which still have to be scored are marked with a red cross.

If the test administrators opt to administer the HCTA on individual computers using the
Vienna Test System, they can access scored data using the following button in the Test
results section of the Vienna Test System:

Figure 3: The ‘Show’ button for accessing rated test results


If the test administrators want to score HCTA data, they can use the Evaluate button:

Figure 4: The ‘Evaluate’ button for rating unrated test results


If the test administrators want to score the responses immediately, they need to click ‘Evaluate’ to get to the scoring module (see Figure 4). At this point, each constructed response will be displayed along with a series of grading prompts that are answered either along the 3-point continuum “clearly indicated – less clearly indicated – not indicated” or using “yes” and “no”. An example is shown in Figure 5.

Figure 5: Scoring module


The screen of the scoring module is divided into two parts. The upper part contains the scenario (framed text at the top), the question posed to the respondent and the answer he or she provided (text written in blue). In the lower part of the screen a series of simple questions is presented to the rater. These questions ask the rater to evaluate the extent to which certain content is indicated in the respondent’s answer. The rater has to answer these questions using either the answer alternatives ‘Yes’ / ‘No’ or ‘clearly indicated’ / ‘less clearly indicated’ / ‘not indicated’.

The rater may save all ratings by using the ‘Save’ button in the top left corner of the scoring module window. The ‘Manual’ button provides access to the test manual. The arrow buttons at the bottom of this window allow the rater to proceed to later rating questions and to return to previously presented rating questions. By closing this window, the rater may exit the scoring module. After exiting the scoring module, the rater may access the data again at a later time to answer the remaining scoring questions. However, it is not possible to alter scorings made in past scoring sessions.

There is no need to assign scores to the respondent’s answers. This is done automatically by the scoring module on the basis of the rater’s answers to the questions posed in the lower part of the scoring module screen. Once the scoring has been completed, the rater is immediately directed to the results of the respondent. The ratings are stored by the system along with the previously
stored data provided by the respondent. Grading takes approximately 5 to 10 minutes per
HCTA, depending on the clarity of the responses and the experience of the grader (with
experienced graders taking less time than new graders).

6.3.1 Delayed scoring


To re-access a dataset at a later time, go to ‘Scoring’ and search for the test record using the ‘Search’ field (cf. Figure 6). The search will yield the desired data file. Mark the specific file and click the button ‘Show’ or ‘Evaluate’, respectively.

Figure 6: Accessing data files for delayed scoring.


Again, pressing ‘Show’ will lead the test administrator to the test results and ‘Evaluate’ to
the scoring module, respectively.

6.3.2 Data export for the Statistical Package for the Social Sciences (SPSS)
SPSS is a widely used software package for statistical calculations. SPSS data files can easily be converted to other file formats, such as Microsoft Excel or .csv. To export data to SPSS, select the data to export and click the ‘SPSS’ button in the ‘Export’ field in the Vienna Test System:

Figure 7: The Export field.


By clicking ‘Data Export for SPSS’ a navigation box will open, which allows the test administrator to select a drive location for saving the exported data. (The CSV option in the export field allows the user to save data as .csv files, while the WTS option allows data to be exported as *.xstp files; *.xstp files can be used to move datasets between various installations of the Vienna Test System.)
The data conversion process yields two standard SPSS files: an SPSS syntax file, denoted by the extension .sps, and a data file, denoted by the extension .asc. The syntax file and the .asc file will have the name of the test and test version, for example HCTA01.sps and HCTA01.asc. To run the syntax file in SPSS, open the syntax file, highlight the syntax and select the ‘run’ option. Please note: in order to export the data you must have the graded data and SPSS on the same computer.
SPSS is not included with the HCTA. Users will need to obtain a valid license for SPSS to use this program feature.

7 INTERPRETATION OF TEST RESULTS
7.1 General notes on interpretation
As a general rule, a percentile rank of <25 can be regarded as below average. An individual
with such a result can be regarded as having below-average ability in comparison to the
reference population.
A percentile rank between 25 and 74 is an average score. The ability of an individual whose
score is in this range is in broad terms typical of that of the reference population.
Percentile ranks of 75 or above reflect a clearly above average result. In comparison to the reference population, individuals with percentile ranks in this range demonstrate above average ability.
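Expressed as a simple decision rule, these interpretation bands look as follows (a sketch; the band labels follow the text above):

```python
def interpret_percentile_rank(pr: float) -> str:
    """Map a percentile rank onto the interpretation bands described above."""
    if pr < 25:
        return "below average"
    if pr < 75:
        return "average"
    return "above average"

for pr in (10, 50, 80):
    print(f"PR {pr}: {interpret_percentile_rank(pr)}")
```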

7.2 Interpretation of the variables of HCTA

Main variables

Critical Thinking
This variable provides a global indication of the respondent’s critical reasoning skills.
A high score (PR>75) indicates that the respondent’s critical reasoning skills are strong.

Critical Thinking – recognition

This variable provides a global indication of the extent to which the respondent is able to passively use critical reasoning skills in the course of everyday conversation and reasoning.
A high score (PR>75) indicates that the respondent’s ability is strong.

Critical Thinking – free recall


This variable provides a global indication of the extent to which the respondent tends to actively use critical reasoning skills in the course of everyday conversation and reasoning. A high score (PR>75) indicates that the respondent’s ability is strong.

7.3 Additional output of results

Profile
The profile is a diagrammatic representation of the normed test scores; it enables the
respondent’s performance to be compared easily with the reference sample. The gray area
indicates the average range; it covers the mean +/- one standard deviation. Scores in the
white area to the left are below average; those in the white area to the right are above
average. The respondent’s score is indicated by a point. The range marking to the left and
right of this point indicates the range within which the respondent’s performance lies with a
reliability of 95%.

Test protocol
The test protocol for the first phase provides information on the respondent’s performance in the constructed response portion of the HCTA. It shows the score obtained by the respondent for each of the 20 items and the response latency in minutes and seconds. This information can be used to investigate whether a higher than average number of incorrect answers arose at any particular point during the test. The test protocol for the second phase provides information on the respondent’s performance in the multiple choice portion of the HCTA.
In the test protocol, the first column denotes the item number, while the second column
provides information on the number of multiple choice questions administered in the course
of this particular item. The entries in the third to twelfth columns denote the score and the
response latency obtained for each individual question associated with a certain item.

Item analysis protocol
The item analysis protocol contains the responses to each item of the HCTA, together with the awarded score. This part of the test evaluation can be used to inspect individual answers to the HCTA and to check the scoring of individual respondents.

8 GENERAL CONCLUSION
HCTA is the only test of critical thinking that uses multiple response formats, which allows
test takers to demonstrate their ability to think about everyday topics using both constructed
responses and recognition formats. The unique scoring system with its simple scoring
prompts is the key to high inter-rater reliabilities for written responses. Thus, it combines the
ecological validity of open-ended responding with a reliable scoring system. It is also the
only test of critical thinking that can predict what respondents say they do in real-world
settings, outside of the laboratory or testing center.

HCTA has been validated with numerous diverse samples. It is currently used around the world in many languages, and the test will soon be available in the Vienna Test System in multiple languages. It offers an easy way to assess learning outcomes for programs that aim to enhance critical thinking and a means of assessing levels of critical thinking from age 18 through adulthood.

9 REFERENCES
American Management Association. (2010). AMA 2010 Critical Skills Survey. Executive
Summary. Retrieved from
http://www.p21.org/storage/documents/Critical%20Skills%20Survey%20Executive%
20Summary.pdf
American Psychological Association. (2009, November). The assessment cyberguide for
learning goals and outcomes, 2nd ed., Retrieved from
http://apa.org/ed/governance/bea/assessment-cyberguide-v2.pdf
Ananiadou, K., & Claro, M. (2009). 21st Century Skills and Competences for New Millennium Learners in OECD Countries. OECD Education Working Papers, 41. OECD Publishing. Retrieved from http://dx.doi.org/10.1787/218525261154
Arbuckle, J. L. (2003). Amos 5.0 update to the Amos user’s guide. Chicago, IL: Smallwaters Corporation.
Bridgeman, B., & Moran, R. (1996). Success in college for students with discrepancies between performance on multiple choice and essay tests. Journal of Educational Psychology, 88, 333–340.
Butler, H. A. (2012). Halpern Critical Thinking Assessment predicts real-world outcomes of
critical thinking. Applied Cognitive Psychology, doi: 10.1002/acp.2851
Butler, H. A., & Halpern, D. F. (2011). Critical Thinking. Oxford Bibliographies Online. New
York, N.Y.: Oxford University Press. Available from OBO website:
http://aboutobo.com/
Butler, H. A., Dwyer, C. P., Hogan, M. J., Franco, A., Rivas, S. F., Saiz, C., & Almeida, L. S. (2012). The Halpern Critical Thinking Assessment and real-world outcomes: Cross-national applications. Thinking Skills and Creativity, 7, 112-121.
Cacioppo, J. T., Petty, R. E., & Kao, C. F. (1984). The efficient assessment of need for
cognition. Journal of Personality Assessment, 48, 306-307.
Cacioppo, J. T., Petty, R. E., Feinstein, J. A., & Jarvis, W. B. G. (1996). Dispositional differences in cognitive motivation: The life and times of individuals varying in need for cognition. Psychological Bulletin, 119, 197-253.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling, 9, 233-255.
Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
de Bruin, W. B., Parker, A. M., & Fischhoff, B. (2007). Individual differences in adult
decision-making competence. Journal of Personality and Social Psychology, 92,
938-956.
Embretson, S. E. (2005). Measuring human intelligence with artificial intelligence. In R. J.
Sternberg & J. E. Pretz (Eds.). Cognition and Intelligence (pp. 251-267). New York:
Cambridge University Press.
Fischer, S. C., & Spiker, V. A. (2000). A framework for critical thinking research and
training. (Report Prepared for the U. S. Army Research Institute, Alexandria, VA).
Hakstian, A. R., & Cattell, R. B. (1975). The comprehensive ability battery. Champaign, IL:
Institute for Personality and Ability Testing.

Halpern, D. F. (1994). Critical thinking: The 21st century imperative for higher education.
The Long Term View, 2, 12-16.
Halpern, D. F. (1998). Teaching critical thinking for transfer across domains. American
Psychologist, 53, 449-455.
Halpern, D. F. (2003). Thought and knowledge: An introduction to critical thinking (4th ed.).
Mahwah, NJ: Lawrence Erlbaum Associates.
Halpern, D. F. (2007). Is intelligence critical thinking? Why we need a new definition for
intelligence. In P. Kyllonen, L. Stankov, & R. D. Roberts (Eds.), Extending
intelligence: Enhancement and new constructs (pp. 349-370). Mahwah, NJ:
Lawrence Erlbaum Associates.
Halpern, D. F. (2014). Thought and knowledge: An introduction to critical thinking (5th ed.).
New York, NY: Psychology Press.
Halpern, D. F., & Butler, H. A. (2013). Assessment in higher education: Admission and
outcomes. In K. F. Geisinger, B. A. Bracken, J. F. Carlson, J.-I. C. Hansen, N. R.
Kuncel, S. P. Reise, & M. C. Rodriguez (Eds.), Handbook of testing and assessment
in psychology (Vol. 3, pp. 319-336). Washington, DC: APA Books.
Hart Research Associates. (2013, April 10). It takes more than a major: Employer priorities
for college learning and student success. Washington, DC: Hart Research
Associates.
Hau, K. T., Halpern, D., Marin-Burkhart, L., Ho, I. T., Ku, K. Y. L., Chan, N. M., et al.
(2006, April). Chinese and United States students' critical thinking: Cross-cultural
validation of a critical thinking assessment. Paper presented at the American
Educational Research Association Annual Meeting, San Francisco, CA.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling: A Multidisciplinary Journal, 6, 1-55.
Hunt, E. B. (1995). Will we be smart enough? A cognitive analysis of the coming workforce.
New York: Russell Sage Foundation.
Jackson, D. L., Gillaspy, J. A. Jr., & Purc-Stephenson, R. (2009). Reporting practices in
confirmatory factor analysis: An overview and some recommendations.
Psychological Methods, 14, 6-23.
Jones, E. A., Dougherty, B. C., Fantaske, P., & Hoffman, S. (1997). Identifying college
graduates' essential skills in reading and problem solving: Perspectives of faculty,
employers, and policymakers. University Park, PA: U.S. Department of Education.
Jones, E. A., Hoffman, S., Moore, L. M., Ratcliff, G., Tibbetts, S., & Click, B. A. (1995).
National assessment of college student learning: Identifying college graduates’
essential skills in writing, speech and listening, and critical thinking. (NCES 95-001).
Washington, DC: U.S. Government Printing Office.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112,
527-535.
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319-342.


Ku, K. Y. L. (2009). Assessing students' critical thinking performance: Urging for
measurements using multi-response format. Thinking Skills and Creativity, 4, 70-76.
Ku, K. Y. L., & Ho, I. T. (2010). Dispositional factors predicting Chinese students' critical
thinking performance. Personality and Individual Differences, 48, 54-58.
Ku, K. Y. L., Ngai-Man, C., Miu-Chi Lun, V., Halpern, D. F., Marin-Burkhart, L., Hau, K.-T., &
Ho, I. T. (2006). Chinese and U.S. undergraduates' critical thinking skills:
Academic and dispositional predictors. Paper presented at the Annual Meeting of
the American Educational Research Association, San Francisco, CA.
Kubinger, K. D. (2003). Gütekriterien. In K. D. Kubinger & R. S. Jäger (Eds.),
Schlüsselbegriffe der psychologischen Diagnostik (pp. 195-204). Weinheim:
Psychologie Verlags Union.
Lienert, G. A., & Raatz, U. (1998). Testaufbau und Testpraxis. Weinheim: Beltz.
Marin, L., & Halpern, D. F. (2011). Pedagogy for developing critical thinking in adolescents:
Explicit instruction produces greatest gains. Thinking Skills and Creativity, 6, 1-13.
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on
hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in
overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11,
320-341.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from
persons' responses and performances as scientific inquiry into score meaning.
American Psychologist, 50, 741-749.
Moseley, D., Baumfield, V., Elliott, J., & Higgins, S. (2005). Frameworks for thinking: A
handbook for teaching and learning. Cambridge, UK: Cambridge University Press.
Riggio, H. R., & Halpern, D. F. (2006). Understanding human thought: Educating students
as critical thinkers. In W. Buskist & S. F. Davis (Eds.), Handbook of the teaching of
psychology (pp. 78-84). Malden, MA: Blackwell Publishing.
Stanovich, K. E. (2009). What intelligence tests miss: The psychology of rational thought.
New Haven, CT: Yale University Press.
Sternberg, R. J., Roediger, H. L., III, & Halpern, D. F. (Eds.). (2007). Critical thinking in
psychology. New York, NY: Cambridge University Press.
Trilling, B., & Fadel, C. (2009). 21st century skills: Learning for life in our times. San
Francisco: Jossey-Bass.
Wainer, H., & Braun, H. I. (Eds.). (1988). Test validity. Hillsdale, NJ: Lawrence Erlbaum
Associates.
