

The Quarterly Journal of Experimental Psychology, 2011, 64 (3), 467–484

Improving college students' evaluation of text learning using idea-unit standards

John Dunlosky, Marissa K. Hartwig and Katherine A. Rawson

Kent State University, Kent, OH, USA

Amanda R. Lipko
The College at Brockport State University of New York, Brockport, NY, USA

When recalling key definitions from class materials, college students are often overconfident in the quality of their responses. Even with commission errors, they often judge that their response is entirely or partially correct. To further understand this overconfidence, we investigated whether idea-unit judgements would reduce overconfidence (Experiments 1 and 2) and whether students inflated their scores because they believed that they knew answers but just responded incorrectly (Experiment 2). College students studied key-term definitions and later attempted to recall each definition when given the key term (e.g., What is the availability heuristic?). All students judged the quality of their recall, but some were given a full-definition standard to use, whereas other students first judged whether their response included each of the individual ideas within the corresponding correct answer. In Experiment 1, making these idea-unit judgements reduced overconfidence for commission errors. In Experiment 2, some students were given the correct definitions and graded other students' responses, and some students generated idea units themselves before judging their responses. Students were overconfident even when they graded other students' responses, and, as important, self-generated idea units for each definition also reduced overconfidence in commission errors. Thus, overconfidence appears to result from difficulties in evaluating the quality of recall responses, and such overconfidence can be reduced by using idea-unit judgements.

Keywords: Overconfidence; Judgement accuracy; Standards of evaluation; Idea-unit standards.

Metacomprehension research investigates people's ability to judge their learning and/or comprehension of text materials. In a typical experiment, participants study multiple short texts (e.g., sentences or full paragraphs) and then predict how well they will perform on tests over the text materials. These predictive judgements are then compared to test performance, and a greater correspondence

Correspondence should be addressed to John Dunlosky, Kent State University, Psychology Department, Kent, OH 44242, USA.
E-mail: jdunlosk@kent.edu
Many thanks go to Jeffrey Karpicke for a helpful review of this article. This research was supported by the Institute of Education
Sciences, U.S. Department of Education, through Grant R305H050038 to Kent State University and was partially supported by the
James S. McDonnell Foundation 21st Century Science Initiative in Bridging Brain, Mind and Behavior Collaborative Award. The
opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education or the
James S. McDonnell Foundation.

© 2010 The Experimental Psychology Society

http://www.psypress.com/qjep DOI:10.1080/17470218.2010.502239

between predictions and performance indicates higher levels of accuracy. Generally, the accuracy of these judgements is relatively low (for recent reviews, see Dunlosky & Lipko, 2007; Thiede, Griffin, Wiley, & Redford, 2009; Zhao & Linderholm, 2008), which can limit the effectiveness of students' self-regulated learning (Thiede, 1996; Thiede, Anderson, & Therriault, 2003). Accordingly, researchers have been exploring why the accuracy of students' judgements is low and devising techniques that will boost judgement accuracy (for some recent successes, see Thiede et al., 2003; Thomas & McDaniel, 2007).

Two kinds of judgement accuracy—relative and absolute accuracy—have been investigated in metacomprehension research. Relative accuracy refers to the degree to which the relative ordering of an individual's judgements across texts corresponds to the relative ordering of test performance across texts. To obtain high levels of relative accuracy, a student's judgements should increase across texts as test performance increases. Absolute accuracy refers to the degree to which the magnitude of the judgements corresponds to the absolute level of test performance. If a student judges that a particular text has been entirely learned but test performance for that text is zero, then the student would be demonstrating poor absolute accuracy. In this case, the student's judgement would demonstrate a high degree of overconfidence.

In the present research, we explore a new metacognitive technique aimed at improving the absolute accuracy of students' judgements of their learning of simple text materials. Our approach diverges from standard metacomprehension research in two key ways. First, instead of having students predict how well they will perform on future tests across the materials, we have them judge how well they have currently learned the materials. Predictions about future performance not only tap how accurately people judge their learning but also can be influenced by their beliefs about retention. By contrast, judgements about current learning tap the quality of students' evaluations without being contaminated by beliefs about retention (Rawson, Dunlosky, & McDonald, 2002). Thus, if a student's judgements of current learning are inaccurate, such inaccuracy cannot be attributed to the student using a faulty heuristic about how forgetting will influence performance on the future test (Koriat, Bjork, Sheffer, & Bar, 2004).

Second, the literature on metacomprehension almost exclusively investigates relative accuracy, which we do not consider further in this article. Instead, our main focus is on absolute accuracy, because the few studies that have investigated absolute accuracy have found students' judgements to be substantially overconfident (e.g., Dunlosky, Rawson, & Middleton, 2005; Miesner & Maki, 2007). Consider outcomes from Miesner and Maki, who had college students study brief texts, take essay tests, and then judge how well they performed on the tests. Students' judgements were highly overconfident: On average, they judged that they had scored 68% on the essay tests, but their test score was only 36%! Given that students tend to terminate study once they believe that they can recall what they are attempting to learn (Kornell & Bjork, 2008), overconfidence can lead to premature termination of study and lower test performance (e.g., Nietfeld, Cao, & Osborne, 2006; Thiede, 1999). Thus, it is important to discover techniques that reduce students' overconfidence in their learning, which is a major goal of our experiments.

To better understand our approach, imagine a college student studying key term definitions from a chapter in an Introductory Psychology textbook. The student wants to master a list of definitions about memory concepts, which include terms such as long-term memory, proactive interference, and encoding specificity. After studying all the definitions, the student decides to evaluate how well each has been learned by recalling each definition from memory and then comparing the recall responses to the correct definitions. The student's goal is to evaluate which of the concepts were not correctly recalled, so that more time can be devoted to studying them. Many students use this technique to some extent (e.g., using flashcards, Kornell & Bjork, 2007), especially given that textbook chapters often end with lists of key terms for review that encourage testing and evaluating the quality of what has been recalled.
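The self-testing routine just described can be sketched in a few lines of Python. This is a hypothetical illustration of the flashcard-style technique, not software from any study reported here; the function names are our own, and the student's evaluation is modelled as a callable precisely because that judgement is what can go wrong.

```python
# Hypothetical sketch of the flashcard-style self-testing routine described
# above; not software from this study. The student's own evaluation is
# passed in as a function, since that fallible judgement is the focus.

def restudy_list(definitions, recall, judge):
    """Return the key terms the student decides to restudy.

    definitions: dict mapping key term -> correct definition
    recall(term): the student's recalled answer for a term
    judge(answer, correct): True if the student deems the answer correct
    """
    to_restudy = []
    for term, correct in definitions.items():
        answer = recall(term)
        if not judge(answer, correct):
            to_restudy.append(term)  # judged incorrect: schedule more study
        # items judged correct are dropped from further study
    return to_restudy

defs = {"proactive interference":
        "Information already stored in memory interferes with "
        "the learning of new information"}

# An overconfident judge accepts any non-empty answer:
lenient = lambda answer, correct: bool(answer.strip())
dropped = restudy_list(
    defs, lambda term: "confusing information from different sources", lenient)
print(dropped)  # [] -- the commission error is dropped from study prematurely
```

The point of the sketch is that the loop is only as good as `judge`: a lenient judge silently removes incorrectly recalled items from further study.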



When students attempt to recall definitions and then check them against the correct answers, are they able to identify when their responses are incorrect? Unfortunately, the answer to this question is "not always". Even this intuitively effective strategy for evaluating performance can promote overconfidence. Consider outcomes from Rawson and Dunlosky (2007); they used a method to investigate students' self-evaluations of recall that we adopt in the present experiments, so we describe their method in detail first and then return to the central outcomes. College students studied a list of key terms from Introductory Psychology, such as "Proactive interference: Information already stored in memory interferes with the learning of new information". After studying the concepts, the students then attempted to recall each one from memory. For instance, a student would be asked, "What is proactive interference?" and may respond with "confusing information from different sources". We refer to this recall attempt as prejudgement recall, because immediately after attempting recall they judged the quality of their recall response by making the following self-score judgement: "If the correctness of the definition you just wrote was being graded, do you think it would receive no credit, partial credit, or full credit?" For the present example, the prejudgement recall is entirely incorrect; put differently, "confusing information from different sources" is a commission error, so the student should respond with "no credit".¹

College students' self-score judgements are substantially overconfident in that they often assign partial or full credit to commission errors. For instance, in Rawson and Dunlosky (2007), one group of students was not allowed to compare their answers to the correct definition when making self-score judgements. That is, they were given no standard of evaluation when scoring their recall. These students awarded either partial or full credit to 83% of their commission errors. Another group of students was allowed to check their recall responses against the correct definition. Students who received this full-definition standard showed significantly decreased overconfidence. Nevertheless, even with the correct definition available, these students still awarded themselves full or partial credit for 43% of the commission errors. Thus, this evaluation technique—which some teachers and textbooks recommend students use to evaluate their learning—can actually produce overconfidence (see also Baker, Dunlosky, & Hertzog, 2010), which in turn can yield ineffective restudy. As mentioned above, ineffective restudy would result because students typically stop studying items after they have retrieved them once from memory. Thus, if students believe they have correctly recalled a definition that they had actually incorrectly recalled, it would be prematurely dropped from study. Such overconfidence is even more troubling given that commission errors are relatively common when students are learning complex definitions (Dunlosky et al., 2005; Kikas, 1998; Rawson & Dunlosky, 2007).

Explaining and reducing overconfidence

A major goal of the present study is to evaluate the efficacy of a standard of evaluation—called an idea-unit standard—for reducing college students' overconfidence in their commission errors. As in previous research on confidence judgements (Dunlosky & Metcalfe, 2009), we are treating overconfidence as a description of a dependent variable; that is, people are overconfident whenever they judge commission errors to be anything but incorrect. In the present case, if idea-unit standards (described in detail below) diminish how often students judge that a commission error is partially or entirely correct, by definition this standard is decreasing overconfidence for commission errors. Of course, understanding why a particular standard reduces overconfidence is important, so one must consider cognitive processes that may explain why

¹ We chose this particular scale because we felt it would be easy for the students to interpret and apply to any given recall protocol. The scale was also chosen because it could be used by even young students (e.g., grade-school and middle-school students) who may have difficulties interpreting more sophisticated continuous scales (Lipko et al., 2009).



idea-unit standards may reduce overconfidence. Toward this end, we first consider the question, Why might students show substantial overconfidence even when they can compare their answer to a full-definition standard?

Although intuition suggests that full-definition standards will support highly accurate evaluations of the quality of one's answers, providing a full-definition standard does not indicate which aspects of the definition must be in a response for it to receive partial or full credit. Consider the example given above in which a student incorrectly recalls the definition of "proactive interference". Assume the student compares the full-definition standard with the response, "confusing information from different sources". In this case, prejudgement recall includes a word that is in the correct answer (i.e., the word "information"), which may lead the student to believe that the answer should receive some credit, even though the key ideas of "information already stored in memory" or "learning new information" were not in the answer. Thus, when using a full-definition standard, students (a) may not realize which ideas are required to obtain credit or (b) may not spontaneously attempt to deconstruct correct definitions into idea units, so as to use those to evaluate the quality of their answers.

Even if students do attempt to analyse the idea units within the correct definition when evaluating their answers, they may have difficulties keeping track of which ideas (and how many) were in their answer. That is, another reason why full-definition standards may not fully reduce overconfidence is that comparing answers to the standards can exceed working-memory limitations. Working memory involves storage and the simultaneous processing (or manipulation) of information. Working memory is capacity limited, which may result from a relatively fixed storage buffer and/or from limitations in executive control (for reviews on how it may be limited, see Cowan, 2005; Engle & Kane, 2004; Oberauer & Kliegl, 2006). Moreover, the construction–integration model of text comprehension assumes that due to this limitation, text material must be processed in cycles, with the comprehension system processing only about the equivalent of a simple sentence in working memory at a time (Kintsch, 1988, 1998). In relation to the present case, while using a full-definition standard, one of the definitions (either the standard or the response) must be kept active in the focus of attention (storage) while at the same time comparisons must be made between what is active in memory (e.g., the student's response) while inspecting the other definition (e.g., the standard). Besides comparing the two definitions, the student also must keep track of how much of the correct definition appears in the response, which provides a further burden on working memory. Given that merely reading a single multiclause definition may often exceed working-memory limitations, many students would not be able to keep an entire definition in the focus of attention while systematically comparing it to the response. In summary, our proposal is that overconfidence partly arises from the complexity of the comparison processes (which involve storage and processing of two definitions) that exceed working-memory limitations and hence lead to judgement error. Consistent with this possibility, working-memory limitations even contribute to people's overconfidence in their answers for general-knowledge questions (Hansson, Juslin, & Winman, 2008).

Most important, these explanations for why overconfidence occurs for full-definition standards predict that idea-unit standards will further reduce students' overconfidence in their commission errors. Idea-unit standards are constructed by parsing a definition into its constituent idea units.²

² We have adopted the term idea unit here rather than the more theoretically laden term proposition, in part because of differences in the grain size of segments used here versus those to which the term proposition is applied. In brief, propositions include atomic propositions and complex propositions (for detailed discussion, see Kintsch, 1998). Atomic proposition refers to the smallest unit of information that has a truth value and consists of a predicate and one or more arguments (e.g., BLUE[CAR], or BAKE[MELISSA,COOKIES]). Complex proposition refers to a larger representational unit usually comprising the multiple atomic propositions contained within a simple sentence. For present purposes, atomic propositions are at a finer grain and complex propositions at a coarser grain of analysis than the conceptual units into which we have parsed the key term definitions here. Accordingly, we use the term idea unit to refer to these intermediate conceptual units of information.
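As data, an idea-unit standard is simply the definition broken into its constituent units. The sketch below is a hypothetical illustration, not the authors' materials: the function names are our own, and the naive substring match is a crude stand-in for the student's judgement of whether a unit is present in a recall response.

```python
# Hypothetical sketch of an idea-unit standard and a checklist pass over it;
# not the authors' materials. Substring matching is a crude stand-in for the
# student's judgement that an idea unit is present in the response.

idea_units = [                      # a standard for "proactive interference"
    "information stored in memory",
    "interferes with learning",
    "of new information",
]

def checklist(response, units):
    """Mark each idea unit judged present in the recall response."""
    return [unit in response.lower() for unit in units]

def implied_credit(checks):
    """Credit the checked boxes imply on the three-point self-score scale."""
    if not any(checks):
        return "no credit"       # no idea units present: a commission error
    if all(checks):
        return "full credit"     # every idea unit appears in the response
    return "partial credit"      # some, but not all, idea units appear

checks = checklist("confusing information from different sources", idea_units)
print(checks, implied_credit(checks))  # [False, False, False] no credit
```

Checking one short unit at a time, and recording each outcome externally, is the working-memory-friendly comparison that the full-definition standard does not afford.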



For example, the definition of "proactive interference" may be parsed into three idea units: (a) information stored in memory, (b) interferes with learning, and (c) of new information. The students are asked to use these standards by judging whether each idea unit for a given definition is in the corresponding prejudgement recall response. To obtain these idea-unit judgements in the present experiments, all the idea units of a definition were presented along with the participant's prejudgement recall, and the students were asked to mark each idea unit that was present in the recall response. After identifying the idea units, the students made a self-score judgement (no credit, partial credit, or full credit) for that recall response.

Based on the aforementioned theoretical explanations for overconfidence in commission errors, idea-unit standards were expected to reduce college students' overconfidence for two reasons. The first reason is largely instructional: Because the idea units themselves indicate the key units of conceptual information, they would implicitly instruct students what and how much information is required in their responses to receive credit. The second reason is largely process oriented: The idea units would allow participants to evaluate their responses by sequentially checking whether each idea unit was in their response (Figure 1). The idea units are short in length, and hence this single comparison (between one idea unit and a response) would probably not exceed working memory. Moreover, once an idea unit is processed, the participants can mark on the computer screen whether the idea unit is present by clicking on the box beside the idea (see Figure 1); when they do, a check mark appears in the box, which would externalize some of the demands placed on working-memory resources (in this case, keeping track of how much of the critical information was in the answer being scored). After checking each idea unit, participants merely need to examine the screen to determine how much credit their response should receive. Most important, assuming students can accurately make idea-unit judgements, none should be checked for commission errors, which in turn is expected to reduce overconfidence for these responses.

Overview of experiments

In both experiments, college students studied definitions taken from an Introductory Psychology
Figure 1. Example of how idea-unit judgements were collected from students.



textbook. After studying the definitions, they attempted prejudgement recall for each one and then self-scored the quality of their recall. While making the self-score judgements, a standard was provided for all items. The kind of standard provided was manipulated between participants: Some participants received full-definition standards when making self-score judgements (as in Rawson & Dunlosky, 2007), whereas other participants made idea-unit judgements to use as a standard of evaluation for making the self-score judgements. For reasons described above, we were particularly interested in judgements for commission errors. Our analytic strategy was to compare students' mean judgements for their commission errors when they had either full-definition standards or idea-unit standards. Any judgement greater than 0 (i.e., no credit) would indicate that the student is overconfident.

A key prediction was that self-score judgements for commission errors would be lower for students who received idea-unit standards than for those who received full-definition standards. (Given that this prediction was based on an a priori hypothesis, we tested it using a planned comparison; Judd & McClelland, 1989.) To foreshadow, in Experiment 1, idea-unit standards did reduce students' overconfidence in commission errors. In Experiment 2, we attempted to replicate this outcome and also evaluated whether students' self-generated idea units would reduce overconfidence. We also included a group who scored other students' recall responses in Experiment 2, which allowed us to rule out an alternative explanation for students' overconfidence in commission errors.

EXPERIMENT 1

Participants, design, and materials

A total of 60 undergraduates participated to partially satisfy a course requirement in Introductory Psychology. The students were enrolled in a large university in Northeast Ohio. They were randomly assigned to one of two groups: full-definition standard (n = 32) or idea-unit standard (n = 28).

A total of 16 concepts from an Introductory Psychology textbook were used. Some examples are provided in the Appendix. Computers presented all materials and recorded all responses.

Procedure

Participants worked individually at computers. They were encouraged to take their time, to read the instructions carefully, and to work quietly. They began each session by reading detailed instructions that were presented on the computer screen.

Following the instructions, the key term definitions were presented one at a time for self-paced study. Thus, a participant would first study each definition—that is, the question stem (e.g., What is proactive interference?) along with the answer. Each participant paced his or her own study and clicked a button on the screen to indicate when they were finished studying a given definition. Order was randomized anew for each participant.

Immediately after studying all the definitions, the participants were presented with a key term (e.g., "What is proactive interference?") and attempted to recall its correct definition. Participants were instructed to try their best to generate the answer or as much of the answer as possible for each key term. They typed their answers into a text field on the screen. After this prejudgement recall attempt, they then judged the quality of their prejudgement recall for that key term.

For the full-definition group, immediately after attempting prejudgement recall for a key term, they were shown the correct definition along with their response and were then asked to make a self-score judgement using the options: "no credit", "partial credit", or "full credit". They were instructed to make this judgement by comparing their response to the correct answer.

In the idea-unit group, immediately after they attempted prejudgement recall for a key term, the participants were shown their response along



with the main ideas in the correct answer, which were presented together as illustrated in Figure 1. Participants were instructed to check the box next to each idea that they believed was present in their response. If none of the ideas were present, they were instructed to check the box labelled, "None of these ideas are present in my answer". After they had finished identifying idea units, participants continued to be shown their response along with the idea units (with those they checked being marked as present) and were asked to make a self-score judgement (i.e., no credit, partial credit, or full credit, as in the full-definition group).

The prejudgement recall attempts and all the judgements were paced by the participants. After completing the recall-judgement trial for a given key term definition, the next key term was presented for prejudgement recall and judgement. These recall-judgement trials continued until they had attempted to recall and judge each item once.

Results and discussion

In both experiments, all differences declared as significant had p < .05. For significant differences, we also provide Cohen's (1988) d as an estimate of effect size.

Prejudgement recall

Before reporting our most critical outcomes involving the self-score judgements, we briefly present analysis of prejudgement recall. Each prejudgement recall response was scored as entirely correct (if it included all the main ideas), as partially correct (contained one or more of the correct ideas but not all of them), or as incorrect (contained none of the correct ideas). Correctly recalled responses were scored as 100 (for 100% correct); partially correct recall responses were scored as 50 (for half credit), and incorrectly recalled responses were scored as 0. Means across participants' prejudgement recall scores were not significantly different for the group who received full-definition standards (33, SE = 3.0) than for those receiving idea-unit standards (30, SE = 4.0), t(58) = 0.70.

We also separated the prejudgement recall responses into one of four categories: omission error (the student provided no response during prejudgement recall), commission error (the student typed in a response, but it was completely incorrect), partially correct (the student's response contained at least one correct idea unit but not all of them), and correct recall (recall response contained all of the ideas from the correct answer). The percentages of each prejudgement recall response are presented in Table 1.

Self-score judgements

To investigate how people judged the quality of their prejudgement recall, we analysed the self-score judgements as a function of the categories of prejudgement recall responses. For each participant, we computed a mean across self-score judgements by assigning a value of 0 to self-score judgements of "no credit", 50 for "partial credit",

Table 1. Percentage of prejudgement recall responses within each response category

                                          Response category
Experiment     Standard             Omission  Commission  Partial  Correct
Experiment 1   Full definition          13.3        35.7     36.0     15.0
               Idea units               13.4        37.7     35.0     13.4
Experiment 2   Full definition           9.5        14.3     49.7     26.5
               E-gen idea units          9.8        17.8     49.5     23.0
               S-gen idea units          9.2        14.1     46.4     30.4

Note: E-gen = experimenter generated; S-gen = student generated.



and 100 for "full credit". We then computed the mean self-score judgement for responses within each of the four categories of response of prejudgement recall. Most important were the self-score judgements that participants made for commission errors. For commission errors, mean self-score judgements greater than 0 indicate overconfidence. Values are reported in Figure 2, which presents the mean self-score judgement (y-axis) conditionalized on the scoring of prejudgement recall into the categories of prejudgement recall response (x-axis). Because the median self-score value for omission errors was zero, we have omitted this category from the figure and analyses.

Given that the main objective of this research was to reduce students' overconfidence for their commission errors, our analysis first focuses on the mean judgements for the commission errors. The main prediction was that the mean judgement for commission errors would be lower (indicating less overconfidence) with idea-unit standards than with full-definition standards. Consistent with this prediction, the planned comparison for commission errors indicated that self-score judgements were significantly lower for the idea-unit standard group than for the full-definition standard group, t(57) = 4.12, d = 1.12. Thus, idea-unit standards did reduce students' overconfidence, although the mean judgement was significantly greater than 0, t(26) = 4.16, which indicated that some overconfidence remained.

Figure 2. Mean self-score judgement as a function of prejudgement recall responses in Experiment 1. Error bars are standard errors of each mean.

For a finer grained examination of the influence of standards on self-score judgements, we also computed the percentage of the self-score judgement ratings (full credit, partial credit, and no credit) that were assigned to commission errors (Table 2). To clarify what these values represent, consider the first row for Experiment 1, labelled "Full definition". When students who received full definition standards made commission errors during prejudgement recall, they assigned "no credit" to 52% of these commission errors, "partial credit" to 37% of these errors, and "full credit" to 11% of them. In contrast, participants who made idea-unit judgements assigned full or partial credit to only 18% of the commission errors (2% and 16%, respectively).

Although the mean self-score judgements for the other categories of objectively scored responses from prejudgement recall (i.e., partial correct and correct) are less relevant to the main objective of this research, we present some inferential statistics on the corresponding values in Figure 2 for completeness. Self-score judgements for partial responses were significantly lower for those receiving idea-unit standards than full-definition standards, t(55) = 5.15, d = 1.37. Finally, an unexpected trend also occurred for correct responses, with slightly lower self-score judgements when participants used idea-unit standards than full-definition standards (Figure 2). Although this trend was not statistically significant, t < 1.4, p > .10, we further evaluated whether idea-unit standards (as compared to full-definition standards) reduced self-scores more for commission errors than for correct responses. To do so, we conducted a post hoc test of the interaction between standard (idea unit vs. full definition) and prejudgement recall (commission vs. correct responses). The critical trend (indicating greater reduction in self-score judgements for commission errors than for correct judgements) was not significant, F(1, 41) = 0.52, suggesting that one reason idea-unit judgements may reduce overconfidence is by generally promoting conservatism in self-score judgements. We evaluate this interesting possibility again in



Table 2. Percentage of the three judgements made for commission errors

                                        No credit     Partial credit   Full credit
Experiment     Group                    M     SE      M     SE         M     SE
Experiment 1   Full definition          52     5      37     4         11     2
               Idea unit                82     4      16     4          2     1
Experiment 2   Full definition (FD)     36     9      37     9         26     9
               Other-FD                 37    10      39    10         24     9
               E-gen idea units         73     7      24     6          3     3
               S-gen idea units         82     6      18     6          1     1

Note: Judgements: no credit, partial credit, and full credit. Commission errors were student responses during prejudgement recall that were objectively scored as entirely incorrect. FD = full definition; E-gen = experimenter generated; S-gen = student generated. SE = standard error of the mean. Some rows do not sum to 100 due to rounding error.

Experiment 2 and revisit the issue of conservatism in the General Discussion.

Accuracy of the idea-unit judgements
Concerning the accuracy of students’ idea-unit judgements, when participants reported that an idea unit was in their response, was it actually there? Two measures were most relevant. Correct identification was defined as the probability that a participant reported an idea unit as present, given that the idea unit from the correct definition was present in the participant’s prejudgement recall. Incorrect identification was defined as the probability that a participant reported that an idea unit from the correct definition was present in their prejudgement recall, given that it was absent. Trained assistants scored the participants’ recall (and consistency was 97% after initial training). As shown in Table 3 (first row), college students’ idea-unit judgements were relatively accurate: They correctly identified the presence of idea units in their answers 82% of the time, and, on average, they made incorrect identifications only 18% of the time.
In summary, given the low rate of incorrect identifications, one reason that idea-unit standards improved participants’ ability to evaluate commission errors may be that presenting the idea units explicitly indicated which whole ideas (and not just individual words) were required to receive credit (Figure 1). Thus, it seems likely that idea-unit standards influence participants’ judgements by constraining their attention to relevant information. Nevertheless, just partitioning the definition into smaller units may help, because doing so would encourage a sequential comparison of portions of the definition against the response, which would not exceed working-memory limitations. If so, perhaps even a student who generates his or her own idea units will benefit from doing so. We explore this possibility in Experiment 2.

Table 3. Accuracy of the idea-unit judgements

Identifications

No heading  Correct  Incorrect

Experiment Group M SE M SE

Experiment 1 Idea units .82 .03 .18 .05
Experiment 2 E-gen idea units .71 .04 .08 .02
Experiment 2 S-gen idea units .72 .05 .13 .03

Note: Correct identifications refer to the probability that participants reported that an idea unit was in their response given that it was present in their response. Incorrect identifications refer to the probability that participants reported that an idea unit was in their response given that it was not present in their response. The values do not need to sum to 1.0. E-gen = experimenter generated; S-gen = student generated. SE = standard error of the mean.
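The two measures in Table 3 are conditional proportions, which can be made concrete with a short sketch (the idea-unit data here are invented for illustration):

```python
# Invented idea-unit data (not the study's): each tuple records whether an
# idea unit from the correct definition was actually present in the
# student's recall, and whether the student reported it as present.
judgements = [
    (True, True), (True, True), (True, True), (True, False),  # present units
    (False, False), (False, False), (False, True),            # absent units
]

reported_when_present = [reported for present, reported in judgements if present]
reported_when_absent = [reported for present, reported in judgements if not present]

# Correct identification: P(reported present | actually present).
correct_id = sum(reported_when_present) / len(reported_when_present)
# Incorrect identification: P(reported present | actually absent).
incorrect_id = sum(reported_when_absent) / len(reported_when_absent)
# The two proportions condition on different events, so they need not sum to 1.0.
```

With these invented data the correct-identification rate is .75 and the incorrect-identification rate is about .33; as the table note explains, the two values condition on different events and so are not complements.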



EXPERIMENT 2

In Experiment 2, we extended the main findings of Experiment 1 to answer two main questions: Why does making idea-unit judgements reduce overconfidence in commission errors? And, can students generate idea units that will reduce their overconfidence? To the extent that students can implement idea-unit judgements on their own, this metacognitive technique can be broadly applied. Given that we did not need a full factorial design to answer these questions, we separately discuss each one and the planned comparisons used to answer them.

Why do idea-unit judgements produce less overconfidence in commission errors than do full-definition standards?
One answer is that the latter standards may provoke a relatively lenient evaluation of recall responses. For instance, when judging a commission error, students may focus on the gist of the correct answer with the full-definition standard and subsequently assign some credit to an error because they believed that they knew the correct gist but did not correctly translate it into words during prejudgement recall. Put differently, when students evaluate their response in comparison to the full-definition standard, they may think “I knew it, that’s what I meant to say, but I just did not say it right”. If so, then overconfidence in commission errors should be minimized for students who evaluate the quality of other students’ recall. A student who evaluates another student’s recall responses can only respond to the quality of the responses and hence cannot be biased by what the other student believes he or she actually knows; that is, when scoring another student’s responses, a scorer does not know what the student had originally “meant to say” but only what they did actually recall.
To evaluate this possibility, college students studied definitions, attempted to recall each one from memory, and then self-scored their recall. As in Experiment 1, one group received full-definition standards while scoring their recall. Each participant in this group was also yoked to another participant who scored the yoked partner’s recall responses. If full-definition standards provoke a lenient evaluation of one’s own responses, then those who score other participants’ responses should demonstrate reduced overconfidence in commission errors.

Can students benefit from generating idea-unit standards?
If students are to use idea-unit judgements to evaluate the quality of their recall of class concepts, then they will need to generate the idea units themselves whenever they are not provided by textbooks or teachers. The accuracy of self-score judgements for participants who generate idea units also provides a test of one possible reason for why idea-unit judgements reduce overconfidence. One hypothesis is that experimenter-provided idea units reduce overconfidence because college students who are provided with a full-definition standard do not have the ability to reduce it into meaningful idea units. The experimenter-generated idea units may provide information about each definition that students could not develop themselves. If so, students who are asked to generate idea units will show the same level of overconfidence as students who received full-definition standards. Alternatively, the act of searching for ideas and dividing a response into smaller parts may help students evaluate their response by reducing the use of working-memory processes. In this case, self-generated idea units may yield the same reduction in overconfidence as experimenter-provided idea units because they encourage the use of idea units.
To evaluate these alternative hypotheses, the same basic procedure described above was used, with the addition of two groups. Before scoring recall responses, one group made idea-unit judgements based on those provided by the experimenter. To provide a replication of the main outcome in Experiment 1, the performance of this group will be compared to that of the full-definition group (described above). More important, after attempting prejudgement recall,



participants in the self-generated group were given the full-definition standard and were asked to identify the main ideas within the definition by making slashes between each idea within it. (To allow students to make these self-generated idea units, paper-and-pencil administration was used for all groups.) After generating the idea units, they circled each idea that they believed was in their response and then made a self-score judgement of their response. In this way, the idea-unit groups were identical except that one circled ideas separated by slashes generated by the experimenters, whereas the other circled their self-generated ideas. Comparing their overconfidence in commission errors will allow an evaluation of the alternative hypotheses described above.

Participants and materials
A total of 95 undergraduates participated to partially satisfy a requirement in Introductory Psychology. They were randomly assigned to one of four groups: full-definition standards (n = 21), full-definition standards but scoring another participant’s recall (n = 21), experimenter-provided idea-unit standards (n = 25), or self-generated idea-unit standards (n = 28). The concepts from Experiment 1 were used.

Procedure
Participants worked individually at desks. They were encouraged to take their time, to read the instructions carefully, and to work quietly. They began each session by reading a page of instructions. Following the instructions, the key-term definitions were presented, one per page in a study booklet, for self-paced study. Thus, a participant would first study each term (e.g., proactive interference) along with its definition. Participants simply turned a page when they were ready to study the next definition. Each participant paced his or her own study and continued to study each definition until all 16 had been studied. Immediately after studying the definitions, the study booklet was taken away, and participants were given a recall booklet. In this booklet, students were presented with a key term (e.g., “proactive interference”) and were instructed to recall and write down its correct definition. Participants were instructed to try their best to generate the answer or as much of the answer as possible for each key term. They wrote their answers in blanks provided in the booklet. After the prejudgement recall attempt for all 16 terms was completed, the participants were given a self-scoring booklet (to be used side by side with the recall booklet) in which they then judged the quality of their prejudgement recall for each of the key terms.
For the full-definition group, the self-scoring booklet showed the correct definition and instructed each participant to compare it to his or her own answer and then make a self-score judgement using the options “no credit”, “partial credit”, or “full credit”. This was repeated for each of the 16 terms.
For the score-other full-definition group, participants began scoring another student’s work immediately after studying all the definitions. Thus, unlike the other groups, they did not attempt prejudgement recall; instead, they were each randomly yoked to a student in the full-definition group (above), and they scored that student’s answers with the same scale as that used for self-score judgements. A scoring booklet showed the other student’s typed answers for each of the 16 terms and was presented side by side with a booklet of correct definitions. Participants were instructed to compare the other student’s answers to the correct definitions and then to make a score judgement (“no credit”, “partial credit”, or “full credit”) for each term.
In the experimenter-provided idea-unit group, the self-scoring booklet showed the main ideas in the correct answer, which were clearly separated by slash marks in the definition. Participants were instructed to compare each idea unit to their recall response and to circle any idea unit that they believed was present in their response. After they finished identifying idea units for a given definition, participants were asked to make a self-score judgement. In the self-generated idea-unit group, the self-scoring booklet showed the



correct definition and instructed the participant to divide it up into “idea units”—that is, the most important chunks of information that should be present in the answer for it to be considered correct. Participants denoted these idea units by using a red pen to draw slash marks in the definition to divide it into two or more idea units. After idea units were created for a given definition, participants were instructed to compare the idea units to their recall response and to circle any idea unit that they believed was present in their response. After they had finished identifying idea units, they were asked to make a self-score judgement. All the judgements were paced by the participants.

Results and discussion

Prejudgement recall
Means across participants’ prejudgement recall scores were not significantly different for the three groups who attempted prejudgement recall (i.e., those who scored another participant’s recall did not attempt recall), F(2, 71) = 0.73, MSE = 0.03: full-definition standards (M = 51, SE = 3.0), experimenter-generated idea-unit standards (M = 48, SE = 3.8), and self-generated idea-unit standards (M = 54, SE = 3.6). The percentages of each prejudgement recall response are presented in Table 1.

Figure 3. Mean self-score judgement as a function of prejudgement recall response. Error bars are standard errors of each mean in Experiment 2. FD = full-definition group; other-FD = participants who scored recall responses from the FD group. Exp idea units = idea-unit standards provided by experimenter; student idea units = idea-unit standards generated by students.

Self-score judgements
Mean self-score judgement conditionalized on prejudgement recall is presented in Figure 3. Because the median self-score values for omission errors were zero, we have omitted this category from the figure and analyses. Given that the main objective of this research was to reduce students’ overconfidence for their commission errors, our analysis first focuses on the mean judgements for the commission errors. We conducted a series of planned comparisons that were aimed at answering the main questions posed in the introduction to Experiment 2. First, we compared judgements for those in the full-definition group with those from the score-other full-definition group. Their judgements did not significantly differ, t(16) = 0.12. Thus, the overconfidence that occurs when participants use the full-definition standard cannot be attributed to their awarding credit because they believed that they knew the correct answer but did not respond correctly. Moreover, this outcome rules out other biases that could arise from students’ beliefs that they should receive some credit because they merely tried to recall the definitions. Instead, students in general appear to have difficulties in identifying commission errors even when they have the correct answer at their disposal. We scrutinize other possible reasons for this overconfidence—and why idea-unit judgements reduce it—in the General Discussion.
Second, as evident from inspection of Figure 3, the main outcome from Experiment 1 was replicated: As compared to full-definition standards, idea-unit standards reduced overconfidence in commission errors, t(34) = 5.74, d = 1.9. More important, self-generated idea-unit standards also reduced overconfidence, t(37) = 4.40, d = 1.4, and this effect of idea-unit standards on self-score judgements for commission errors was just as large for self-generated idea units as for experimenter-provided idea units, t(37) = 0.91. Self-score judgements for the idea-unit groups were significantly greater than 0, ts > 3.0, ps < .05, indicating that some overconfidence remained. In general, these outcomes indicate that students are capable of generating idea units that can help



them more accurately evaluate commission errors. Thus, students in the full-definition group either are not attempting to use this strategy when evaluating their responses or are attempting to do so but are not as good at executing the strategy given that they cannot externalize the output from the relevant processes (e.g., by marking idea units and then circling those in their answer).
To permit a finer grained examination of the influence of standards on self-score judgements, we also computed the percentage of the self-score judgement ratings (full credit, partial credit, and no credit) that were assigned to commission errors (Table 2). When students evaluated their recall by comparing it to the full definition, they awarded credit to 63% of the commission errors. By contrast, when the students first made idea-unit judgements and then self-scored their recall, they awarded credit to only 19–27% of the commission errors.
Although the mean self-score judgements for the other categories of objectively scored responses from prejudgement recall (e.g., partially correct and correct) are less relevant to the main objective of this research, we present some inferential statistics on the corresponding values in Figure 3 for completeness. One-way analyses of variance (ANOVAs) revealed that self-score judgements were greater for the full-definition group (who scored their own recall responses) than for the idea-unit groups both for partially correct responses, F(2, 71) = 7.5, MSE = 0.03, and for correct responses, F(2, 70) = 6.0, MSE = 0.04. As in Experiment 1, we also conducted post hoc analyses to evaluate whether idea-unit (vs. full-definition) standards reduced self-scores more for commission errors than for correct responses. The 2 (standard: full definition vs. experimenter-provided idea unit) × 2 (correct vs. commission) ANOVA revealed a significant interaction, F(1, 54) = 4.30, indicating that the reduction in self-score judgements was larger for commission errors than for correct responses. Finally, comparing the two full-definition groups, judgements for those who scored another participant’s responses were significantly elevated for partially correct responses, t(19) = 2.8, whereas judgements for these groups did not differ for correct responses, t(19) = 1.0.

Accuracy of the idea-unit judgements
As shown in Table 3, idea-unit judgements for students with experimenter-provided idea units were relatively accurate: They correctly identified the presence of idea units in their answers 71% of the time, and, on average, they rarely made incorrect identifications. To evaluate the accuracy of idea-unit judgements for students who generated idea units, we rescored these students’ recall responses as a function of the idea units that they had generated and then computed the measures of accuracy (Table 3). As with experimenter-provided idea units, students’ judgements for identifying whether their responses contained (or did not contain) their self-generated idea units were also relatively accurate. The two idea-unit groups did not significantly differ either on correct identifications, t(50) = 0.14, or on incorrect identifications, t(50) = 1.23.

Self-generated idea units
To examine the consistency of self-generated idea units with those generated by the experimenter, we computed a correlation for each participant who generated idea units. Each correlation was computed across definitions between the number of self-generated idea units and the number of experimenter-generated idea units. The mean across the individual participants’ correlations was r = .28, indicating that the consistency in the number of idea units was relatively low between the students and experimenters. Moreover, across participants, this value (i.e., the correlation between the number of self- vs. experimenter-generated idea units) correlated positively with overconfidence in commissions (r = .47, p = .03). In this case, overconfidence in commission errors was actually greater when participants’ self-generated idea units corresponded more closely with the number of experimenter idea units. These outcomes suggest that to benefit from using idea units, students’ generated units do not need to be identical to those being used to score their responses (i.e., those used by experimenters or teachers). Thus, at least



for the psychology definitions used in the present experiment, encouraging students to evaluate smaller chunks of information (which would reduce the burden on working-memory resources) will improve their ability to accurately identify commission errors.

GENERAL DISCUSSION

When students evaluate the quality of their recall of recently studied definitions, their judgements are overconfident for commission errors. Surprisingly, such overconfidence even occurs when students are given the correct definitions as standards of evaluation to compare to their answers and when they evaluate other students’ recall responses. A contribution of the present experiments was demonstrating that college students’ overconfidence in commission errors can be reduced when they compare their recall responses either to experimenter-provided idea-unit standards (Experiments 1 and 2) or to self-generated idea-unit standards (Experiment 2). Given the promise of this metacognitive technique, we consider some potential theoretical explanations for the effects of idea-unit standards and then discuss the educational implications of this research.

Why do idea-unit standards reduce students’ overconfidence in commission errors?
In contrast to full-definition standards, idea units provide a more detailed standard of evaluation and are less likely to exceed working-memory limitations as students evaluate their responses. More specifically, idea-unit standards provide information that illustrates how students should evaluate the quality of their recall by indicating which concepts in a definition are required in an answer to receive partial or full credit. Of course, if students cannot accurately identify which idea units are in their answers, their judgement accuracy would probably not benefit from idea-unit standards. Fortunately, without any training, students in both experiments demonstrated an impressive ability to identify when idea units were not present in their answers (see the relatively low levels of incorrect identifications, Table 3).
The fact that students who generated idea units also showed reduced overconfidence in their commission errors has implications for why full-definition standards do not entirely reduce overconfidence. When using full-definition standards, the students either do not attempt to parse the full definitions into their constituent ideas, or the processing requirements needed to do so exceed working-memory limitations. Concerning the latter possibility, when full-definition standards are presented, some students may attempt to parse the definitions for comparison with their answers. But even if they do, they may have difficulty making comparisons while keeping key information activated in memory, such as which ideas they had already evaluated and how many of the idea units were present in their answer. Although one might assume that this limitation would constrain accuracy when students generated their idea units (Experiment 2), this possible limitation was sidestepped given the evaluation format. In particular, they first marked the ideas for each definition on a sheet of paper and then circled each idea that they believed was in their answer. Thus, this critical information was externalized and would not demand cognitive resources to keep it activated as students self-scored their responses.
Finally, given that students’ self-generated idea units were not identical to those generated by the experimenter (those used for objectively scoring responses), perhaps just externalizing the outputs from the comparison process is enough to reduce overconfidence in commission errors. That is, almost any breakdown of the definitions may work, because it encourages students to consider smaller units of information as they evaluate their responses. We leave further evaluations of these possibilities to future research, because regardless of the final explanation for the present effects, idea-unit standards do help students more accurately evaluate commission errors.
Another reason why idea-unit standards may help is that they could promote overall



conservatism in students’ evaluations. For instance, as compared to students who received full-definition standards, those who used idea-unit standards not only had reduced overconfidence for commission errors but also tended to make lower self-score judgements for partially correct and correct responses. Thus, making idea-unit judgements may generally produce more conservative self-evaluation. Although possible, the data suggest that a shift toward conservatism cannot entirely account for why idea-unit standards reduced overconfidence in commission errors. As evident in Figures 2 and 3, making idea-unit judgements lowered self-score judgements more for commission errors than for correct responses. As indicated by post hoc analyses, this interaction was not significant in Experiment 1, but it was significant in Experiment 2.
Moreover, given the promise of idea-unit standards for reducing college students’ overconfidence, we recently evaluated whether middle-school students would also benefit from using idea-unit judgements (Lipko et al., 2009, Experiment 2). As in the present experiments, overconfidence in commission errors was significantly lower when middle-school students used idea-unit standards than when they used full-definition standards. Most relevant here, the idea-unit standards did not influence self-score judgements for correct responses; in fact, regardless of the group, middle-school students’ self-score judgements were at least 90% for correct responses. Thus, even though idea-unit standards do appear to produce some conservatism in college students’ self-evaluations, such conservatism will not always account for the large reduction in overconfidence for commission errors.
To summarize the theoretical implications of this research for understanding overconfidence in commission errors, evidence from the present experiments disconfirmed one plausible hypothesis for such overconfidence when students use full-definition standards; namely, this overconfidence cannot be attributed to students inflating their scores because they believe that they know the correct answer even though they did not correctly recall it (Experiment 2). As important, we considered three possible factors that may contribute to why idea-unit standards reduce students’ overconfidence in commission errors: They induce students to use a more detailed standard of evaluation, they reduce the resources needed to compare one’s answer to the standard, and they generally lead to a conservative judgement bias. These factors are not mutually exclusive, and estimating their joint contribution to the effects of using idea-unit standards could be accomplished using variations on the methods introduced in the present research.

Implications for education
In many classes, students will be asked to learn numerous key-term definitions, whether in Introductory Psychology, Biology, English, Art, or Physics. They attend classes, they study notes and their texts, and some students even test themselves to evaluate their learning and to prepare for upcoming examinations (Kornell & Bjork, 2007). Indeed, teachers should encourage students to use self-testing in preparation for exams: Not only can testing improve memory itself (for a recent review, see Roediger & Karpicke, 2006), but these tests can help students evaluate their learning. Roediger and Karpicke described the latter benefits of self-testing as mediated effects, because testing can allow students to identify which materials they do not know and in turn can mediate further learning by allowing students to focus restudy on less well known material (e.g., Thiede et al., 2003). To benefit students’ evaluations, however, self-testing must allow students to accurately identify which materials have not yet been well learned. A chilling conclusion from the present experiments is that students can be overconfident even when they compare the outcomes of their self-tests to the completely correct answers. Considering that teachers and textbooks instruct students to evaluate their learning in this manner, no doubt this seemingly foolproof technique can contribute to student overconfidence.
Not only did idea-unit standards reduce overconfidence in commission errors, they also led to



better calibrated judgements for partially correct responses from prejudgement recall. For instance, in Experiment 1, students receiving full-definition standards indicated that 46% of their partially correct responses should receive full credit, whereas this value was reduced to 13% for those receiving the idea-unit standards, t(56) = 5.1, p < .05. In Experiment 2, the corresponding values were 43%, 12%, and 24% for the full-definition group and the experimenter- and student-generated idea-unit groups, respectively. In both comparisons with the full-definition standard, the idea-unit standards significantly reduced students’ overconfidence in their partially correct responses, ts > 3.0, ps < .01. Thus, across both experiments, students using idea-unit standards more accurately identified when their answers were partially (but not fully) correct.
The idea-unit standards appeared to limit the accuracy of students’ evaluations for only correctly recalled responses. In particular, students’ self-score judgements showed more underconfidence for correctly recalled responses when they received idea-unit standards. In the worst-case scenario, this underconfidence may lead students to restudy and overlearn some definitions that they already know. Although such overlearning may decrease the efficiency of learning, most students (and teachers) probably would prefer underconfidence that yields some overlearning to substantial overconfidence that may lead to premature termination of restudy for unlearned material. Ironically, however, the underconfidence may actually benefit students’ performance, because even after a student correctly recalls an item, their memory for it improves after further recall and restudy attempts (Pyc & Rawson, 2009). A potential difficulty here is that when students judge that they have correctly recalled an answer, they may be unlikely to restudy it even when doing so would boost their learning even further. Accordingly, just accurately judging one’s responses will not be a panacea for obtaining durable memories. Instead, to reap the benefits of accurate judgements, students may need to be trained how best to use their judgements to guide study within and across practice sessions.
Even so, given that idea-unit standards can reduce overconfidence in commission errors, we recommend that students adopt this metacognitive technique when evaluating their recall. Students could be easily trained to use idea-unit standards and apparently have the skill to develop their own idea units for definitions. Even end-of-chapter tests in textbooks could be altered to support students’ evaluations: Glossaries of definitions could place breaks between the core idea units of definitions, so that students could mark those ideas that were in their answers. Even if such recommendations are somewhat premature, the promise of idea-unit standards for improving students’ self-evaluations certainly warrants further empirical scrutiny.

Original manuscript received 19 January 2010
Accepted revision received 04 May 2010
First published online 10 August 2010

REFERENCES

Baker, J. M. C., Dunlosky, J., & Hertzog, C. (2010). How accurately can older adults evaluate the quality of their text recall? The effect of providing standards on judgment accuracy. Applied Cognitive Psychology, 24, 134–147.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cowan, N. (2005). Working memory capacity. New York: Psychology Press.
Dunlosky, J., & Lipko, A. R. (2007). Metacomprehension: A brief history and how to improve its accuracy. Current Directions in Psychological Science, 16, 228–232.
Dunlosky, J., & Metcalfe, J. (2009). Metacognition. Los Angeles: Sage.
Dunlosky, J., Rawson, K. A., & Middleton, E. L. (2005). What constrains the accuracy of metacomprehension judgments? Testing the transfer-appropriate-monitoring and accessibility hypotheses. Journal of Memory and Language, 52, 551–565.
Engle, R. W., & Kane, M. J. (2004). Executive attention, working memory capacity, and a two-factor
Concepts used in Experiment 1

Term: Proactive interference
Full definition: Information already stored in memory interferes with the learning of new information
Idea units: (1) Information in memory; (2) interferes with; (3) new information

Term: Confirmation bias
Full definition: The tendency to only seek out or attend to information that confirms one's belief and to ignore counter evidence
Idea units: (1) Seek out or attend to info; (2) confirms one's belief; (3) ignore counter

Term: Difference threshold
Full definition: The smallest difference between two stimuli that can be detected reliably
Idea units: (1) Smallest difference; (2) between two stimuli; (3) that can be detected
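The idea-unit standard above amounts to a simple data structure: each key term maps to a short list of required idea units, and a response's score is the proportion of units it contains. The sketch below illustrates that representation in Python. It is only a toy: the experiments relied on participants' own judgements of whether each idea unit was present, not on automated matching, and the naive keyword check used here (every word of an idea unit must appear in the response) is an assumption for illustration.

```python
# Toy sketch of idea-unit scoring (illustrative only; the study used
# human judgements, not automated matching). Each key term maps to the
# idea units listed in the appendix table above.

IDEA_UNITS = {
    "proactive interference": [
        "information in memory",
        "interferes with",
        "new information",
    ],
    "difference threshold": [
        "smallest difference",
        "between two stimuli",
        "that can be detected",
    ],
}


def idea_unit_score(term, response):
    """Return the fraction of idea units for `term` whose words all
    appear somewhere in the recalled response (naive matching)."""
    response_words = set(response.lower().split())
    units = IDEA_UNITS[term]
    hits = sum(
        1 for unit in units
        if set(unit.lower().split()) <= response_words
    )
    return hits / len(units)
```

For example, a complete recall of the difference-threshold definition ("the smallest difference between two stimuli that can be detected reliably") would score 1.0, whereas a vague paraphrase containing none of the unit wording would score 0.0. Per-unit scoring of this kind is what lets a commission error receive partial rather than all-or-none credit.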