Strand 11
CONTENTS
Chapter 1: Introduction (p. 1)
[The remaining chapter titles and page numbers did not survive extraction; contributors include Chris Harrison and Hildegard Urban-Woldron.]
INTRODUCTION
Strand 11 focuses on the evaluation and assessment of student learning and development. Many studies presented in other conference strands, of course, involve the assessment of student learning or of affective characteristics and outcomes, such as students' attitudes or interests, and use existing instruments or new ones developed for the study in hand. In such studies, assessment instruments are tools used to explore and answer other questions of interest. In Strand 11, the emphasis is on the development, validation and use of assessment instruments; the focus is on the instrument itself. These can include standardized tests, achievement tests, high-stakes tests, and instruments for measuring attitudes, interests, beliefs, self-efficacy, science process skills, conceptual understandings, and so on. They may be developed with a view to making assessment more authentic in some sense, to facilitate formative assessment, or to improve summative assessment of student learning.
Fifteen papers presented in this strand are included in this book of e-proceedings. Four of them discuss the development of new or modified instruments to assess students' conceptual understanding of a science topic. Two use the two-tier multiple-choice format that many researchers have found valuable for probing understanding, to explore the topics of electric circuits and geometrical optics. Another explores the factors that may underlie the observed patterns in students' responses, trying to tease out the relative importance of mathematical and physical ideas in determining performance on questions about kinematics. A fourth paper begins the exploration of a relatively new domain, systems thinking. Here assessment items have a particularly significant role to play in helping to define the domain in operational terms and in facilitating discussion within the science education research community.
Four papers explore issues concerning the assessment of practical competence and
skills. One looks at the general issue of developing a model to describe progress in
carrying out hands-on activities; another focuses more specifically on experimental
skills in physics; and a third considers performance assessment in the context of initial
teacher education. The fourth paper looks at the potential use of simulations as
surrogates for bench practical activities. Work in this domain is important, as science educators seek to come to a better understanding of the factors that lead to variation in students' responses to practical tasks.
Three papers look in different ways at the influence of context on students' responses to tasks. Two take the PISA studies as their starting point, looking in detail at the thinking of students as they respond to PISA tasks and questioning the extent to which the PISA interpretation of authenticity enhances student interest and engagement with assessment tasks. Both point to the value of listening to students talking about their thinking as they answer questions, and suggest that this may be quite different from what we would expect, and perhaps hope. A third paper with an interest in the effects of contextualisation presents data from a study in Brazil comparing students' answers to sets of parallel questions with fuller and more abridged contextual information. The findings have implications for item design, and suggest that reading demands should be kept carefully in check if we aim to assess science learning.
Three papers in this section explore the formative use of assessment. One has a focus
on the assessment of learning that results from inquiry-based science teaching.
Another looks at the ways in which students respond to formative feedback on their
work. The context for this study is web portfolios, but the research question is one
with wider applicability to other forms of feedback, and across science contents more
generally. The third uses an experimental design to explore the impact on student
learning in a topic on chemical reactions of a self-evaluation instrument that asks
students to try to monitor their own learning and to take action to address areas in
which they judge themselves to be weak.
All of the papers described above collect data from students of secondary school age or from prospective teachers. The final paper in this strand looks at the potential use of an attitude assessment instrument to predict undergraduate students' success in learning chemistry.
The set of papers highlights the key role of assessment items and instruments as
operational definitions of intended learning outcomes, bringing greater clarity to the
constructs used and to our understanding of learning in the domains that they study.
Teaching according to these new curricula starts with the design of performance assessments suitable for assessing a specific skill and with the creation of a rubric for that assessment. The teacher then plans exercises beneficial for student development and finally decides on the time needed and plans the activities accordingly.
mass transfer, energy transformation, technical design, and phase changes. The latter
is presented here in detail.
Core content
In this case, according to the curriculum for the compulsory school (Skolverket, 2010), teaching in science studies should deal with the core content presented in Table 1.
Table 1
Core content in the Swedish compulsory school curriculum relevant for phase transitions and scientific studies.

In years 1-3: Various forms of water: solids, liquids and gases. Transitions between the forms: evaporation, boiling, condensation, melting and solidification. Simple scientific studies.

In years 4-6: Simple particle model to describe and explain the structure, recycling and indestructibility of matter. Movements of particles as an explanation for transitions between solids, liquids and gases. Simple systematic studies: planning, execution and evaluation.

In years 7-9: Particle models to describe and explain the properties of phases, phase transitions and distribution processes for matter in air, water and the ground. Systematic studies: formulating simple questions, planning, execution and evaluation. The relationship between chemical experiments and the development of concepts, models and theories.
Knowledge requirements
The knowledge requirements are related to the age of the students and show a clear progression through school. At the end of the third, sixth and ninth years there are clearly defined knowledge requirements (Table 2). Grades are introduced in the sixth year, and levels for grades E (lowest), C and A (highest) are described in the curriculum. Grades D and B are also used: grade D means that the knowledge requirements for grade E and most of those for C are satisfied, and grade B that the requirements for C and most of those for A are satisfied.
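As a rough illustration, the D/B rule can be written as a small decision function. The sketch below is ours, not the curriculum's: in particular, the numerical threshold used for "most" is an assumption, since the curriculum does not quantify it.

```python
def overall_grade(met):
    """Illustrative sketch of the Swedish E-A grading rule described above.
    `met` maps a grade level ('E', 'C', 'A') to the fraction of its
    knowledge requirements the pupil satisfies (1.0 = all of them)."""
    MOST = 0.5  # assumed threshold; the curriculum does not quantify "most"
    if met.get("E", 0.0) < 1.0:
        return "F"                      # requirements for E not fully met
    if met.get("A", 0.0) >= 1.0:
        return "A"
    if met.get("C", 0.0) >= 1.0:
        # all of C met; B additionally requires "most" of A
        return "B" if met.get("A", 0.0) > MOST else "C"
    # all of E met; D additionally requires "most" of C
    return "D" if met.get("C", 0.0) > MOST else "E"
```

Under this assumed threshold, a pupil who satisfies all of E's requirements and 60% of C's would receive a D.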
Table 2
Knowledge requirements for different years and grades

Year 3:
Based on clear instructions, pupils can carry out [...] simple studies dealing with nature and people, power and motion, and also water and air.

Year 6:
Grade E: Pupils can talk about and discuss simple questions concerning energy. Pupils can carry out simple studies based on given plans and also contribute to formulating simple questions and planning which can be systematically developed. In their work, pupils use equipment in a safe and basically functional way. Pupils can [...] contribute to making proposals that can improve the study.
Grade C: Pupils can talk about and discuss simple questions concerning energy. Pupils can carry out simple studies based on given plans and also formulate simple questions and planning which after some reworking can be systematically developed. In their work, pupils use equipment in a safe and appropriate way. Pupils can [...] make proposals which after some reworking can improve the study.
Grade A: Pupils can talk about and discuss simple questions concerning energy. Pupils can carry out simple studies based on given plans and also formulate simple questions and planning which after some reworking can be systematically developed. In their work, pupils use equipment in a safe, appropriate and effective way. Pupils can [...] make proposals which can improve the study.

Year 9:
Grade E: Pupils can talk about and discuss questions concerning energy. Pupils can carry out studies based on given plans and also contribute to formulating simple questions and planning which can be systematically developed. In their studies, pupils use equipment in a safe and basically functional way. Pupils apply simple reasoning about the plausibility of their results and contribute to making proposals on how the studies can be improved. Pupils have basic knowledge of energy, matter, [...] and show this by giving examples and describing these with some use of the concepts, models and theories.
Grade C: Pupils can talk about and discuss questions concerning energy. Pupils can carry out studies based on given plans and also formulate simple questions and planning which after some reworking can be systematically developed. In their studies, pupils use equipment in a safe and appropriate way. Pupils apply developed reasoning about the plausibility of their results and make proposals on how the studies can be improved. Pupils have good knowledge of energy, matter, [...] and show this by explaining and showing relationships with relatively good use of the concepts, models and theories.
Grade A: Pupils can talk about and discuss questions concerning energy. Pupils can carry out studies based on given plans and also formulate simple questions and planning that can be systematically developed. In their investigations, pupils use equipment in a safe, appropriate and effective way. Pupils apply well developed reasoning concerning the plausibility of their results in relation to possible sources of error, make proposals on how the studies can be improved and identify new questions for further study. Pupils have very good knowledge of energy, matter, [...] and show this by explaining and showing relationships between them and some general characteristics with good use of the concepts, models and theories.
Table 3
Assessment rubric for assessing skills in an experiment on phase changes

Use of theory
  Sufficient: The student draws simple conclusions partly related to chemical models and theories. ("I can see stearic acid in solid, liquid and gas phase.")
  Good: The student draws conclusions based on chemical models and theories. ("The heat of the candle causes the phase transfer between the phases.")
  Better: The student draws well-founded conclusions from chemical models and theories. ("Stearic acid must be in the gas phase and mix with oxygen to burn.")

Improvement of the experiment
  Sufficient: The student discusses the observations and contributes suggestions for improvements. ("Observe more burning candles.")
  Good: The student discusses different interpretations of the observations and suggests improvements. ("Remove the wick and relight the candle.")
  Better: The student discusses well-founded interpretations of the observations, whether they are reasonable, and based on these suggests improvements which allow new questions to be investigated. ("Heat a small amount of stearic acid and try to light the gas phase above it.")

Explanations
  The student presents theoretically developed and well-founded explanations. ("All phase changes from solid to liquid or liquid to gas need energy.")

Relate
  The student discusses the occurrence of the observed phenomena in everyday life, their uses, and their impact on environment, health and society. ("The phase change from liquid to gaseous phase cools you down when you are sweating.")

Discuss
  The student uses the experiment as a model and discusses the occurrence of the phenomena studied in society, making statements about consequences based on facts and complex physical relations and theories. ("The phase change from liquid to gaseous phase cools you down when you are sweating.")
WORKSHOP
We had prepared four similar activities, all with the same materials — a candle, a wick and a matchbox — but with different purposes. The activities represented studies of mass transfer, energy transformation, technical design, and phase changes. At the workshop three groups were formed, omitting the study of technical design. The three groups were not informed about the differences between the aims of their experiments. The groups were constructed to include people with as varied backgrounds as possible: participants from one specific country, or from similar fields such as chemistry or physics, were allocated to different groups. They performed practical tasks similar to those used by teachers working at different levels, assessed the performance and evaluated the learning outcome of the activity. Within each group one person was selected to assess the activities the others carried out. The assessor was asked to focus not only on the results of the discussions within the group but also to try to evaluate the process, as the aim was to assess the skills of the participants rather than the content of their knowledge.
Discussion
The aim was to demonstrate how peer review within the group may be used to produce several kinds of information beneficial for performance assessment in school science education. Discussions arose among the participants about how an integrated approach, especially in relation to other school subjects, improved the usefulness of the methods. Learning by doing, followed by discussion, became the major part of the workshop, with sharing of ideas and suggestions for further development.
Most of the participants had little prior knowledge of the assessment of practical skills; they expressed their astonishment at the positive result of the workshop and their curiosity to use the method. Some of the participants also showed didactic skills when explaining to the others the different aspects of the experiment they had mastered, a good example of the importance of variation in the skills of group members.
The persons who made the assessments expressed the need for further practice. They realized the complexity of assessing different skills at the same time as assigning a grade. They also expressed a wish to develop this ability, as they recognized the strength of assessing several skills on one occasion. Further, the participants noted the importance of questions like the last one in the instructions (Appendix) for assessing the quality of the relation between theory and practice.
Conclusion
Although based on a simple experiment with a burning candle, the workshop gave an opportunity to discuss and understand theories regarded as difficult to understand from the viewpoint of the student, or difficult to teach from that of the teacher. The experiments, although similar, were of different character, thus reflecting a wide spectrum of possibilities. The activities performed may therefore be seen as models, or examples, from which new assessments can be developed according to the content of the subject.
REFERENCES
Arter, J. A. & McTighe, J. (2001). Scoring rubrics in the classroom. Corwin.
Audesirk, T., Audesirk, G., & Byers, G. B. (2008). Life on Earth, 5th ed. San Francisco: Pearson Education.
Busching, B. (1998). Grading inquiry projects. New Directions for Teaching and Learning, 74, 89-96.
Chanock, K. (2000). Comments on essays: do students understand what tutors write? Teaching in Higher Education, 5(1), 95-105.
Hewitt, P. G., Suchocki, J. & Hewitt, L. A. (2008). Conceptual physical science, 4th ed. San Francisco: Pearson Education.
Jönsson, A. (2011). Lärande bedömning [Learning assessment]. Gleerups.
Lea, M. R. & Street, B. V. (2006). The academic literacies model: theory and applications. Theory into Practice, 45(4), 368-377.
Reece, J. B., Urry, L. A., Cain, M. L., Wasserman, S. A., Minorsky, P. V. & Jackson, R. B. (2011). Campbell Biology, Global Edition. Pearson.
Sadler, D. R. (1987). Specifying and promulgating achievement standards. Oxford Review of Education, 13(2), 191-209.
Skolverket (Swedish National Agency for Education). (2010). Curriculum for the compulsory school, preschool class and the recreation centre 2011. Skolverket.
Trefil, J. & Hazen, R. M. (2010). The Sciences: An Integrated Approach. Wiley.
Eurasian Journal of Mathematics, Science & Technology Education, 8(1).
APPENDIX
INQUIRY OF A BURNING CANDLE
BACKGROUND
The rapid advancement of scientific knowledge and technology, and the widespread demands on young people to participate actively in solving problems in almost every aspect of our lives, have reoriented the role of education in general and of science teaching in particular. Nowadays, science teaching aims at developing scientifically literate people
with flexible thinking skills and an ability to participate critically in meaningful
discourse. More specifically, it aims at helping students acquire positive attitudes towards
learning and science, a variety of experiences, conceptual understanding, epistemological
awareness, practical and scientific skills and creative thinking skills (Constantinide,
Kalyfommatou & Constantinou, 2001).
The definitions of systems thinking described in the literature (e.g., Senge, 1990; Thier &
Knott, 1992; Booth Sweeney, 2001; Ben-Zvi Assaraf & Orion, 2005) include thinking
about a system, meaning a number of interacting items that produce a result over a period
of time. According to the Benchmarks for Science Literacy (AAAS, 1993), systems thinking is an essential component of higher order thinking, whereas Kali, Orion and
RESEARCH METHODOLOGY
Systems Thinking Assessment (STA): purpose and specifications
The STA will be used to measure the quality of thinking about systems by children aged
10-14 and the effectiveness of curricula designed to promote systems thinking. It consists
of multiple-choice items in the context of everyday phenomena, familiar to the children
of the specific age range. The stems of the items include a scenario and children are
asked to choose the best possible answer, amongst four alternatives.
Multiple-choice items have advantages and disadvantages. Provided the items are well constructed, grading a multiple-choice test is objective, since any grader would mark an item in the same way. Moreover, many items can be administered in a short time, which allows the content domain under study to be covered sufficiently. Such items are also more reliable than other forms of question: in a possible readministration of a test, a subject is more likely to produce the same answers if the questions are multiple choice than if they are open-ended. A basic disadvantage of multiple-choice questions is that they do not provide much information on the subjects' thinking processes, namely the reasons for which they answer each item the way they do. Nevertheless, the procedure of the test's development and the Rasch analysis minimize the effect of this disadvantage on the results.
In order to be able to make generalizations, there was an intentional effort to include items that draw on various systems: physical-biological systems (such as the water cycle, a forest, a dam or food webs), mechanical-electrical systems (such as a bicycle or a car) and socioeconomic systems (such as a family, a village or a store). Moreover, where possible, a picture or a diagram was added to an item's wording, so as to make the item clearer and the test more visually appealing.
We have adopted the following operational definition of systems thinking, which relies
on four strands:
(a) "System definition" includes identifying the essential elements of a system, its temporal boundaries and its emergent phenomena. (b) "System interactions" includes reasoning about causes and effects when interactions are taking place within the system. (c) "System balance" refers to the ability to recognize the relation between interactions and the system's balance. (d) "Flows" refers to reasoning about the relation of inflows and outflows in a system and recognizing cyclic flows of matter or energy.
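To make the operational definition concrete, one might tag each item in the bank with the strand it targets and aggregate per-strand scores. This is only an illustrative sketch: the item identifiers and the tagging scheme below are invented, not taken from the STA.

```python
from enum import Enum

class Strand(Enum):
    # The four strands of the operational definition adopted in the text
    SYSTEM_DEFINITION = "elements, temporal boundaries, emergent phenomena"
    SYSTEM_INTERACTIONS = "causes and effects within the system"
    SYSTEM_BALANCE = "relating interactions to the system's balance"
    FLOWS = "inflows/outflows and cyclic flows of matter or energy"

# Hypothetical item-bank tagging: each multiple-choice item targets one strand.
item_strands = {
    "dam_item_03": Strand.FLOWS,
    "forest_item_11": Strand.SYSTEM_BALANCE,
}

def strand_scores(responses, item_strands):
    """Aggregate 0/1 item scores into a mean score per strand.
    `responses` maps item id -> dichotomous score."""
    totals = {}
    for item, score in responses.items():
        strand = item_strands[item]
        correct, n = totals.get(strand, (0, 0))
        totals[strand] = (correct + score, n + 1)
    return {strand: correct / n for strand, (correct, n) in totals.items()}
```

A per-strand profile like this would let a curriculum evaluation report, say, strong "Flows" reasoning alongside weak "System balance" reasoning for the same student.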
[Figure: Sources of evidence in the validation of the Systems Thinking Assessment (abilities and items) — literature and experts (content validity), educators (face validity), and students via test administrations and interview data (construct, criterion and face validity, and reliability).]
The STA has already undergone its first cycle of development. Reviewing the literature led to 13 abilities that seemed to define systems thinking. The original items were developed and administered to a small number of 10-year-old students. Qualitative and quantitative data led to modifications (content and wording changes) and the development of new items. Two experts gave feedback on the test's content validity. Further improvements were carried out, and two educators with experience with children aged 10-14 examined the face validity of the test. The revised version was once again administered to a small number of 10-year-old students and, after the necessary modifications, the final form of the test, with 52 multiple-choice items, was administered to 900 students. Rasch modeling led to a scale showing item difficulty and student ability.
Based on a broader literature review and the development of separate examples for each ability, the second development cycle began by revising the 13-ability schema, reducing the abilities to 10 and the items to 41. The revised test was given to approximately 90 students aged 10-14. Test and item difficulty indices, item discrimination indices and frequencies were calculated, and items were either modified or replaced. Afterwards, 16 students participated in interviews, answering the items while following a think-aloud protocol (Ericsson & Simon, 1998). Non-effective items were replaced or modified.
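The classical indices mentioned here can be computed directly from dichotomous scores. A minimal sketch, assuming 0/1 item scores and the common upper-lower 27% grouping for discrimination (the exact formulas the authors used are not given in the text):

```python
def item_difficulty(item_scores):
    """Classical difficulty index (p-value): proportion of examinees
    answering the item correctly, so higher means easier."""
    return sum(item_scores) / len(item_scores)

def item_discrimination(item_scores, total_scores, frac=0.27):
    """Upper-lower discrimination index D: difficulty in the top `frac`
    of examinees (by total test score) minus difficulty in the bottom
    `frac`. Values near 1 mean the item separates strong from weak."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, int(len(order) * frac))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower
```

An item answered correctly only by the strongest examinees would get a difficulty near the pass rate and a discrimination close to 1; a near-zero or negative D is the kind of signal that would lead to an item being "modified or replaced".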
The latest version of the items is under evaluation by independent experts. Graduate and PhD students in learning in science, academics specialized in science teaching or psychology, and international researchers with experience in measuring systems thinking will provide feedback on the test by solving it first and then judging its efficiency against a structured protocol. Finally, an expert panel will be formed, during which any problems will be discussed until the panel reaches consensus. The revised test will be given to four educators to evaluate its face validity. The test will then be administered to 100 students aged 10-14 to statistically assess its clarity and its developmental validity. The improved test will finally be administered to 500 students and the data will be analyzed using Rasch modeling. Confirmatory factor analysis will be carried out in order to assess the 10-ability structure of the construct.
RESULTS
At the final stage of the first cycle of the STA development, the test was administered to about 900 students. The Rasch statistical model provided a scale for the 52 items of the STA, on which both the subjects' scores and the items' degree of difficulty are presented (Figure 2). It is evident that the 52 items of the test fit the model well. Both the students' scores and the items' degree of difficulty are distributed uniformly on the scale. Students' scores vary between -2.16 and 2.37 logits, whereas the items' degree of difficulty varies between -2.41 and 2.53 logits.
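For readers unfamiliar with the logit scale used here, a toy joint maximum-likelihood fit of the dichotomous Rasch model shows how person abilities and item difficulties end up on the same scale. This is only a sketch on invented data; the actual analysis would use dedicated software such as Quest (Adams & Khoo, 1993).

```python
import math

def rasch_p(theta, b):
    # Dichotomous Rasch model: probability that a person of ability theta
    # (logits) answers an item of difficulty b (logits) correctly
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_rasch(responses, iters=500, lr=0.5):
    """Crude joint maximum-likelihood estimation by gradient ascent.
    responses[i][j] is 1 if person i answered item j correctly, else 0."""
    n_persons, n_items = len(responses), len(responses[0])
    thetas = [0.0] * n_persons
    bs = [0.0] * n_items
    for _ in range(iters):
        for i in range(n_persons):
            g = sum(responses[i][j] - rasch_p(thetas[i], bs[j])
                    for j in range(n_items))
            thetas[i] += lr * g / n_items
        for j in range(n_items):
            g = sum(rasch_p(thetas[i], bs[j]) - responses[i][j]
                    for i in range(n_persons))
            bs[j] += lr * g / n_persons
        # anchor the scale: centre item difficulties at 0 logits
        mean_b = sum(bs) / n_items
        bs = [b - mean_b for b in bs]
        thetas = [t - mean_b for t in thetas]
    return thetas, bs
```

Items answered correctly by fewer examinees come out with higher difficulty logits, while persons with more correct answers come out with higher ability logits, which is exactly what the item-person map in Figure 2 displays.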
Table 1
Statistical values for the 52 STA items for the whole sample and the four groups

Statistical indices                 Total     5th gr.   6th gr.   1st gr.   2nd gr.
                                    sample    primary   primary   second.   second.
                                    (n=848)   (n=219)   (n=249)   (n=137)   (n=243)
Mean (items*)                        0.00      0.00      0.00      0.00      0.00
     (persons)                      -0.01     -0.30     -0.05      0.14      0.21
Standard deviation (items)           0.97      0.96      0.97      1.08      1.03
                   (persons)         0.72      0.66      0.73      0.73      0.68
Separability** (items)               0.99      0.97      0.98      0.96      0.98
               (persons)             0.81      0.77      0.81      0.81      0.78
Mean infit mean square (items)       1.00      1.00      1.00      1.00      1.00
                       (persons)     1.00      1.00      1.00      1.00      1.00
Mean outfit mean square (items)      1.01      1.02      1.02      1.01      1.01
                        (persons)    1.01      1.02      1.02      1.01      1.01
Infit t (items)                     -0.12     -0.13     -0.03      0.00      0.04
        (persons)                   -0.04     -0.07     -0.03     -0.02     -0.01
Outfit t (items)                     0.09      0.05      0.09      0.05      0.08
         (persons)                   0.02      0.04      0.03     -0.01      0.02

*L=52 items
**Separability: a value of 1 shows great reliability, whereas a value of 0 shows very little reliability.
Table 1 shows the statistical values from the Rasch model for the whole sample and the four subgroups (5th and 6th primary grades and 1st and 2nd secondary grades) separately. It is evident that, for the whole sample and the subgroups, item reliability values are over .95, whereas subject reliability values are over .76. Although the generally accepted values for such a scale are over .90 (Wright, 1985), the subject reliability may be accepted. Furthermore, the mean infit mean square for both items and subjects equals 1 for the whole sample and the subgroups, while the mean outfit mean square is either 1.01 or 1.02. Infit t and outfit t range from -0.13 to 0.09. The subjects' standard deviation is rather small (SD=0.72), indicating uniformity in the sample's behavior: students aged 10-14 respond to the STA as a homogeneous group. Besides, the subjects' mean score increases with age, suggesting developmental validity of the test. Rasch analysis also showed that the items receive infit values from .87 to 1.18, which fit the generally accepted range of .77-1.30 (Adams & Khoo, 1993). Three of the items have an outfit value over 1.30, but since the difference between infit and outfit values for these items is small, they remain in the test.
This is an on-going study and, at the moment, the test is in its second cycle of development. Test administrations, interviews with students, and feedback from experts and educators provide data to validate the items. The way the data from each stage were analyzed is indicated in Tables 1 and 2, presented in the next subchapter. At the end of the second cycle, Rasch analysis as well as confirmatory factor analysis will be conducted and the results will be published.
[Table showing the development of an item across the stages of the first and second cycles (pre-pilot, pilot, final administration, second-cycle pre-pilot, and two sets of interviews); its layout did not survive extraction. It recorded the response frequencies for each alternative (e.g. A 0.56, B 0.19, C 0.00, D 0.25), feedback from experts and educators, and the resulting actions, such as "keep as is", "revise distractor", "change wording of the stem", "change wording of main body and alternatives", and "change alternative content".]
Table 3
The development of the apple tree item

[The layout of this table did not survive extraction. It traced the item through the first cycle (pre-pilot with translation into English, expert and educator review — both "keep as is" — pilot and final administration) and the second-cycle pre-pilot, recording the response frequencies for the alternatives (e.g. 0.38, 0.31, 0.13, 0.19; B 0.18, C 0.25, D 0.07) and the resulting actions, including "keep as is" and "item replaced".]
CONCLUSION
Systems thinking is a higher order skill, important in dealing with everyday phenomena and in solving problems. At the same time, science is a field offering plenty of systems to analyze and model. Despite the widespread research on curriculum development for systems thinking, no validated tests have been developed to evaluate the effectiveness of such curricula. The STA is being developed following a cyclic and iterative procedure. It aspires to be a useful instrument for assessing curricula designed to promote systems thinking in upper-primary and lower-secondary school students.
REFERENCES
Adams, R. J. & Khoo, S. T. (1993). Quest: The Interactive Test Analysis System. Camberwell, Victoria: ACER.
American Association for the Advancement of Science (1993). Benchmarks for science literacy. New York: Oxford University Press.
Ben-Zvi Assaraf, O. & Orion, N. (2005). Development of system thinking skills in the context of Earth system education. Journal of Research in Science Teaching, 42(5), 518-560.
Booth Sweeney, L. (2001). When a butterfly sneezes. Waltham: Pegasus Communications.
Constantinide, K., Kalyfommatou, N. & Constantinou, C. P. (2001). The development of modeling skills through computer based simulation of an ant colony. In Proceedings of the Fifth International Conference on Computer Based Learning in Science, July 7-12, 2001, Masaryk University, Faculty of Education, Brno, Czech Republic.
Ericsson, K. A. & Simon, H. A. (1998). How to study thinking in everyday life: contrasting think-aloud protocols with descriptions and explanations of thinking. Mind, Culture, and Activity, 5, 178-186.
Hmelo-Silver, C. E. & Pfeffer, M. G. (2004). Comparing expert and novice understanding of a complex system from the perspective of structures, behaviors, and functions. Cognitive Science, 28, 127-138.
Kali, Y., Orion, N., & Eylon, B. (2003). The effect of knowledge integration activities on students' perception of the Earth's crust as a cyclic system. Journal of Research in Science Teaching, 40, 545-565.
Riess, W., & Mischo, C. (2009). Promoting systems thinking through biology lessons. International Journal of Science Education, 1-21.
Senge, P. (1990). The Fifth Discipline: The Art and Practice of the Learning Organization. New York: Doubleday.
Sheehy, N., Wylie, J., McGuinness, C. & Orchard, G. (2000). How children solve environmental problems: using computer simulations to investigate systems thinking. Environmental Education Research, 6(2), 109-126.
Thier, H. D. & Knott, R. C. (1992). Subsystems and Variables. Teacher's guide, Level 3, Science Curriculum Improvement Study. Hudson: Delta Education.
INTRODUCTION
Despite everyday experience with light, understanding geometrical optics turns out to be difficult for students. Physics education research shows that students hold numerous conceptions about optics which differ from scientifically adequate concepts (Duit, 2009). Alternative conceptions are very stable: research shows that formal instruction is frequently not able to transform them into scientifically accepted ideas (Andersson & Kärrqvist, 1983; Fetherstonhaugh & Treagust, 1992; Galili, 1996; Langley et al., 1997). Teachers' knowledge about their students' learning difficulties is one important prerequisite for the design of successful instruction. Exploring students' conceptual knowledge base can provide important feedback: it can support students in their individual learning process and can serve as a basis for further teaching decisions.
In general, two main methods are used for examining students' conceptual knowledge: interviews and open-ended questionnaires. The most informative of these, interviews, are very time-consuming and difficult for teachers to handle in classroom situations. In search of a way out of this dilemma, we encountered the method of two-tier tests as used by, e.g., Treagust (2006) and Law & Treagust (2008). "Two-tiered test items are items that require an explanation or defence for the answer [...] (see Wiggins and McTighe 1998, p. 14)" (Treagust, 2006). Each item consists of two parts, called tiers. The first part of the item is a multiple-choice question whose distractors include known student alternative conceptions. In the second part of each item, students have to justify the choice made in the first part by choosing among several given reasons (Treagust, 2006).
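A common scoring convention for such items — awarding credit only when both tiers are correct — can be sketched as follows. The option labels and the conception map below are invented for illustration and are not taken from the papers discussed, which may score their items differently.

```python
def score_two_tier(answer, reason, key):
    """key = (correct answer option, correct reason option).
    The item earns a point only when both tiers are correct."""
    return int((answer, reason) == key)

def diagnose(answer, reason, conception_map):
    """Optionally map an (answer, reason) pair to a documented alternative
    conception -- the diagnostic payoff of the two-tier format."""
    return conception_map.get((answer, reason))
```

With a key of, say, ("B", "3"), choosing the right answer for the wrong reason scores 0, and the chosen pair can be looked up in a researcher-supplied map of known alternative conceptions.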
Research on alternative conceptions in optics has mainly used interviews or questionnaires with open answers (Andersson & Kärrqvist, 1983; Driver et al., 1985; Guesne, 1985; Viennot, 2003). In addition, multiple-choice tests have been developed (Bardar et al., 2006; Chen et al., 2002; Chu et al., 2009; Fetherstonhaugh & Treagust, 1992). These tests focus on various age groups and on different content areas within geometrical optics. We have, however, not found a psychometrically valid test instrument designed to portray basic conceptions in geometrical optics among students at the lower secondary level. Our main research objective is the development of a multiple-choice test instrument for year-8 students which is able to portray the students' conceptions in geometrical optics.
classes and thus had 8 different physics teachers. Our sample covered all types of
school available in Austria at year-8 level.
The interviews were conducted in the school setting. Each student was interviewed
individually. The average duration of the interviews was 19.5 minutes.
METHOD
We carried out semi-structured, problem-based interviews (Lamnek, 2002; Mayring, 2002;
Witzel, 1985). The interviews were based on seven selected items of the second test
version. The students were given just the item task, without any distractors. The interview
followed a four-step structure for each item. The students had to:
Data analysis
The interviews were recorded and transcribed. They were then analysed with
MAXQDA following the method of qualitative content analysis by Mayring (2010) and
Gropengießer (2008).
The data were analysed with respect to three main categories: language issues, the forms of
visual representation used, and students' conceptions related to the content of the items. As
far as language issues are concerned, we were interested in how students interpreted the task
of the item on the basis of the given text. Additionally, we tried to identify unfamiliar words
and expressions as well as overly long or complicated sentences.
For the visual representations, our main aim was to find out whether the students were able to
grasp the content or the situation represented in visual form.
The final category, students' conceptions, served to analyse the response space
for the problems posed and thus to obtain a good overview of students' conceptions
related to them.
FINDINGS
The findings presented here are results of the empirical testing of the second test version
(N = 376). The reliability of the test was established by a Cronbach's alpha coefficient of
α = 0.77. An overview of the test and item statistics for the 20 two-tier items is
given in figure 2.
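For reference, Cronbach's alpha can be computed directly from a students × items score matrix. The sketch below uses invented toy scores, not the study's data; with dichotomously scored two-tier items, each entry would be 1 only when both tiers are correct.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student item-score lists
    (e.g. 0/1 scores for two-tier items)."""
    k = len(scores[0])                      # number of items
    def var(xs):                            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([s[i] for s in scores]) for i in range(k)]
    total_var = var([sum(s) for s in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data: four students, three items.
data = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]]
print(round(cronbach_alpha(data), 3))  # 0.632
```

Values around 0.7 or above, as reported here, are conventionally taken as acceptable internal consistency for group-level research instruments.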
Figure 3. One-tier item of test version two concerning the key idea of continuous
propagation of light
For those students who indicated in the first tier that they supposed a different propagation
distance of light from the campfire during day and night, we obtained six different categories
of reasons, as shown in figure 4.
Figure 4. Reasons for a different propagation distance of light from a campfire during day
and night
Each of these categories was translated back into students' language, taking either a student
statement directly from the interviews or modifying one slightly in order to
fulfil psychometric guidelines for distractor construction. This procedure led to the second
tier for this item, as presented in figure 5.
Figure 5. Two-tier item of test version two concerning the key idea of continuous
propagation of light
CONCLUSION
In conclusion, the analysis of the second test version showed that its two-tier items
are well able to portray several types of students' conceptions known from the literature. On
the other hand, the results indicated that some items still needed revision and improvement.
The results obtained from the interviews were integrated into the third test version,
which still needs to be tested.
REFERENCES
Andersson, B., & Kärrqvist, C. (1983). How Swedish pupils, aged 12-15 years, understand
light and its properties. IJSE, 5(4), 387-402.
Bardar, E. M., Prather, E. E., Brecher, K., & Slater, T. F. (2006). Development and validation of
the light and spectroscopy concept inventory. Astronomy Education Review, 5, 103.
Chu, H. E., Treagust, D., & Chandrasegaran, A. L. (2009). A stratified study of students'
understanding of basic optics concepts in different contexts using two-tier multiple-choice items. RSTE, 27, 253-265.
Colin, P., Chauvet, F., & Viennot, L. (2002). Reading images in optics: students' difficulties
and teachers' views. IJSE, 24(3), 313-332.
Driver, R., Guesne, E., & Tiberghien, A. (Eds.) (1985). Children's ideas in science.
Buckingham: Open University Press.
Duit, R. (2009). Bibliography STCSE: Students' and teachers' conceptions and science
education. Retrieved October 20, 2009.
Duit, R., & Treagust, D. F. (2003). Conceptual change: a powerful framework for improving
science teaching and learning. IJSE, 25(6), 671-688.
BACKGROUND
The European Parliament and Council (2006) identified and defined the key
competencies necessary for personal fulfillment, active citizenship, social inclusion
and employability in our modern-day society. These include communication skills
in both the mother tongue and foreign languages; mathematical, scientific, digital and
technological competencies; social and civic competencies; cultural awareness and
expression; entrepreneurship; and learning to learn. These key competencies formed
the foundation for the approach that our European Framework 7 (EUFP7) project,
Strategies for Assessment of Inquiry Learning in Science (SAILS), took to
developing, researching and understanding how teachers might strengthen their
teaching of inquiry-based science education.
Since the Rocard Report (2007) recommended that school science teaching should
move from a deductive to an inquiry approach to science learning, there have been
several EUFP7 projects, such as S-TEAM, ESTABLISH, Fibonacci, PRIMAS and
Pathway, whose remit has been to support groups of teachers across Europe in
bringing about this radical change in practice. These projects have been successful in
highlighting the importance of IBSE across Europe. They have also enabled us to
determine the range of understanding of what the term inquiry means to teachers
across Europe, and to establish to what extent the skills and competencies
developed through inquiry practices have been identified. The term inquiry has
figured prominently in science education, yet it refers to at least three distinct
categories of activities: what scientists do (e.g., conducting investigations using
scientific methods), how students learn (e.g., actively inquiring through thinking
and doing into a phenomenon or problem, often mirroring the processes used by
scientists), and a pedagogical approach that teachers employ (e.g., designing or
using curricula that allow for extended investigations) (Minner et al., 2009).
Inquiry-based science education (IBSE) has proved its efficacy at both primary and
secondary levels in increasing children's and students' interest and attainment levels
(Minner et al., 2009; Osborne et al., 2008), while at the same time stimulating teacher
motivation (Wilson et al., 2010). One area that has remained problematic for teachers,
and is cited as one of the areas limiting the development of IBSE within schools, is
assessment (Wellcome, 2011). The EUFP7 project Strategies for Assessment of
Inquiry Learning in Science (SAILS) aims to prepare science teachers not only to be
able to teach science through inquiry, but also to be confident and competent in the
assessment of their students' learning through inquiry. The literature on teacher
change suggests that teacher change is a slow and often difficult process, never
more so than when the initiative requires teachers to review and change their
assessment practices (Harrison, 2012).
Part of the reason for this slow implementation of IBSE in science classrooms is the
time lag between introducing ideas and the training of teachers at both
in-service and pre-service level. While this situation should improve over the next few
years, there is a fundamental problem with an IBSE approach, and this lies with
assessment. While the many EU IBSE projects have produced teaching materials,
they have not produced support materials to help teachers with the assessment of this
approach. Linked to this is the low number of IBSE-type items in national and
international assessments, which sends teachers the message that IBSE is not
considered important in terms of skills in science education. It is clear that an
assessment model and support materials to help teachers assess IBSE learning in
their classrooms are needed if this approach is to be further developed and
sustained in classrooms across Europe.
Inquiry Skills
Inquiry skills are what learners use to make sense of the world around them. These
skills are important both to create citizens who can make sense of the science in the
world they live in, so that they can make informed decisions, and to develop scientific
reasoning in those undertaking future scientific careers or careers that require the
logical approach that science encourages. An inquiry approach not only helps
youngsters develop a set of skills, such as critical thinking, that they may find useful in
a variety of contexts; it can also help them develop their conceptual understanding of
science, and it encourages students' motivation and engagement with science.
As noted above, the term inquiry refers to at least three distinct categories of
activities: what scientists do, how students learn, and a pedagogical approach that
teachers employ (Minner, 2009). However, whether it is the scientist, student, or
teacher who is doing or supporting inquiry, the act itself has some core components.
Inquiry-based science education is an approach to teaching and learning science that is
conducted through the process of raising questions and seeking answers (Wenning,
2005, 2007). An inquiry approach fits within a constructivist paradigm in that it
requires the learner to take note of new ideas and contexts and to question how these fit
with their existing understanding. It is not about the teacher delivering a curriculum
of knowledge to the learner, but rather about the learner building an understanding
through guidance and challenge from their teacher and from their peers.
In our view, these inquiry skills are developed and experienced through working
collaboratively with others and so communication, teamwork, and peer support are
vital components of inquiry classrooms.
Within an inquiry culture there is also a clear belief that student learning outcomes are
especially valued. One characteristic of inquiry learning is that students are fully
involved in the active learning process. Students who are making observations,
collecting data, analyzing data, synthesizing information, and drawing conclusions are
developing problem-solving skills. These skills fully incorporate the basic and
integrated science process skills necessary in scientific inquiry. In England, there has
been a move to support more practical work in science classrooms through the Get
Practical project (Abrahams et al., 2011). This project has worked through the
Association for Science Education and the National Science Learning Centre and
supported primary and secondary teachers in 30 schools in developing their practice
through practical work, resulting in observable changes in the emphasis given to
practical work in schools and also in improvements in the learning of science
concepts. The findings of this project also included an important caveat: what was
required was more than simply being aware of the Get Practical message.
They found that teachers needed to plan scaffolding (Wood et al., 1976) in order for
their learners to be guided towards viewing scientific phenomena in a way similar to
how their teachers perceive them (Ogborn et al., 1996; Lunetta, 1998). Such an approach
requires teachers to take note of what their learners struggle with and then to plan
and implement teaching that helps their pupils improve. In other words, the approach
that teachers need to take is formative.
A second characteristic of inquiry learning is that students develop lifelong skills
critical to thinking creatively, as they learn how to solve problems using logic and
reasoning. These skills are essential for drawing sound conclusions from experimental
findings. While many projects have focused on evaluating the conceptual
understanding of science principles developed, there is a clear need to evaluate other
key learning outcomes, such as process and other self-directed learning skills, with
the aim of fostering interest, social competencies and openness to
inquiry so as to prepare students for lifelong learning. This has been the aim of many
of the EUFP7 projects so far, and central to this approach are teamwork and
collaborative behavior. So the move to implement more IBSE-type learning across
Europe has been successful in terms of raising awareness of the importance of this
approach, but the introduction of these ideas into mainstream teaching and learning
has been taken up less readily.
In many schools, we know that science practicals are generally presented as recipes to
follow so that students experience scientific phenomena. This approach means that
the raising of questions about phenomena lies with the teacher rather than the student.
So, in most science practicals, the student's role is limited to simply collecting and
presenting data that are then made sense of by the teacher. This approach to practical
work is unlikely to aid conceptual understanding or the development of inquiry skills
beyond practice of a limited number of skills.
Assessment Approach
The Strategies for Assessment of Inquiry Skills in Science Project (SAILS) consists
of 14 partners from across Europe and is currently in its second year of development.
The prime aim of this project is to produce and trial assessment models and materials
that will help teachers assess inquiry skills in the classroom. At the centre of this work
is Assessment for Learning. Two of the lead members of the King's College London
team, Chris Harrison and Paul Black, have been working with a pilot group of 16
expert science teachers to develop the first round of materials for the project. The
materials produced are then being trialled in 13 different countries to see how the
approach fits within different cultural contexts. Three topics have been selected for
the first set of materials: Food, Rates of Reaction, and Speed and Acceleration.
Since the formative use of assessment data is essential to drive the pedagogy most
likely to bring about conceptual change in learners, our approach has been first to
strengthen the formative assessment that occurs within inquiry teaching. SAILS
teachers therefore need to recognize and collect the assessment data that arise directly from
inquiry lessons. To do this they need to think carefully about the variety of ways in
which learners might respond to the new ideas, new contexts or challenging
questions being offered. By listening carefully to classroom discussions during inquiry,
to solutions to problems that have arisen during the inquiry, or to group reflections
on an inquiry, teachers can gather evidence of their learners' emerging understanding.
Teachers can note misconceptions, distinguish partly answered questions from full
answers, and recognize errors and possible reasons why such errors are occurring.
Such data are rich in inquiry lessons because the very nature of the approach means
that the lesson is challenging and so understanding is interrogated. The teacher can
then use these assessment data to scaffold the next stage in learning for their students.
Such data place teachers in a good position to sum up progress and to have a
realistic awareness of each learner's understanding by the end of the learning
sequence of activities.
This type of assessment has high validity. It satisfies one of the conditions for validity
in having high reliability, in that the learner is assessed on several different occasions,
thereby compensating for variations in a learner's performance from day to day, and
in several ways, thereby sampling the full range of learning aims. The fact that the
learner has been assessed in contexts interspersed with the learning secures both
coverage and authenticity, particularly because the teacher is able to test
and re-test her interpretations of what the data mean in relation to each individual's
developing understanding. This is radically different from assessing the
learner in the artificial context of the formal test, and it is far more valid: the
teacher can be far more confident in reporting to a parent, to the next teacher of
the learner, or to any others who might want to have and use assessment results,
about the learner's potential both to use and to extend her learning in the future.
FINDINGS
The SAILS pilot so far looks promising. Teachers have reported that they gain far
more evidence of student performance by collecting evidence during the inquiry
activities than from marking reports of the inquiry. They have realized that
only a limited number of skills can be assessed if the evidence is sourced only from
the written report, and that many of the interchanges they witnessed, as students discussed
which inquiry questions were likely to form the inquiry and then how to identify,
select, control and manipulate variables, were much richer in reality than in the
written reports of the investigation. This is because, by the time the students have
produced a final report of the inquiry, the ideas have been through so many iterative
interchanges that they have been reduced to stark statements that do not capture
the students' developing inquiry skills and capabilities. While the written reports indicated
whether the students could or could not identify relevant variables, the ease with
which they could do this, and their competence in justifying one variable as testable
and rejecting another, were far better portrayed during the inquiry than in their written
reports.
Teachers also recognized that, as well as getting a better feel for their students'
capabilities, some areas were better assessed during the inquiry than
by other assessment methods, and they discussed the limitations of both the
previous system of assessing inquiry by coursework and the current
system of assessing inquiry by controlled assessments. This meant that a far wider
range of inquiry skills was assessed than the teachers had previously attempted
when the assessment was focused on the written report of the investigation. The
teachers were especially interested in assessing students' capability to raise
investigable questions, their cooperation and teamwork behaviour, and their resilience
in learning from their mistakes.
However, engaging in more inquiry in their classrooms and assessing in this different
way also caused concerns and dilemmas for the SAILS pilot teachers. These were:
- Teachers were unable to collect data on every student during each inquiry activity.
- Teachers working formatively were unsure what they should report: a student's first attempt, last attempt, or average attempt.
These concerns are compounded by continuing concerns among many of the SAILS pilot
teachers about public and government confidence in teacher assessment, and about how
teachers might communicate to parents and others why and how a more formative
approach can be as robust as the assessment judgments made through
examinations at the end of courses.
The teachers were also able to feed evidence back into their teaching and so
respond formatively to both the needs and the progress of learners. The teachers also
reported that they had begun to see the inquiry capabilities of their learners more
positively than they had done when previously doing practical work with these
youngsters. The teachers were surprised by how well the learners managed to raise
inquiry questions, how innovative the learners could be when not limited to following
a particular path to solving an inquiry problem, and how willing the learners were to learn
from their mistakes while still remaining motivated.
The SAILS pilot teachers reported that they gave far more curriculum time to inquiry
than they had anticipated was possible at the start of the project. After each meeting,
teachers were asked to try, as a minimum, one inquiry project of around an hour. All
16 teachers did considerably more than this, with several teachers doing extended
inquiry projects over several weeks and the majority trying 3-6 inquiries with classes
between January and June. As the teachers gained more confidence with the IBSE
approach, the inquiry activities became more open in their structure and direction, and
several of the teachers reported that this more open approach not only further
motivated learners, it also allowed the teachers to assess the learners on a wider range of
inquiry skills. Certainly, in the first few inquiry activities teachers focused on aspects
of planning or of data collection, whereas in the later, more open activities teachers
felt confident enough also to assess broad-reaching skills such as teamwork and
communication.
CONCLUSION
Work so far on the SAILS project has indicated that teachers are willing to strengthen
their commitment to IBSE by taking a formative assessment approach to
inquiry. The SAILS pilot teachers have demonstrated that they can assess as the
inquiry learning is taking place and then use this assessment data to inform later
stages in the IBSE learning. The formative approach to assessment of inquiry in
science classrooms has encouraged teachers to allow students to do more IBSE-type
work than previously and to take a more open approach to inquiry, and this has
enabled the students to be more innovative in their inquiry approach. In turn, because
the students are expressing a broader range of skills than science teachers
normally observe in general practical work, the teachers have reported that they have
been surprised and pleased by students' inquiry capabilities and willingness to learn
from making mistakes.
Issues relating to public confidence in teacher assessment remain problematic, and
we hope to address them in the coming year, both by looking at how science
teachers in our partner countries across Europe work with these ideas and by
helping the SAILS pilot teachers in England to build an assessment portfolio of
their learners' work in inquiry over the course of the school year.
For more information on the SAILS project see www.sails-project.eu
REFERENCES
Abrahams, I., Sharpe, R., & Reiss, M. (2011). Getting Practical: Improving practical
work in science. Hatfield: Association for Science Education.
Abrahams, I., & Millar, R. (2008). Does practical work really work? A study of the
INTRODUCTION
In the last two decades, investigations in physics teaching at the high school and
undergraduate levels have shown that a majority of science students have difficulty
understanding physics concepts (Hake, 1998; Halloun & Hestenes, 1985). Students often
come to class with firmly held initial misconceptions, and conventional physics instruction
produces only small changes in their conceptual knowledge. The students may know
how to use formulas and solve certain numerical problems, but they still fail to
comprehend the physics concepts. These studies indicate that instruction can
only be effective if it takes students' preconceptions into account. The proper
concepts have to be learned, but the misconceptions also have to be unlearned
(Wagner & Vaterlaus, 2011). This requires the diagnosis of students' concepts and
misconceptions.
We have designed a diagnostic test with the purpose of identifying students'
concepts and misconceptions in kinematics at the high school level. The test is based
on the following list of kinematics concepts:
The list of concepts has been verified by experts and is in good agreement with the
concepts identified in other studies (e.g. Hestenes, Wells & Swackhamer, 1992).
The development of a new kinematics test was necessary because so far no test
exists that measures students' knowledge of each concept separately. The FCI
(Hestenes, Wells & Swackhamer, 1992) and the MBT (Hestenes & Wells, 1992) are
mainly used to evaluate overall dynamics concept knowledge. Both actually contain
items that correspond to the concepts mentioned above, but the number of these
items is too small to analyze each concept separately. The Force and Motion Conceptual
Evaluation (Thornton & Sokoloff, 1998) and the Test of Understanding Graphs in
Kinematics (Beichner, 1993), on the other hand, are based on task-related objectives
rather than on concepts, so their items cannot be clearly linked to the concepts listed
above.
We have analyzed student responses to our kinematics test, addressing the following
questions: Is the test a valid instrument for determining students' concept knowledge in
kinematics? Do the students answer coherently with respect to the suggested concepts?
What are the consequences of the test results for teaching?
In order to treat the first two issues we carried out an exploratory factor analysis similar
to the factor analysis of FCI data done by Scott and Schumayer (2012). Factor
analysis is a standard technique in the statistical analysis of educational data sets and
is described in detail in the literature (e.g. Merrifield, 1974; Bühner,
2011). The goal of a factor analysis is to explain the correlations among the items in
terms of only a few fundamental entities called factors or latent traits. A latent trait is
interpreted as a characteristic property of the students that becomes visible when they
attempt to answer the items. The degree to which a student possesses a particular
trait determines the likelihood of answering a particular item correctly. The items are
thus the manifest indicators of the latent factors. Scott and Schumayer (2012) point out
that it is important to distinguish between "factors" and "concepts": in our context the
concepts are constructs defined by experts, while the factors represent the coherence of
students' thinking. The interesting issue is whether the association of items seen by
an expert agrees with the association of questions seen by students.
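The paper itself uses factor analysis rather than a response model, but the statement that the degree of a latent trait determines the likelihood of a correct answer can be formalized; one standard way, shown here purely as an illustration, is the two-parameter logistic model from item response theory, where the parameter names theta, a and b are the conventional ones, not taken from the paper.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) item response function: probability that
    a student with latent trait theta answers correctly an item with
    discrimination a and difficulty b. Illustrative only; the study itself
    uses exploratory factor analysis, not IRT."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The probability rises with the trait: a student well above the item's
# difficulty almost always answers correctly; at theta == b it is 0.5.
print(round(p_correct(theta=2.0, a=1.5, b=0.0), 2))  # 0.95
print(p_correct(theta=0.0, a=1.5, b=0.0))            # 0.5
```

The monotone dependence of the success probability on the trait is exactly what makes the items "manifest indicators" of the latent factors.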
Referring to the third issue, we suggest how the test can be applied formatively at
school. We show how latent class analysis can identify groups of students with
similar profiles of concept knowledge, and how characterizing these groups can help
the teacher prepare individual material for each group.
The following section describes the methods including the test instrument, the
collection of data and the exploratory factor analysis. Thereafter the results of the
factor analysis are presented and interpreted. The last two sections are devoted to the
application of the instrument at school and to a final discussion of the results.
METHODS
Test Instrument
The kinematics diagnostic test is designed for high school students at level K-10. The
test items are based on the list of concepts presented in the previous section. Every
concept also has a set of corresponding misconceptions. The misconceptions were
verified by asking students open questions and analyzing their answers, and they
were furthermore confirmed by experts.
The test consists of 56 multiple-choice items on kinematics, each containing one
right answer and three to four distractors. Every distractor was chosen so that it can
be assigned to a single misconception. This is different from the other kinematics
tests mentioned above: our test uncovers not only students' concepts but also their
misconceptions. The items can furthermore be divided into three levels of abstraction:
A representative test item for each level of abstraction is presented in the appendix.
Prior to the exploratory factor analysis we empirically verified the data set. In a first
step we discarded all items with a difficulty above 0.85 or below 0.15, because items
with such extreme difficulties do not serve as good discriminators. As a second
step we determined the internal consistency by calculating Cronbach's alpha for each
concept separately. We reviewed the content of every item that did not contribute to the
internal consistency; items considered open to misunderstanding, or referring to
multiple concepts, were dropped.
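The first screening step described above can be sketched as follows; the function name and the toy response data are illustrative, not from the study.

```python
def screen_items(responses, low=0.15, high=0.85):
    """Return the indices of items whose difficulty (proportion of correct
    answers) lies within [low, high]; items outside this band are too easy
    or too hard to discriminate between students."""
    n = len(responses)                      # number of students
    k = len(responses[0])                   # number of items
    keep = []
    for i in range(k):
        difficulty = sum(s[i] for s in responses) / n
        if low <= difficulty <= high:
            keep.append(i)
    return keep

# Toy data: item 0 is answered correctly by everyone (difficulty 1.0) and
# is dropped; items 1 and 2 (difficulty 0.5) survive.
responses = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
print(screen_items(responses))  # [1, 2]
```

Note that "difficulty" here follows the classical-test-theory convention used in the text: it is the proportion correct, so a high value means an easy item.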
Table 1
Distribution of the items. The stars mark items that refer to two concepts. Concepts 4
and 7 (item numbers in brackets) are completely excluded from further analysis.

Items         Level A          Level B    Level C             Total
Concept 1     2 4 5            51 53      35 36 40 46* 47*    10
Concept 2     6 7                         28 42 46* 47*       6
Concept 3     8 9 10                                          3
[Concept 4]   [16 29 37 43]                                   [4]
Concept 5     11 12            54         38 39 48* 49*       7
Concept 6     13 14                       44 48* 49*          5
[Concept 7]   [18 45]                                         [2]
Total         12               3          12                  27
Through this verification process the data set was finally reduced from 56 to 27
items. Table 1 shows the distribution of the remaining test items according to the
concepts and levels of abstraction; the numbers are the item numbers in the test, and
the stars mark items that refer to two concepts. As the Cronbach's alpha
coefficients for concepts 4 and 7 are below 0.30, these concepts were excluded from
further analysis. The Cronbach's alpha values for the other concepts lie between 0.60 and
0.80, and the mean inter-item correlations between 0.21 and 0.41.
Collection of data
We collected data from 56 students in the classes of two teachers at two Swiss high
schools in autumn 2012. The average age of the participants was 16 years, with a
standard deviation of 1 year and a range from 14 to 18 years; 30 participants were
female and 26 male. About half of the students were majoring in economics, the
others in science and languages. Independent of their major subject, all of the students
attended a similar basic kinematics course of about six weeks. The test was
presented online at the end of the instruction. The order of items was the same for all
students; they were required to complete the survey, and no item could be skipped.
The time to answer each item as well as the time to complete the test was recorded
individually. The average overall time for completing the 56 items was (46 ± 8) min.
we decided to carry out the analysis step by step. We first conducted the analysis for the
data of the abstraction levels A, B and C separately. Moreover, we left out
items 46-49, which refer to two concepts. This way the number of items was reduced
to 12, 3 and 8 for levels A, B and C, respectively. Afterwards we checked whether the results
for the different levels were compatible. In order to check whether the set of items was
suitable for an exploratory factor analysis we calculated the Kaiser-Meyer-Olkin (KMO)
coefficients (Cureton & D'Agostino, 1983). The standard rule is that the KMO
coefficient should be above 0.60, and above 0.80 for good results. Our values
ranged from 0.65 to 0.77.
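The KMO measure can be computed directly from a correlation matrix using the standard formula based on partial correlations obtained from the inverted correlation matrix; this sketch assumes NumPy is available and uses a toy matrix, not the study's data.

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy for a correlation
    matrix R: sum of squared off-diagonal correlations divided by that sum
    plus the sum of squared off-diagonal partial correlations."""
    R = np.asarray(R, dtype=float)
    inv = np.linalg.inv(R)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale                  # partial correlation matrix
    np.fill_diagonal(partial, 0.0)
    off = R - np.eye(len(R))                # zero out the diagonal of R
    r2 = (off ** 2).sum()
    p2 = (partial ** 2).sum()
    return r2 / (r2 + p2)

# Toy example: three equally intercorrelated items (r = 0.5).
R = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(round(kmo(R), 3))  # 0.692
```

Intuitively, KMO is high when the observed correlations are mostly shared (explainable by common factors) rather than pairwise-specific, which is why low values argue against factor-analyzing the data.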
Figure 1. Scree Plot. The eigenvalues of the Pearson correlation matrix are depicted
in decreasing order. The knee is between the factors three and four. This suggests a
three-factor model.
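The scree plot in Figure 1 depicts the eigenvalues of the Pearson correlation matrix in decreasing order; a minimal sketch of that computation, assuming NumPy and using invented toy scores, is:

```python
import numpy as np

def scree_eigenvalues(data):
    """Eigenvalues of the Pearson correlation matrix of a students x items
    score matrix, sorted in decreasing order as in a scree plot."""
    R = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    return np.sort(np.linalg.eigvalsh(R))[::-1]

# Toy data: 5 students x 3 items. The eigenvalues always sum to the number
# of items (the trace of the correlation matrix).
data = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0], [1, 0, 0]]
vals = scree_eigenvalues(data)
print(round(float(vals.sum()), 6))  # 3.0
```

The number of factors is then read off where the plotted eigenvalues show a "knee", as the text does in locating it between the third and fourth eigenvalue.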
seem to play only a limited role. Much more relevant for answering the items
correctly is an understanding of the mathematical concepts of rate and vector
(including direction and addition). It is therefore tempting to interpret the underlying
factors as a "rate concept", a "direction concept" and a "vector addition concept". The three
factors are only marginally correlated, meaning that we have three almost independent
factors. The fact that the correlation is strongest between factors 2 and 3 is in
line with our interpretation: these two factors both refer to a vector concept, whereas
factor 1 refers to a rate concept.
Level B items
Level B (tables of values) contains only three items. A factor analysis indicates that a
single factor may be taken as underlying student responses. The factor explains 50 %
of the variation in the data. The loadings of items 51, 53 and 54 on the factor are
0.85, 0.69 and 0.56, respectively. All items thus have high loadings on the factor,
although the loading of item 54 is the lowest. Items 51, 53 and 54 are related to
the rate concepts C1 and C5 (see Tab. 1).
Again there seems to be an underlying "rate concept" which can explain a notable part
of the correlation between items 51, 53 and 54. The fact that item 54 has a lower
loading may be due to its different content: while items 51 and 53 are about velocity,
item 54 probes student understanding of acceleration.
Level C items
Considering the scree plot, we used a two-factor model for the data from the level C
items. The two factors account for 47 % of the variance in the data. All items can be
clearly assigned to one of the underlying factors. The factor loadings range from 0.29
to 0.96. The correlation coefficient between the factors is 0.301.
We find again that the items corresponding to the rate concepts C1 and C5 group into
one factor, whereas the items linked to the direction concepts C2 and C6 group into
another. As for the items with stroboscopic pictures (level A), there seem to be two
underlying factors for solving the diagram items (level C), which may be interpreted
as a rate concept on the one side and a direction concept on the other. Of course, it is
not clear whether the factors found at the two different levels A and C are actually the
same. But again, the understanding of the two basic mathematical concepts of rate and
direction seems to be crucial for the interpretation of diagrams in kinematics. The
correlation coefficient between the factors is again small, indicating that the two
factors are mostly independent of each other.
Overall result
The interesting issue is whether the "rate" factors and the "direction" factors found at
the different abstraction levels are correlated: are these two factors universal for
solving problems in kinematics? In order to investigate this issue we carried out a
factor analysis including all items that loaded on these two factors at levels A, B and
C. The result of this analysis is shown in Table 2. Four factors were detected,
explaining 50.0 % of the total variance in the data set. It is common practice to accept
loadings above 0.3 as indicating a relevant correlation between a particular item and
the underlying factor (Kline, 1994). Therefore, and for better clarity, absolute values
below 0.3 are either hidden or put in brackets if they are important for the
interpretation.
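The display convention of suppressing absolute loadings below 0.3 can be illustrated with scikit-learn's `FactorAnalysis`. Note that this uses maximum-likelihood extraction with a varimax (orthogonal) rotation, which need not match the extraction and rotation used in the study; the data here are synthetic.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
# synthetic two-factor data: items 0-3 load on factor 1, items 4-7 on factor 2
g = rng.normal(size=(300, 2))
X = np.hstack([np.outer(g[:, 0], np.ones(4)),
               np.outer(g[:, 1], np.ones(4))]) * 0.8
X += rng.normal(size=(300, 8)) * 0.6

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(X)
L = fa.components_.T                     # item-by-factor loading matrix (8 x 2)
# suppress |loading| < 0.3, as is common practice (Kline, 1994)
display = np.where(np.abs(L) >= 0.3, np.round(L, 2), np.nan)
print(display)
```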
The first factor groups together the items from levels A and B corresponding to the
rate concepts C1 and C5. With the exception of item 12, which also loads on factor 3,
the loadings are all between 0.59 and 0.99, meaning that these items correlate highly
with the underlying factor. The second factor mainly groups the items from levels A
and B which refer to the direction concepts C2 and C6. However, item 7 loads on all
the factors and cannot be assigned clearly to one factor. Factors 3 and 4 group the
items of level C. Again there is a tendency for the items corresponding to the rate
concepts to contribute to one factor, whereas the items referring to the direction
concepts load on the other factor. The highest factor correlation is between factors 2
and 3, with a value of 0.42. The other correlations are below 0.3.
Table 2
Factor loadings for all factor 1 and factor 2 items of levels A-C.
[Table layout lost in extraction. Columns: Level, Item, loadings on factors 1-4,
corresponding concept (C1: velocity as rate; C5: acceleration as rate; C2: velocity as
vector; C6: acceleration as vector). Items include 3, 11, 12, 51, 53, 54, 13, 14, 28, 35,
36, 38, 39, 40, 42 and 44; loadings range from .27 to .99, with absolute values below
.30 given in brackets.]
The main observation is that we have different factors for the level A/B and level C
items. Obviously, from the students' point of view, the interpretation of diagrams
differs from the interpretation of stroboscopic pictures and tables. There is no direct
transfer between these two representations of motion. Therefore, instead of having
two universal rate and direction factors, we have to distinguish between the levels of
abstraction or, in other words, between the different representations. Overall there
seem to be five different underlying factors that determine whether the items are
answered correctly. We suggest interpreting the factors as follows:
There are some details in the results that need to be discussed. First, item 12 does not
mainly load on factor 1. There is no indication that the item differs from the other
factor 1 items as regards form and content. A possible reason is its high difficulty of
.80. As discussed before, high difficulties usually lead to smaller correlations, in
particular when the sample size is rather small. Item 7 also does not fit well into our
suggested 5-factor model. Obviously, the integration of the level C items into the
factor analysis slightly changes the factor axes such that the loading of item 7 on
factor 2 is lowered. There is no obvious reason why item 7 loads on the factors linked
to the diagrams. We have to recall that the sample is actually too small for the number
of items included in the present factor analysis, so the values have to be interpreted
with caution. Finally, on level C we have items 35 and 36, which no longer load only
on the rate factor but also on the direction factor. This is actually due to item 40: after
removing that item from the analysis, we observed an increase in the loadings of
items 35 and 36 on the rate factor. This shows again that the factor analysis is very
sensitive to small changes when the number of items is large compared to the sample
size. The loadings of items 35, 36 and 40 on both factors 2 and 3 are also the cause of
the noted correlation between factors 2 and 3. There is no obvious reason for this
correlation from a theoretical point of view.
Lastly, we investigated how items 46-49, which can be linked to both the rate concept
and the direction concept, fit into our 5-factor model. All of these items contain a
given kinematics graph (e.g. a velocity-time diagram). The student then has to select
another corresponding diagram (e.g. a position-time diagram). We integrated the
items one by one to check which factor they load on while the factor axes do not
change too much. We found that all these items load on both factors 3 and 4 with
values above 0.3. This is an important finding, as it shows that responses to items
referring to more than one concept can also be explained within our 5-factor model.
There is no indication that new factors emerge for more complex problems.
APPLICATION
We suggest integrating the present test into the basic kinematics course in a formative
way. The test provides detailed feedback for the students as well as for the teacher.
For every student, two diagrams can be prepared, one illustrating the percentage of
items solved correctly for each of the seven concepts and the other showing which
misconceptions are still present. The teacher gets feedback about the overall
performance of the class. Furthermore, by means of a latent class analysis (LCA), the
teacher can find groups of students with similar concept profiles (Collins & Lanza,
2010). This allows the teacher to prepare customized materials for the groups, so that
the students can work on their individual deficits and have the chance to catch up.
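The per-concept feedback diagram described above boils down to a simple per-student aggregation. The item-to-concept mapping and the response vector below are invented for illustration; the real mapping is given in Table 1.

```python
import numpy as np

# hypothetical mapping of item columns to concepts (the real mapping is in Table 1)
concept_items = {
    "C1 velocity as rate": [0, 1, 2],
    "C2 velocity as vector": [3, 4],
    "C5 acceleration as rate": [5, 6],
}

def concept_scores(responses, concept_items):
    """Fraction of correctly solved items per concept for one student (0/1 vector)."""
    return {c: float(np.mean(responses[idx])) for c, idx in concept_items.items()}

student = np.array([1, 1, 0, 0, 1, 1, 1])  # invented 0/1 response pattern
print(concept_scores(student, concept_items))
```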
For illustration, we performed an LCA with the help of the program Mplus (2011).
We included the data of the 27 items shown in Table 1. In order to determine the
optimal number of classes we used a technique similar to the one used for the factors:
instead of plotting the eigenvalues, we plotted the loglikelihood against the number of
classes. By locating the knee in the graph we found four different classes that can be
assigned to four groups of students. The characteristics of the four groups are shown
in Figure 2. The mean score is defined as the group average of the fraction of
correctly solved items corresponding to the particular concept. Even though we did
not include the items referring to the concepts C4 and C7 in the LCA, we plotted their
mean scores for completeness. The four groups can be characterized as follows:
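The latent class analysis itself was run in Mplus. As a rough, self-contained stand-in, a latent class model for binary item responses (a mixture of independent Bernoullis) can be fitted by EM, and the loglikelihood compared across class numbers just as described above. This is a sketch under simplifying assumptions, not the Mplus procedure.

```python
import numpy as np

def lca_fit(X, k, n_iter=300, seed=0):
    """EM for a latent class (Bernoulli mixture) model on a 0/1 matrix X
    (students x items). Returns class weights pi, per-class item-solving
    probabilities theta, and the loglikelihood."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)
    theta = rng.uniform(0.3, 0.7, size=(k, d))
    for _ in range(n_iter):
        # E-step: class responsibilities, computed in the log domain for stability
        log_p = (np.log(pi)[None, :]
                 + X @ np.log(theta).T
                 + (1.0 - X) @ np.log(1.0 - theta).T)
        resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update class weights and item probabilities
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1.0 - 1e-6)
    log_p = (np.log(pi)[None, :]
             + X @ np.log(theta).T
             + (1.0 - X) @ np.log(1.0 - theta).T)
    m = log_p.max(axis=1, keepdims=True)
    ll = float((m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum())
    return pi, theta, ll

# two clearly separated synthetic groups of students (40 each, 10 items)
rng = np.random.default_rng(4)
a = (rng.random((40, 10)) < np.r_[[0.9] * 5, [0.1] * 5]).astype(float)
b = (rng.random((40, 10)) < np.r_[[0.1] * 5, [0.9] * 5]).astype(float)
X = np.vstack([a, b])
lls = [lca_fit(X, k)[2] for k in (1, 2)]  # loglikelihood vs. number of classes
```

On such clearly structured data the loglikelihood jumps sharply from one class to two, which is exactly the kind of knee the procedure above locates.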
DISCUSSION
We have found that two basic mathematical concepts are crucial for the
understanding of kinematics: the concept of rate and the concept of vector (including
direction and addition). The context and the content seem to play only a minor role. If
students understand the concept of rate, they are able to answer questions about
velocity and acceleration correctly in different contexts. The same holds for the vector
concept. This result has direct implications for instruction. It suggests that kinematics
courses should first focus on the learning of the mathematical concepts. Transferring
the mathematical concepts to physical content and applying them in different contexts
is suggested to be easier for students than learning physical concepts without a
mathematical foundation. These findings are somewhat in line with the results of
Christensen and Thompson (2012), who investigated graphical representations of
slope and derivative among third-semester students. In their conclusion they stated
that "some of their demonstrated difficulties [in physics] seem to have origins in the
understanding of the math concepts themselves". Moreover, Bassok and Holyoak
(1989) found similar results when analyzing the interdomain transfer between
isomorphic topics in physics and algebra. Students who had learned arithmetic
progressions were very likely to spontaneously recognize the applicability of the
algebraic methods in kinematics. In contrast, students who had learned the physics
topic first almost never exhibited any detectable transfer to the isomorphic algebra
problems. Finally, it has to be mentioned that even if the understanding of the
mathematical concepts seems to be a requirement for understanding kinematics, it
does not guarantee success (Planinic, Ivanjek & Susac, 2013).
Another interesting finding is that the expert associations of items corresponding to
the concepts C4 and C7 could not be found in the student answers. These items
involve the evaluation of areas under the curve. Obviously, most of the students did
not have a proper area concept. Instead, interviews showed that students often argued
with a concept of average. For example, when they were asked to interpret the
velocity-time diagram of an object with regard to the distance covered, they often did
not consider the area under the curve but tried to estimate the mean velocity. From a
mathematical point of view, finding the mean value is equivalent to determining the
area under the curve and dividing by the interval size. Still, the interviews indicated
that the use of an average concept is accompanied by different misconceptions than
the use of the area concept. All in all, the items corresponding to concept C7 were the
most difficult of the test, as can be seen in Figure 2. These results are in line with the
findings of Planinic, Ivanjek and Susac (2013). They also found that the slope concept
(which we call the rate concept) could be easily transferred from mathematical to
physical contexts, whereas this is not the case for the area-under-the-graph concept.
The transfer of this concept from mathematics to physics was found to be much more
difficult for the students. A possible reason could be that during the teaching of
kinematics the interpretation of the slope is usually emphasized much more than the
interpretation of the area under the graph.
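The equivalence invoked above can be written out explicitly: for a velocity $v(t)$ on an interval of length $\Delta t$, the mean velocity is the area under the curve divided by the interval size, so both routes yield the same distance (a restatement of the text's claim, not an additional result from the paper):

```latex
\bar{v} = \frac{1}{\Delta t}\int_{t_0}^{t_0+\Delta t} v(t)\,\mathrm{d}t
\qquad\Longrightarrow\qquad
s = \bar{v}\,\Delta t = \int_{t_0}^{t_0+\Delta t} v(t)\,\mathrm{d}t .
```

The two strategies differ, however, in how students estimate $\bar{v}$ from the graph, which is where the diverging misconceptions noted above come in.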
As the kinematics test used in this study contains 27 items, a minimum of 270
students would be needed to produce a reliable result by means of a factor analysis.
As we do not meet this requirement (N = 56), the present results are preliminary.
Still, the fact that the association of items given by the experts' assignment to
concepts could be clearly found in the student answers is very promising.
Furthermore, most of the results in this study confirm results from other studies. This
gives hope that the results will be corroborated in a follow-up study with a larger
sample.
REFERENCES
Bassok, M. & Holyoak, K. J. (1989). Interdomain transfer between isomorphic topics
in algebra and physics. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 15(1), 153-166.
Beichner, R. J. (1994). Testing student interpretation of kinematics graphs. Am. J.
Phys., 62(8), 750-762.
Bortz, J. (1999). Statistik für Sozialwissenschaftler (5. Aufl.). Berlin: Springer.
Bühner, M. (2011). Einführung in die Test- und Fragebogenkonstruktion (3. Aufl.).
München: Pearson.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behav.
Res. 1, 245.
Christensen, W. M. & Thompson, J. R. (2012). Investigating graphical
representations of slope and derivative without a physics context. Phys. Rev. ST
Phys. Educ. Res., 8, 023101.
Collins, L. M. & Lanza, S. T. (2010). Latent Class and Latent Transition Analysis.
Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons.
Cureton, E. E. & D'Agostino, R. B. (1983). Factor Analysis: An Applied Approach.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Everitt, B. S. (1975). Multivariate analysis: The need for data, and other problems. Br.
J. Psychiatry 126, 237.
Hake, R. R. (1998). Interactive-engagement versus traditional methods: A
six-thousand-student survey of mechanics test data for introductory physics
courses. Am. J. Phys., 66(1), 64-74.
Halloun, I. A. & Hestenes, D. (1985). The initial knowledge state of college physics
students. Am. J. Phys., 53(11), 1043-1055.
Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The
Physics Teacher, 30, 141-158.
Hestenes, D. & Wells, M. (1992). A mechanics baseline test. Phys. Teach. 30, 159.
Kline, P. (1994). An Easy Guide to Factor Analysis. London: Routledge.
Merrifield, P. R. (1974). Factor analysis in educational research. Rev. Res. Educ. 2,
393.
MPlus (2011). Mplus Version 6.11, Muthén & Muthén. Available from
www.statmodel.com (Dec 16, 2013).
Planinic, M., Ivanjek, L. & Susac, A. (2013). Comparison of university students'
understanding of graphs in different contexts. Phys. Rev. ST Phys. Educ. Res.,
9, 020103.
Scott, T. F. & Schumayer, D. (2012). Exploratory factor analysis of a Force Concept
Inventory data set. Phys. Rev. ST Phys. Educ. Res. 8, 020105.
SPSS (2010). IBM SPSS Statistics, Version 19.0.0 for Mac, SPSS Inc.
APPENDIX
Example 1: Item 14 (Level A, concept C6: acceleration as vector)
A helicopter is approaching for a landing. It moves vertically downwards and reduces
its velocity. Which of the following statements describes the acceleration of the
helicopter best?
1. The acceleration is zero.
2. The acceleration points downwards.
3. The acceleration points upwards.
4. The direction of the acceleration is not defined.
5. The acceleration has no direction.
(Level: 4; source: cf. Beichner)
Example 2 (Level B: table of values)
Two bodies are moving on a straight line. The positions of the bodies at successive
0.2-second time intervals are represented in the table below.

Time in s | Body 1: Position in m | Body 2: Position in m
0.0       | 0.2                   | 0.0
0.2       | 0.4                   | 0.4
0.4       | 0.7                   | 0.8
0.6       | 1.1                   | 1.2
0.8       | 1.6                   | 1.6
1.0       | 2.2                   | 2.0
1.2       | 2.9                   | 2.4
1.4       | 3.7                   | 2.8

(Source: cf. Beichner)
Example 3: Item 28 (Level C, concept C2: velocity as vector)
The following represents a position-time graph (x-t diagram) for an object.
[Figure: x-t graph; axes x and t.]
(Level: 4)
THEORETICAL FRAMEWORK
Science education standards emphasize the importance of experimental skills for
scientific literacy (e.g., KMK, 2005; NRC, 2012). Students' abilities to plan and
carry out experimental investigations are included in evaluations of national
standards as well as in international student assessments (OECD, 2007). Theories
of the experimental process typically distinguish between three phases of
experimenting: preparation (e.g. planning experimental procedures), performance
(e.g. setting up the apparatus) and evaluation (e.g. interpreting results) (cf. Emden,
2011). A test instrument measuring experimental skills should cover all three
phases.
Testing experimental skills has to address several problems, especially in large-scale
assessments. Process-based assessment, analyzing students' actions during hands-on
experiments, is resource-consuming. Supplying standardized apparatus for hands-on
tests poses logistical problems. Paper-and-pencil tests can hardly cover experimental
skills of the performance phase. Thus, paper-and-pencil tests are often narrowed to
the preparation of experiments and the evaluation of data (e.g., Glug, 2009).
Previous studies on the exchangeability of test formats for experimental skills show
only low correlations between students' achievements in paper-and-pencil tests and
their achievements in hands-on experiments (e.g., Shavelson, Ruiz-Primo, &
Wiley, 1999; Stecher et al., 2000; Hammann et al., 2008; Emden, 2011;
Schreiber, 2012). On the other hand, studies indicate that computer simulations
might be valid substitutes for hands-on experiments in tests (cf. Shavelson et al.,
1999; Schreiber, 2012).
Schreiber (2012) found no significant difference between the distributions of
achievement scores gained from computer-based testing with mouse-on
experiments and hands-on testing, whereas the distributions differed significantly
between a paper-and-pencil test and a hands-on test. Schreiber (2012) also found
that broad experimental tasks posed in an open format cause a high dropout rate
during the test, which leads to floor effects and missing data.
Test instrument
The test instrument refers to typical experimental tasks in secondary school
physics instruction. The target group are students at the end of lower secondary
education (aged 14 to 16). The test instrument consists of several units, each
dealing with a specific experimental task. The students have to perform a complete
experimental investigation, i.e. plan the experiment, prepare the setup, perform
the measurements, analyze the experimental data and draw conclusions. To
minimize drop-out caused by comprehensive experimental tasks (Schreiber,
2012), each unit is split up into a sequence of items, each referring to one
experimental skill (e.g. plan the experiment or perform the measurements).
Furthermore, each item starts with a sample solution of the preceding item. Thus,
students' experimental skills can be assessed across the full range of the phases of
an experimental investigation. For instance, students who do not succeed in
assembling an appropriate experimental set-up can still proceed with the
measurement item, because it provides them with a functional set-up.
[Figure: phases of the experimental process (preparation, performance, evaluation)
with associated skills: specify procedure; select from a given set of apparatus; sketch
the set-up; describe the course of action; process data; draw conclusions.]
Figure 2: Sample item "perform and document measurements". Pressing the green
button "Your task" shows the hypothesis which has to be tested. Explanations of
technical terms can be found using the yellow button "Support".
Validation Studies
The development of a new test instrument and a new test format has to be
underpinned by extensive validation studies. Our validation studies include
content analysis, an analysis of students' individual solution strategies, an analysis
of the relationship with external variables and an analysis of the internal test
structure (Wilhelm & Kunina, 2009). Table 1 gives an overview of the research
questions and the methods used for each validation aspect.
In this paper we focus on the content and cognitive aspects of validation.
Table 1
Validation aspects, corresponding research questions and studies to answer the
research questions

Content — Do the tasks represent experiments that students are likely to have seen
or worked on? Are the tasks consistent with typical demands posed in classroom
practices of experimenting? (Studies: analyses of syllabi and schoolbooks; expert
ratings)

Individual strategies (cognitive processes) — Do experimental considerations
dominate in students' thinking while working on the tasks? Do the tasks offer
adequate support to compensate for deficits in physics content knowledge?
(Studies: think aloud, intro- and retrospective)

Relationship with external variables — Is the cognitive load of mouse-on
experimenting comparable to a hands-on test format? Is the mouse-on test
performance a good predictor for performance in hands-on tests? (Studies:
comparative studies in the science education lab, mouse-on vs. hands-on)

Internal test structure — (Studies: large-scale assessment, 400 students per unit)
CONTENT ANALYSIS
Methods
Syllabi and schoolbooks were first analyzed to identify physics content areas
and experimental challenges that are in accordance with the aims and practices of
physics instruction (content validity). The analysis was done in two steps. In
the first step, key terms were identified in an inductive process by going
through curricula and schoolbooks. To ensure comparability across the syllabi of
the 16 German federal states, similar terms were clustered into 35 term groups.
The term group "electrical resistance", for example, includes the terms electrical
resistance, specific resistance, I-U characteristics, Ohm's law, and electrical
conductivity. The quality of this method was verified for the content areas
mechanics, optics, electricity, and thermodynamics. The inter-rater reliability
(Cohen's kappa) of assigning terms to term groups is at least satisfactory (.78 <
κ < .95). In the second step, a criteria-based investigation of the 16 syllabi and of
selected schoolbooks was carried out. The curricula and schoolbooks were
searched for the term groups, differentiating between general occurrence and
occurrence in (explicit) conjunction with an experimental action (preparation,
performance, evaluation). In addition, the syllabi were analyzed with regard to
the grades in which a term group occurs and whether it is obligatory or optional
content. In the schoolbooks, all the experimental tasks referring to a term group were
identified.
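Inter-rater agreement of the kind reported above (Cohen's kappa for assigning terms to term groups) can be computed with scikit-learn's `cohen_kappa_score`; the raters and term-group labels below are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical assignments of eight terms to term groups by two raters
rater1 = ["resistance", "resistance", "optics", "optics",
          "mechanics", "mechanics", "heat", "heat"]
rater2 = ["resistance", "resistance", "optics", "mechanics",
          "mechanics", "mechanics", "heat", "heat"]

# kappa corrects the raw agreement (7/8 here) for chance agreement
kappa = cohen_kappa_score(rater1, rater2)
print(round(kappa, 2))  # -> 0.83
```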
Based on these analyses, 22 suggestions for typical experimental tasks were
generated. 53 experts (experienced teachers) rated, in an online survey, the extent to
which these tasks comply with typical demands posed in classroom practices of
experimenting (four-level Likert scales). For example, we asked the experts how
likely it was that students had had appropriate learning opportunities enabling them
to solve the task. We also asked the experts how likely it was that students could
plan, perform or evaluate just this or a very similar experiment at the end of lower
secondary education.
Evaluating the survey, we ranked the experimental tasks for each content area
separately. Our main criterion was that, according to the experts' estimations, it is
likely or very likely that the students can perform the experimental task.
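The selection rule described above (mean expert rating meeting a threshold on the four-level scale) amounts to a simple filter; the task names and ratings below are invented for illustration.

```python
import numpy as np

# invented expert ratings (1-4 Likert) for three candidate tasks
ratings = {
    "series circuit": [4, 3, 4, 3, 4],
    "inclined plane": [2, 3, 2, 3, 2],
    "image formation by a lens": [4, 4, 3, 3, 3],
}

# keep tasks whose mean rating meets the criterion M >= 3.0
selected = [task for task, r in ratings.items() if np.mean(r) >= 3.0]
print(selected)
```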
Results
Our syllabus analysis confirmed the central role of experiments in physics teaching
(cf. Tesch, 2005). The analysis yielded a high consistency across the 16 federal states
of Germany of the obligatory content (measured by the occurrences of the term
groups) to be dealt with during lower secondary physics instruction. Minor
differences were found with regard to the grade in which the content is taught.
Comparing the experiments presented in the schoolbooks, we were able to identify a
consistent set of widely used topics for student experiments. For 12 of the 22 tasks
our main criterion (M ≥ 3.0 on a scale from 1 to 4) was fulfilled. Especially in the
domains of electric circuits and optics, our tasks comply with typical demands posed
in classroom practices of experimenting. In mechanics, the ratings of two out of four
tasks were not satisfactory; we therefore developed two more tasks for this domain.
As the result of our content analysis, we can build on a set of twelve experimental
tasks with high content validity for the physics domains electric circuits,
geometrical optics and mechanics. The tasks provide a solid basis for the
investigation of further validation aspects, in particular cognitive validity. We have
designed twelve complete units around these tasks (together with the interactive
simulations).
COGNITIVE VALIDATION
Methods
For the aspect of cognitive validation, we focus on the students' cognitive
processes while working on the units. The key issue is whether experimental
considerations dominate in students' thinking while they try to solve the items:
are their actions driven by reflections on the experiment to be conducted or by other
aspects, such as operating the simulation software? To answer this research question,
four out of the twelve units from the three content areas are analyzed with
think-aloud techniques (intro- and retrospective). About 40 students worked on each
unit. The verbalizations of the students are rated in a deductive mode of qualitative
content analysis. We use indicators to distinguish between students' considerations
that are related to the process of experimentation (e.g. safety issues,
measurement accuracy, etc.) and considerations that are based on non-experimental
arguments (e.g. plausibility considerations). Table 2 shows examples for the item
"assemble and test the experimental setup" of the unit "elongation of a rubber band".
Table 2
Examples of experimental and non-experimental considerations in the unit
"Elongation of a rubber band"
[Table entries lost in extraction; columns: experimental considerations,
non-experimental considerations.]
Results
The analysis of the cognitive processes for the item "assemble and test the
experimental setup" of the rubber band unit shows that most students dominantly
express experimental considerations while working on this item (see Figure 3).
Further analyses will show whether this result can be confirmed for other items and
units.
[Figure 3: shares of experimental vs. non-experimental considerations for the item
"assemble experimental setup" (values shown: 58 %/42 % and 78 %).]
REFERENCES
Emden, M. (2011). Prozessorientierte Leistungsmessung des naturwissenschaftlich-experimentellen Arbeitens. Berlin: Logos.
Glug, I. (2009). Entwicklung und Validierung eines Multiple-Choice-Tests zur
Erfassung prozessbezogener naturwissenschaftlicher Grundbildung.
Christian-Albrechts-Universität zu Kiel.
Hammann, M., Phan, T. T. H., Ehmer, M. & Grimm, T. (2008). Assessing pupils'
skills in experimentation. Journal of Biological Education, 42(2), 66-72.
KMK. Sekretariat der Ständigen Konferenz der Kultusminister der Länder in der
Bundesrepublik Deutschland. (2005). Bildungsstandards im Fach Physik für
den Mittleren Schulabschluss. München: Luchterhand.
Nawrath, D., Maiseyenka, V., & Schecker, H. (2011). Experimentelle Kompetenz -
ein Modell für die Unterrichtspraxis. Praxis der Naturwissenschaften - Physik
in der Schule, 60(6), 42-49.
NRC. National Research Council. (2012). A Framework for K-12 Science
Education: Practices, Crosscutting Concepts, and Core Ideas. Washington, DC:
The National Academies Press.
OECD (ed.) (2007). PISA 2006 - Schulleistungen im internationalen Vergleich:
Naturwissenschaftliche Kompetenzen für die Welt von morgen. Bielefeld:
Bertelsmann.
Schreiber, N. (2012). Diagnostik experimenteller Kompetenz. Validierung
technologiegestützter Testverfahren im Rahmen eines
Kompetenzstrukturmodells. Berlin: Logos.
Schreiber, N., Theyßen, H., & Schecker, H. (2012). Experimental competencies in
science: a comparison of assessment tools. In C. Bruguière, A. Tiberghien, &
P. Clément (Eds.), E-Book Proceedings of the ESERA 2011 Conference: Science
INTRODUCTION
Every three years, the PISA surveys, launched by the OECD in 2000, aim to monitor
the outcomes of education systems in terms of 15-year-old pupils' achievements. The
PISA project sets out to implement educational goals in order to prepare the young
generation for a responsible adult life as citizens. With this purpose, the designers of
PISA consider that their specific concept of literacy, namely the capacity of students
to extrapolate from what they have learned and to apply their
knowledge in novel settings (OECD, 2007, p. 3), is relevant not only for the basic
competences of reading and mathematics, but also for a scientific literacy which is a
necessity in our scientific and technological society. This notion of scientific literacy
is defined as:
An individual's scientific knowledge and use of that knowledge to identify
questions, to acquire new knowledge, to explain scientific phenomena, and to
draw evidence-based conclusions about science-related issues, understanding of
the characteristic features of science as a form of human knowledge and enquiry,
awareness of how science and technology shape our material, intellectual, and
cultural environments, and willingness to engage in science-related issues, and
with the ideas of science, as a reflective citizen (OECD, 2006, p. 8).
Scientific literacy should also contribute to developing and strengthening interest in
science and technology, in particular to counteract the widespread disaffection
towards these areas, a current problem in western countries (Rocard et al., 2007). As
this problem is particularly pronounced for the physical sciences (Bøe, Henriksen,
Lyons, & Schreiner, 2011; Murphy & Whitelegg, 2006; Jenkins, 2006; Zwick &
Renn, 2000), we focus on this domain in the following.
PISA's choice of units to evaluate the scientific knowledge of 15-year-old pupils had
to take account of the fact that science curricula can differ widely between countries,
as regards both the school disciplines (integrated science, biology, physics,
chemistry, geology, astronomy, etc.) and the topics covered (for instance electricity,
optics or motion in the physical sciences). Furthermore, PISA aims to focus less on
the scientific knowledge of pupils than on their competences to understand and solve
scientific problems. Against this background, PISA opted for questions chosen in
areas of application of science such as Health, Environment, or Technology, which
give rise to debates in society and/or are connected with recent technological
progress whose consequences for society have to be discussed.
A central issue for PISA is that these questions are authentic and motivating for
young people, and it is this point of view that we analyze in this article, based on
empirical data from a survey with secondary level I pupils and teachers in Geneva.
The contribution is thus an extension of our preceding studies aimed at better
understanding and qualifying what PISA actually evaluates: a first paper on the
comparison between the PISA science units and the science curricula of
French-speaking Switzerland (Weiss, 2010) and a second one in which the
compatibility between Inquiry Based Learning (IBL) and the PISA science survey
was discussed (Weiss, submitted).
The paper is organized as follows: after giving a short theoretical background on the
notion of authenticity, we describe the choices PISA makes for its units and items to
be authentic. We then proceed with a description of the three released PISA units
chosen for our survey, of the sample, and of the instruments of the study. Results
from the pupils' and teachers' samples about their perception of these PISA units will
be discussed and compared with each other and with another study on authentic
60
Strand 11
learning (Kuhn, 2010). Finally, several conclusions about classroom implications and
future research will be discussed.
Pupils' sample
The perception of these units by pupils was tested within a sample of fourteen 8th and 9th
grade classes in lower secondary school in Geneva in June 2011. The collected data
concern 151 pupils (70 girls and 76 boys, 5 with gender not mentioned) from ten 9th grade
classes (118 pupils) and four 8th grade classes (33 pupils), distributed over ten higher
educational level classes (A level, 129 pupils) and four lower achieving level classes
(B level, 22 pupils). These classes belong to four lower secondary schools and are
taught by six physics teachers.
Teacher sample
A panel of 20 persons involved in secondary school teaching and/or in teachers' pre-service
training was asked the same questions as the pupils (see below). These
persons were two university teachers, six teacher trainers who themselves teach in
secondary school, and twelve young teachers at the end of their pre-service training
(already having their own classes). In the following we refer to all of them as teachers.
Instruments
Motivational variables were assessed with an instrument well established in the
literature on science motivation (adapted from Hoffmann et al., 1997; total
Cronbach's α = .93) with the following subscales: intrinsic interest (IE: α = 0.89),
reality connection/authenticity (RA: α = 0.95) and self-efficacy/self-concept (SC:
α = 0.89); for details see Kuhn (2010). The instrument was translated into French and
adapted to the particular situation of a survey without actual teaching with the
PISA units. Pupils had to evaluate the authenticity and the interest of three PISA units
by reading them, without having to answer the items (although some pupils did).
The questions concerned the connection of the PISA units to out-of-school
life, the utility of solving them for our society, the pupils' desire to learn more and to
talk with friends about the question, and the pupils' perception that they would be
effective in learning physics through these questions. In this questionnaire, RA and IE
were each assessed through 7 items, SC through 10 items.
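For readers who want to reproduce such reliability figures, Cronbach's α can be computed from a respondents-by-items score matrix as below. This is a generic sketch with invented data, not the authors' analysis code; the function name is ours.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents' item-score rows."""
    k = len(scores[0])  # number of items in the subscale

    def sample_var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    # Variance of each item across respondents, and of the row totals.
    item_vars = [sample_var([row[j] for row in scores]) for j in range(k)]
    total_var = sample_var([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Perfectly consistent (invented) responses give alpha = 1.0.
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```

In practice one would run this separately on the 7 RA items, the 7 IE items and the 10 SC items.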
A similar but shorter questionnaire, without the SC questions, was prepared for the teachers.
The teachers answered about five PISA units: the same three as the pupils, plus two
more to check whether the results could be generalized to other PISA units.
For the results given below, motivation test scores on each sub-dimension are given as
a percentage of the maximum possible value.
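As a minimal illustration of this normalization (assuming a Likert-type rating scale; the raw total below is invented, but the 7-item subscale length matches the questionnaire described above):

```python
def percent_of_max(raw_total, n_items, scale_max):
    """Express a subscale raw total as a percentage of its maximum possible value."""
    return 100.0 * raw_total / (n_items * scale_max)

# Hypothetical RA subscale: 7 items rated 1-5, raw total of 17 points.
print(round(percent_of_max(17, 7, 5), 1))  # 48.6, i.e. below the 50% mark
```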
RESULTS
Pupils' perceptions
Data show that pupils perceive the PISA units as not very realistic (RA < 50%) and
even less interesting (IE ≈ 40%), as shown in Figure 1.

[Figure 1: bar chart of pupils' perceptions of the units Sunscreens, Greenhouse and
Clothes, and of the three PISA units combined, on the dimensions RA, IE, SC and Mot]
Figure 1. Pupils' perceptions about three PISA units. RA measures reality
connection/authenticity, IE the intrinsic interest and SC the self-concept. Mot is the
sum of the 3 dimensions.
Teachers' perceptions
The teachers' perception of the motivational features of the PISA items lies considerably
above that of the pupils, as shown in Figure 2.

[Figure 2: bar chart of teachers' perceptions of the units Sunscreens, Greenhouse,
Clothes, Grand Canyon and Acid rain, and of the PISA units combined, on the
dimensions RA, IE and Mot (2 dim)]
Figure 2. Teachers' perceptions about five PISA units (the three evaluated by
pupils plus Grand Canyon and Acid rain). RA measures reality
connection/authenticity, IE the intrinsic interest and Mot (2 dim) is the sum of the 2
dimensions.
The differences between teachers' and pupils' perceptions of the PISA units are all
statistically significant at the level p < 0.001 (apart from Clothes, where no significant
differences were found) and the effect sizes are high, as shown in Table 1.

Table 1
Effect sizes (Cohen's d) of the differences between teachers' and pupils' perceptions.
The significance level of all differences is p < .001.

                 RA     IE
Sunscreens       1.24   1.72
Greenhouse       1.47   1.87
PISA (3 units)   1.15   1.43
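The effect sizes above are Cohen's d values. A minimal sketch of the pooled-standard-deviation formula, with invented rating vectors, might look like this (not the authors' actual computation):

```python
import math

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd

# Hypothetical teacher vs. pupil ratings on one unit.
d = cohens_d([4, 5, 4, 5], [2, 3, 2, 3])
```

Values above 0.8 are conventionally regarded as large effects, which is the case for all entries in Table 1.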
REFERENCES
Bennett, J., Lubben, F., & Hogarth, S. (2007). Bringing science to life: A synthesis of
the research evidence on the effects of context-based and STS approaches to
science teaching. Science Education, 91(3), 347-370.
Bøe, M.V., Henriksen, E.K., Lyons, T. & Schreiner, C. (2011). Participation in
science and technology: Young people's achievement-related choices in late
modern societies. Studies in Science Education, 47(1), 37-72.
Fensham, P. J. (2009). Real world contexts in PISA science: Implications for
context-based science education. Journal of Research in Science Teaching, 46, 884-896.
Hattie, J. (2009). Visible learning: A synthesis of over 800 meta-analyses relating
to achievement. London, New York: Routledge.
Hoffmann, L., Häußler, P. & Peters-Haft, S. (1997). An den Interessen von Mädchen
und Jungen orientierter Physikunterricht: Ergebnisse eines BLK-Modellversuches. Kiel: IPN.
Jenkins, E.W. (2006). The student voice and school science education. Studies in
Science Education, 42, 49-88.
Kuhn, J. (2010). Authentische Aufgaben im theoretischen Rahmen von Instruktions-
und Lehr-Lern-Forschung: Effektivität und Optimierung von Ankermedien für
eine neue Aufgabenkultur im Physikunterricht. Wiesbaden: Vieweg + Teubner.
Murphy, P. & Whitelegg, E. (2006). Girls in the physics classroom: A review of the
research on the participation of girls in physics. London, UK: Institute of Physics.
http://oro.open.ac.uk/6499/, accessed 20/05/2013.
OECD. (2006). Assessing scientific, reading and mathematical literacy: A framework
for PISA 2006. Paris: OECD.
OECD. (2007). PISA 2006: Science competencies for tomorrow's world. Volume 1:
Analysis. Paris: OECD.
Rocard, M., Csermely, P., Jorde, D., Lenzen, D., Walberg-Henriksson, H. & Hemmo,
V. (2007). Science education now: A renewed pedagogy for the future of
Europe. Brussels: EU, Directorate-General for Research, Science, Economy and
Society.
Shaffer, D. W., & Resnick, M. (1999). "Thick" authenticity: New media and authentic
learning. Journal of Interactive Learning Research, 10(2), 195-215.
Weiss, L. (submitted). PISA-sciences est-il IBL-compatible? Recherches en
didactique.
Weiss, L. (2010). L'enseignement des sciences au secondaire obligatoire en Suisse
romande, au regard des enquêtes internationales sur la culture scientifique des
jeunes. Revue Suisse des Sciences de l'Éducation, 32(3), 393-420.
Zwick, M. & Renn, O. (2000). Die Attraktivität von technischen und
ingenieurwissenschaftlichen Fächern bei der Studien- und Berufswahl junger
Frauen und Männer. Stuttgart: Akademie für Technikfolgenabschätzung.
Hughes, & Mylonas, 2002). Additionally, few science education studies have focused
on peer assessment (Crane & Winterbottom, 2008), especially at the primary (Harlen,
2007) and secondary school levels (Tsivitanidou, Zacharia, & Hovardas, 2011).
Consequently, we do not have a thorough picture of what primary and secondary
school students can do in a peer assessment context, especially in terms of the
heuristics that secondary school students use when revising their science
web-portfolios based on the peer and expert feedback received. Such evidence is essential,
since peer assessment is gaining ground in participative inquiry-based science
learning environments, especially computer-supported inquiry learning environments
(e.g., de Jong et al., 2010; 2012).
METHOD
Participants
Participants were 28 seventh graders (14-year-olds) from two different
classes (Nclass1 = 14 and Nclass2 = 14) of a public school (Gymnasium) in Nicosia,
Cyprus. Participants were guaranteed anonymity and were assured that the study would not
contribute to their final grade. All students had prior experience with reciprocal
peer assessment, since they had participated in reciprocal peer assessments of
web-portfolios whose content came from an environmental science context similar to (but
not the same as) the one involved in this study.
Material
Students studied web-based material that was developed for the purposes of the SCY
(Science Created by You) project (de Jong et al., 2010) and concerned the construction
of CO2-friendly houses, namely, houses built with specific modifications during the
building and operation phases in order to produce lower CO2 emissions than
conventional houses. This learning material required students to create a number
of learner artifacts (e.g., concept maps, tables, text), which were included in the
students' web-portfolios. For the peer-assessment purposes, students worked with
Stochasmos, a web-based learning platform that supports collaborative
learning in an inquiry-based environment (Kyza, Michael, & Constantinou, 2007). We
chose Stochasmos because the participants were already familiar with it and it had the
Procedure
Each student group used a computer and the web-based platform to access the
curriculum material, follow the activity sequence and complete the accompanying
tasks. Each of these tasks corresponded to the development of a learner product which
was included in the students' web-portfolio. Each home group created nine artifacts.
We chose a reciprocal peer assessment approach and employed an online and
anonymous peer assessment format. Participants worked in groups of two (home
groups) while developing learner artifacts (see Figure 1). However, they carried out the
role of peer assessor on an individual basis.
After all students had completed all tasks, peer assessors could access the web-portfolio of
the peer group they were to assess, which was randomly assigned to them. Each
web-portfolio (all the learner artifacts included in a science web-portfolio) was assessed by
two peers from the same home group, who worked on different computers (see Figure
1). Each web-portfolio was also assessed by an expert.
RESULTS
The analysis of screen-captured data allowed for the identification of students'
behavioral patterns (responses/actions of peer assessees) during the feedback-review
and web-portfolio-revision phase. By comparing the time-line graphs of peer assessees'
actions, we identified four different patterns/profiles (see Figure 2).
The first example presents peer assessees who studied both the expert and peer
feedback, but used the feedback that they had produced as peer assessors to filter it,
before making any change. The second example presents peer assessees who read
both the expert and peer feedback and considered the expert feedback to be more
valuable, but in the end did not make any changes. The third example presents peer
assessees who made changes while taking into consideration both the expert and peer
feedback. The fourth example presents peer assessees who quickly scanned both the
expert and peer feedback and then concentrated only on the expert feedback, which
they partially filtered through the use of the feedback they produced for others as peer
assessors before they proceeded with making changes to their learning products.
Figure 2. Four representative graphs of peer assessee actions over time. Graph 1 is for peer assessee group 3 (Konstantinos and Yannis),
Graph 2 is for group 7 (Tonia and Marcos), Graph 3 is for group 10 (Despina and Maria), and Graph 4 is for group 2 (Sotia and Nikos).
The y-axis represents peer assessee actions/responses and the x-axis gives the time in seconds. The codes for the y-axis are as follows: (1)
Reading expert feedback; (2) Reading peer feedback A; (3) Reading peer feedback B; (4) Reading own feedback A; (5) Reading own
feedback B; (6) Revisiting own learner products; (7) Revising own learner products; (8) Opening Stochasmos learning environment; (9)
Opening own web-portfolio; (10) Using software other than Stochasmos (e.g., MS Office); (11) Discussion between group members; (12)
Revisiting primary information resources from Stochasmos platform; (13) Browsing for information on the web.
Despite the differences among these four examples, there are similarities that lead to
important conclusions about the enactment of peer assessment:
(a) All assessee groups read both the peer and the expert feedback, focusing on
negative/critical judgments. Peer assessees tended to skip the positive judgments, or to
spend less time reading them, and focused primarily on the negative judgments.
(b) Many assessee groups tended to access the feedback they themselves had
produced as peer assessors after they went through the expert and peer feedback, and used
it as a point of reference. In most cases where assessee groups accessed their own
feedback, the changes that were actually adopted related to the content of the
science portfolios. In fact, half of the assessee groups accessed the feedback they
themselves had produced as peer assessors after they went through the expert and peer
feedback (e.g., graphs 1 and 4).
The analysis showed that the more time assessee groups devoted to reviewing their own
feedback, the greater the number of changes they eventually adopted (Kendall's tau-b
= 0.47; p < 0.05). This result reveals the crucial role of a hidden factor, namely
own feedback, whose effect was not known to us at the beginning of this study
because it had not been reported elsewhere in the literature of the domain. Surprisingly,
many students (50% of the peer assessee groups) were found to adopt changes,
including science-content-related changes, that overlapped all three types of
feedback. Also, comparing the changes suggested by peers with those
suggested by the expert revealed that students compared the peer and expert
feedback and made actual changes only when the suggested changes overlapped.
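The reported statistic is Kendall's tau-b, the tie-corrected rank correlation. A self-contained sketch (not the authors' analysis code; the example data are invented) is:

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1          # pair tied on x
        if dy == 0:
            ties_y += 1          # pair tied on y
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / math.sqrt((n0 - ties_x) * (n0 - ties_y))

# Hypothetical data: review time (s) vs. number of adopted changes per group.
tau = kendall_tau_b([120, 90, 200, 60], [3, 2, 5, 2])
```

Tau-b is appropriate here because the number of adopted changes takes few distinct values, so tied pairs are likely.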
Finally, the number of changes eventually adopted by assessee groups amounted to
slightly more than one-fifth of those recommended (42 out of 186, 22.58%). Adopted
changes referred either to the science content (23 changes) or to the
appearance/organization (19 changes) of the web-portfolios.
REFERENCES
Ballantyne, R., Hughes, K. & Mylonas, A. (2002). Developing procedures for
implementing peer assessment in large classes using an action research process.
Assessment & Evaluation in Higher Education, 27(5), 427-441.
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment.
Educational Assessment, Evaluation and Accountability (formerly: Journal of
Personnel Evaluation in Education), 21(1), 5-31.
Crane, L., & Winterbottom, M. (2008). Plants and photosynthesis: peer assessment to
help students learn. Journal of Biological Education, 42, 150-156.
De Jong, T., Van Joolingen, W., Giemza, A., Girault, I., Hoppe, U., Kindermann, J.,
Kluge, A., Lazonder, A., Vold, V., Weinberger, A., Weinbrenner, S.,
Wichmann, A., Anjewierden, A., Bodin, M., Bollen, L., d'Ham, C., Dolonen,
J., Engler, J., Geraedts, C., Grosskreutz, H., Hovardas, T., Julien, R., Lechner,
J., Ludvigsen, S., Matteman, Y., Meistadt, Ø., Næss, B., Ney, M., Pedaste,
M., Perritano, A., Rinket, M., von Schlanbusch, H., Sarapuu, T., Schulz, F.,
Sikken, J., Slotta, J., Toussaint, J., Verkade, A., Wajeman, C., Wasson, B.,
Zacharia, Z., & van der Zanden, M. (2010). Learning by creating and
exchanging objects: The SCY experience. British Journal of Educational
Technology, 41(6), 909-921.
De Jong, T., Weinberger, A., Van Joolingen, W., Giemza, A., Girault, I., Hoppe, U.,
Kindermann, J., Kluge, A., Lazonder, A., Vold, V., Weinbrenner, S.,
Wichmann, A., Anjewierden, A., Bodin, M., Bollen, L., d'Ham, C., Dolonen,
J., Engler, J., Geraedts, C., Grosskreutz, H., Hovardas, T., Julien, R., Lechner,
J., Ludvigsen, S., Matteman, Y., Meistadt, Ø., Næss, B., Ney, M., Pedaste,
M., Perritano, A., Rinket, M., von Schlanbusch, H., Sarapuu, T., Schulz, F.,
Sikken, J., Slotta, J., Toussaint, J., Verkade, A., Wajeman, C., Wasson, B.,
Zacharia, Z., & van der Zanden, M. (2012). Using scenarios to design complex
technology-enhanced learning environments. Educational Technology
Research and Development, 60, 883-901.
Gielen, S., Peeters, E., Dochy, F., Onghena, P., & Struyven, K. (2010). Improving the
effectiveness of peer feedback for learning. Learning and Instruction, 20, 304-315.
Harlen, W. (2007). Holding up a mirror to classroom practice. Primary Science
Review, 100, 29-31.
Kyza, E., Michael, G., & Constantinou, C. (2007). The rationale, design, and
implementation of a web-based inquiry learning environment. In C.
Constantinou, Z. C. Zacharia, & M. Papaevripidou (Eds.), Contemporary
perspectives on new technologies in science and education, Proceedings of the
Eighth International Conference on Computer Based Learning in Science (pp.
531-539). Crete, Greece: E-media.
Sluijsmans, D. M. A. (2002). Student involvement in assessment: The training of
peer-assessment skills. Interuniversity Centre for Educational Research.
Topping, K. (1998). Peer assessment between students in colleges and universities.
Review of Educational Research, 68, 249-276.
Tsivitanidou, O., Zacharia, Z. C., & Hovardas, A. (2011). High school students'
unmediated potential to assess peers: Unstructured and reciprocal peer
assessment of web-portfolios in a science course. Learning and Instruction, 21,
506-519.
Van Gennip, N. A. E., Segers, M. S. R., & Tillema, H. H. (2010). Peer assessment as
a collaborative learning activity: The role of interpersonal variables and
conceptions. Learning and Instruction, 20, 280-290.
RATIONALE
To know more precisely what is effectively assessed, we need to understand what makes
a test question difficult. Indeed, if we are not able to understand why some questions are
more difficult than others, it means that we do not really know what we are measuring.
Little research shows a concern for construct validity in examination questions, i.e. for
whether a question measures what it claims to measure (e.g. Ahmed & Pollitt, 1999). Pollitt et al.
(1985) identified sources of difficulty (SODs) and sources of easiness (SOEs) through
empirical work on Scottish examinations in five subjects (Mathematics, Geography,
Science, English and French). Examination scripts were analyzed statistically in
order to identify the most difficult questions. The students' answers to these questions
were then analyzed with the aim of discovering the most common errors made when
answering them. From these errors, the authors hypothesized that there were
certain sources of difficulty (SODs) and of easiness (SOEs) in the questions. They
proposed three different categories of SODs: concept difficulty, which is the intrinsic
difficulty of the concept itself; process difficulty, meaning the difficulty of the cognitive
operations and the demands made on a candidate's cognitive resources; and question
difficulty, which may be rooted in the language of the questions, the presentation of the
questions, etc.
In order to verify whether the hypothesized SODs affect students' performance,
questions from a mathematics examination (GCSE, the General Certificate of Secondary
Education) were manipulated and rewritten in order to remove some specific SODs (e.g.
Fisher-Hoch & Hughes, 1996; Fisher-Hoch et al., 1997). Results show that performance
can be influenced quite significantly by small variations in the questions, i.e. by
removing or adding a source of difficulty (context of the question, ambiguous
resources, etc.). The authors propose an analysis of the sources of difficulty in exam questions
that would enable us to develop questions of higher construct validity and to effectively
target different levels of difficulty. Ahmed and Pollitt (1999) investigated what
makes questions demanding and developed a scale of cognitive demands in four
dimensions: complexity of the question (number of operations that have to be carried
out), abstraction (to what extent the student has to deal with ideas rather than concrete
objects or events), resources (text, diagram, picture, etc.) and strategy. They show that
questions with higher cognitive demands in terms of the four dimensions on the
scale tend to have more SODs occurring at different stages of the answering process.
Webb (1997) developed the Depth of Knowledge (DOK) model to analyze the cognitive
expectations demanded by assessment tasks. Webb's DOK levels for science (Webb,
1997; Hess, 2005) form a scale of cognitive demand reflecting the cognitive complexity of
the question. The DOK level assigned should reflect the complexity of the cognitive
processes demanded by the task outlined by the objective. Ultimately, the DOK level
describes the kind of thinking required by a task, not necessarily whether or not the task
is difficult. The DOK, or the cognitive demand of what students are expected to be able
to do, is related to the number and strength of the connections within and between ideas.
The DOK required by an assessment is related to the number of connections of concepts
or ideas a student needs to make in order to produce a response, the level of reasoning,
and the use of other self-monitoring processes. It should reflect the level of work students
are most commonly required to perform in order for the response to be deemed
acceptable. The DOK levels name four different ways students interact with content. Each
level depends on how deeply students must understand the content in order to respond.
As mentioned above, the Webb levels do not necessarily indicate degree of difficulty, in
that DOK Level 1 can ask students to recall or restate either a simple or a much more
complex concept or procedure. Recall of a well-known concept will correspond to a low
degree of difficulty (students' high scores), whereas recall of a concept that is not known
by a majority of students will lead to a high degree of difficulty of the item (students' low
scores). However, in both cases, the cognitive demand or DOK level is 1. Conversely,
understanding a concept in depth is required to be able to explain how/why a concept
works (Level 2), to apply it to real-world phenomena with justification/supporting evidence
(Level 3), or to integrate one concept with other concepts or perspectives (Level 4).
Pollitt and Ahmed (2000) investigated how context affects students' processes of
answering an examination science question. Their analysis of contextualized science
questions shows how school students can be misled and biased by the improper use of
context. The concept of difficulty used in the studies presented above, and in ours, is
embedded in the interactions between the student and the item itself. Therefore it is not
possible to attribute the source of difficulty either to the item or to the student alone. Indeed, in
our a priori analysis of the items, we engage with a representation of the students interacting
with the items in order to anticipate the students' difficulties. Consequently, in this paper,
students' difficulties and items' difficulties will not be systematically differentiated.
On the other hand, the level of difficulty defined by PISA, based on statistical analysis of
students' scores, is clearly different, and it will always be called the PISA level of difficulty.
Based on these studies, our hypothesis is that different factors influence the difficulty or
easiness of an item. The factors that we select in our analysis are:
- the cognitive demand (or cognitive complexity), based on Webb's DOK levels,
required to produce a right answer for a PISA item, which we determine in an a priori
analysis;
- the difficulties of the vocabulary found in the text of the unit, determined in an a
priori analysis and confirmed by the observation of students in situation; note
that the PISA main assessment consists of a series of units; each unit has an
introduction with a text and possibly photos, drawings, or diagrams presenting a
situation, followed by a series of questions called items (an example of an
introduction and one item is given in figure 1);
- the context of the item, that is, the distance between the context of the item from
which the student should extract and apply information and the context in which
they have probably already made use of this information; we determine it in
an a priori analysis and confirm it by the observation of students in situation;
- the question format (open answer, multiple choice, complex multiple choice),
determined in an a priori analysis. Moreover, in a previous study (authors, 2012)
concerning the effective competencies involved in PISA items, we observed that
some PISA items may support answering strategies that do not require
understanding the item but that can in some cases lead to the right answer. This
constitutes a potential misinterpretation of the PISA level
of difficulty. Therefore, we take into account the answering strategies that can be
METHODS
For this study, we proceed in two steps: first, an a priori analysis and, secondly, an
analysis of students' processes while answering PISA items. The a priori analysis
consists of:
- analyzing each unit according to several criteria to select a set of units and items
to test;
- characterizing the cognitive demand required to produce a right answer for each
selected item;
- characterizing the difficulties due to the vocabulary in the text of each selected
PISA item.
To select relevant units for studying French students' processes, we base our a priori
analysis of all the units on the following criteria:
- the diversity of scientific knowledge required for the item (knowledge of science
or about science, included in the school curriculum or not, everyday knowledge, etc.);
- the different content areas tested in the item (Physical systems, Living systems,
Earth and space systems, Technology systems, Scientific Inquiry, etc.);
- the usefulness or not of the unit introduction for answering the items, the item format
(multiple choice, open, etc.), and the competency evaluated by PISA for the item;
- the scores obtained for each item in France compared to OECD average scores.
To characterize the cognitive demand, we use the four levels of cognitive complexity
proposed by Webb (1997). Level 1 (Recall and Reproduction) requires recall of
information, such as a fact, definition or term, or performance of a simple process or
procedure. Level 2 (Skills and Concepts) involves some mental processing beyond
recalling or reproducing a response. This level includes items requiring students to
make some decisions as to how to approach the question or problem.
Such actions imply more than one cognitive process. Level 3 (Strategic Thinking)
requires deep understanding, as exhibited through planning, using evidence, and more
demanding cognitive reasoning. The cognitive demands at Level 3 are complex and
abstract. An assessment item that has more than one possible answer and requires
students to justify the response they give would most likely be at Level 3. At Level 4
(Extended Thinking) students are expected to make connections, relate ideas within the
content or among content areas, and select or devise one approach among many
alternatives for how the situation can be solved.
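The four levels can be kept at hand as a small coding table when rating items; the dictionary below mirrors the level names used in the text, while the item label and helper function are purely illustrative, not part of the authors' instruments:

```python
# Webb's DOK level names as listed in the text above.
DOK_LEVELS = {
    1: "Recall and Reproduction",
    2: "Skills and Concepts",
    3: "Strategic Thinking",
    4: "Extended Thinking",
}

def code_item(item_label, level):
    """Record an a priori DOK coding for one item (the label is hypothetical)."""
    return f"{item_label}: DOK {level} ({DOK_LEVELS[level]})"

# e.g. code_item("Unit X, item 1", 3)
```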
To characterize the vocabulary difficulties in the text of PISA items, in the a priori
analysis we first note all the items in which students may have difficulties with certain
words; secondly, while the students answer the PISA items, we check whether they
actually encounter these difficulties.
The second step consists of analyzing the answering processes from students' oral and
written data, collected while they construct their answers and/or during the subsequent
interview. The data used for these analyses consist of the video and audio recordings of
21 students (9 pairs and 3 answering individually), 15 years old (the age of the PISA
evaluation), who differ in their academic level (8 high achievers and 13 low achievers).
During the interview, we asked the students to make their thinking explicit about
how they answered the questions, using the explicitation interview technique (Vermersch &
Maurel, 1997). To analyze the videotapes, we used the software Transana
(http://www.transana.org).
All 21 students had the same 23 items from 10 PISA units in their questionnaire.
First, the a priori analyses of the questions and the case studies of students allow us to
confirm (or not) the sources of difficulty or easiness that we proposed. Then we use the
confirmed sources to interpret the students' scores in France and the differences observed
between French and OECD countries' scores.
RESULTS
After comparing the sources of difficulty or easiness observed when students answered
the 30 selected PISA science 2006 items to the factors we hypothesized, we propose a
classification of these PISA items according to the main factors that can play a
role in students' answers.
Our observations confirm the sources of difficulty or easiness in PISA items that we
proposed: cognitive complexity, familiarity/unfamiliarity of the item context, item
format, and vocabulary. Thus, these SODs, some of which have been described in the
literature on student assessment related to curricula, were also observed for PISA items, even if the
importance of the different SODs might differ between the two types of evaluation. The
role and importance of the answering strategy, already found in our previous study
(authors, 2012), are confirmed, and moreover the role of students' knowledge is highlighted.
Table 1 shows the results obtained for the 30 selected PISA items. For each PISA item,
we report the Webb DOK level for science determined in our a priori analysis. In the
second column, we indicate, with two levels (familiar/unfamiliar), our appreciation of the
distance between the context of the item, from which the student should extract and apply
information, and the context in which they have probably already made use of this
information. The appreciation of the context is highly cultural and can differ from one
country to another. Moreover, an item about a topic taught in the curriculum does not
necessarily imply a familiar context for the student. For instance, one PISA unit is about
genetically modified crops, which are not in the curriculum in France for a 15-year-old
student. Nevertheless, genetically modified crops are a controversial and highly publicized
topic in France, and thus the context of the situation on which this item is based is
familiar for French students. Likewise, the teaching of a topic in the French
curriculum does not necessarily imply that the context of an item connected to this topic
will be familiar to the student. For instance, one item is about evaporation, which is
taught in grade 7 in France. But the item situation (how to transform salt water into
drinkable water) is unusual for French students and makes the item context unfamiliar for
them.
The third column indicates the question format, and the fourth column describes the
familiarity of the item's vocabulary. Then, for each selected item, we report the item's PISA
level of difficulty according to the OECD, the OECD and French scores, and finally the
numbers of right, wrong and missing answers obtained in our sample. We classify the
items into six groups according to the factors that we consider most relevant in explaining
the items' different PISA levels of difficulty as given by students' scores: the level of
complexity, the answering strategy, and the unfamiliarity or familiarity of the item context.
Items with a high level of complexity and a high PISA level of difficulty for students:
The PISA level of difficulty is reflected in the students' low scores. The level of
complexity of the items is enough to explain the difficulty.
Items with a high or medium level of complexity which can be solved with
answering strategies (matching, association) without understanding the text:
The level of complexity can be as high as in the first group, but the scores are better.
Indeed, we observed that, for these items, students can use answering strategies leading
to the right answer even if they do not understand the text and the aim of the item or the
experiment. Two major answering strategies appear, in particular with low achievers:
- Matching words: students search for some wording consistency between the words of
the item and words in the leading text of the unit or in the introduction of the item.
- Association of action verbs: for instance, in one item students associate the words
cooling and reduce (as opposed to increase) and thereby choose the right answer.
Items whose contexts are familiar but for which students do not have the knowledge:
The level of complexity is low, but the students show a lack of knowledge.
Items whose contexts are unfamiliar although students do have the knowledge:
The level of complexity is low; even though the knowledge is common, students are not
able to mobilize it in a context other than the one they are used to.
Items permitting answering strategies (such as matching or association) that can lead
students to wrong answers:
The level of complexity is low, but students, in particular low achievers, use answering
strategies such as matching words or associating action verbs, leading to wrong answers.
First, we observe that higher-complexity items correspond to the items with the highest
levels of difficulty (from 1 to 6, assigned by PISA according to students' scores).
Nevertheless, our results reveal that factors other than complexity can influence the
difficulty or easiness of an item. Indeed, the item format (open answer, multiple choice,
complex multiple choice) also appears to have an impact on the difficulty of an item. For
instance, in Table 1 we can observe that all items requiring an open answer display the
highest PISA levels of difficulty (between 3 and 5). We made the same observation across
the whole PISA 2006 science evaluation scale. The context of the item, from which the
student should extract and apply information, appears to influence the item difficulty as
well. This is particularly noticeable for items coded S408Q03, S304Q03a, S304Q03b,
S268Q06, and S447Q04, which have a low DOK level (from 1 to 2) but show a high
PISA level of difficulty. The reason for this difficulty is very likely their unfamiliar
contexts. Our analysis shows that students have the knowledge required by these items,
but that knowledge is difficult to mobilize in this context. Our analysis also reveals that
the answering strategies available to students can influence the PISA level of difficulty.
These strategies, such as matching words between the item introduction and the different
propositions in the case of a multiple-choice item, or associating action verbs (for instance
cooling and reducing in one item), are mostly used by low achievers without understanding
or representing the aim of the question (Authors, 2012). They can lead to the right answer
even if the item displays a high complexity (for instance S476Q03 or S268Q01), but they
can also lead to wrong answers (for instance S213Q01). Consequently, common answering
strategies such as matching or associating action verbs should be taken into account in the
a priori analysis of the PISA level of difficulty of a question.
Moreover, from our observations of students solving PISA items, it appears that the
familiarity of the vocabulary obviously influences the understanding and solving of an
item. However, we cannot draw clear conclusions from our classification about a supposed
correlation between unfamiliar vocabulary and the items' PISA levels of difficulty.
Our results show clearly that, compared to the OECD average, French students have the
same level of difficulty in solving items of high cognitive complexity. By contrast, they
obtain lower scores on some rather easy items requiring knowledge that they apparently
do not have.
Furthermore, we observe a link between high-complexity PISA items and a high level of
difficulty, whereas low-complexity PISA items display a large range of levels of
difficulty. We conclude that, in the PISA evaluation, it appears possible to anticipate a
high level of difficulty for an item displaying a high complexity. In the case of a low
DOK level item, the complexity level is not a sufficient factor to predict item difficulty.
Factors other than complexity have to be taken into account while creating items, both to
predict the difficulties they may present and to specify what is actually assessed. In
particular, if we want to evaluate low achievers and understand their actual level, some
new PISA items with a low level of difficulty need to be added.
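The interplay of these factors can be sketched as a toy a priori screening rule. The thresholds, labels and the `Item` fields below are illustrative assumptions for the sake of the sketch, not the authors' calibrated analysis grid:

```python
from dataclasses import dataclass

@dataclass
class Item:
    dok: int                 # Webb's Depth-of-Knowledge level (1-4)
    fmt: str                 # "open", "mcq" or "complex_mcq"
    context_familiar: bool   # is the item context familiar to the student?
    strategy_prone: bool     # can matching/association strategies reach an answer?

def predict_difficulty(item: Item) -> str:
    """Toy a priori difficulty screen reflecting the factors discussed above."""
    if item.dok >= 3:
        return "high"            # complexity alone explains the difficulty
    if not item.context_familiar:
        return "high"            # knowledge is hard to mobilize in a new context
    if item.fmt == "open":
        return "medium-high"     # open-answer items skew towards higher difficulty
    if item.strategy_prone:
        return "unpredictable"   # strategies may raise or lower observed scores
    return "low"
```

The ordering of the checks mirrors the finding above: high DOK is a reliable predictor on its own, while for low-DOK items context, format and answering strategies dominate.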
IMPLICATIONS
Our study shows that factors other than the complexity of a PISA item can influence the
PISA level of difficulty or easiness, and thus students' scores. The analysis shows the
variety of possible student difficulties in answering questions: item context, question
format, and vocabulary. It also shows that available answering strategies can influence the
PISA level of difficulty. Our study shows that drawing a simple relationship between a
wrong answer and a lack of competency carries the risk of misinterpreting students'
competency. It appears that French students show lower scores compared to the OECD
average on some items requiring knowledge, whereas no score difference is observed for
high-complexity items. Furthermore, we observe a link between high-complexity PISA
items and a high PISA level of difficulty, whereas low-complexity PISA items display a
large range of PISA levels of difficulty. We conclude that, in the PISA evaluation, it
appears possible to anticipate a high PISA level of difficulty for an item displaying a high
complexity. In the case of a low DOK level item, the complexity level is not a sufficient
factor to predict item difficulty. Factors other than complexity have to be taken into
account while creating items, both to predict the difficulties they may present and to
specify what is actually assessed. In particular, if we want to evaluate low achievers and
understand their actual level, some new PISA items with a low level of difficulty need to
be added. Our results could be of interest to policy-makers and large-scale assessment
developers, and also particularly to teachers reflecting on the science evaluation carried
out in class. They could alert teachers to the variety of difficulties, which they may not
suspect, that their students might have with their own assessments.
REFERENCES
Ahmed, A. & Pollitt, A. (1999). Curriculum demands and question difficulty. Paper
presented at the Annual Conference of the International Association for Educational
Assessment, Slovenia. http://www.iaea.info/documents/paper_1162a1d9f3.pdf
Authors (2014). Which effective competencies do students use in PISA assessment of
scientific literacy? In C. Bruguière, A. Tiberghien & P. Clément (Eds.), ESERA
2011 Selected Contributions. Topics and trends in current science education.
Springer.
Bybee, R., Fensham, P., & Laurie, R. (2009). Scientific Literacy and Contexts in PISA
2006 Science. Journal of Research in Science Teaching, 46, 862-864.
Fisher-Hoch, H. & Hughes, S. (1996). What makes mathematics exam questions difficult?
Paper presented at the British Educational Research Association Annual
Conference, University of Lancaster.
http://www.leeds.ac.uk/educol/documents/000000050.htm
Fisher-Hoch, H., Hughes, S. & Bramley, T. (1997). What makes GCSE examination
questions difficult? Outcomes of manipulating difficulty of GCSE questions. Paper
presented at the British Educational Research Association Annual Conference,
University of York. http://www.leeds.ac.uk/educol/documents/000000338.htm
Hess, K. (2005). Applying Webb's Depth-of-Knowledge (DOK) Levels in Science.
http://www.nciea.org
OECD (2007). PISA 2006: Technical Report. Paris: OECD Publications.
Olsen, R. & Lie, S. (2010). Profiles of Students' Interest in Science Issues around the
World: Analysis of data from PISA 2006. International Journal of Science
Education, 33, 97-120.
Pollitt, A. & Ahmed, A. (2000). Comprehension Failures in Educational Assessment.
Paper presented at the European Conference on Educational Research, Edinburgh.
http://www.cambridgeassessment.org.uk/ca/digitalAssets/113787_Comprehension_
Failures_in_Educational_Assesment.pdf
Vermersch, P. & Maurel, M. (1997). Pratique de l'entretien d'explicitation. Paris:
ESF.
Webb, N. (1997). Criteria for Alignment of Expectations and Assessments in
Mathematics and Science Education. Research Monograph No. 6. Washington,
D.C.: CCSSO.
INTRODUCTION
The Brazilian Ministry of Education (MEC) has organized a national test for students at the end of
high school (ENEM) since 1998. In 2009, major changes were introduced, which attracted a great
number of students, not only those actually in the last year of high school. Everyone who aspires
to a university degree seems to have been encouraged to take the test, given the reward introduced
for a good score in the form of a place in a free public university. Students have been challenged
to achieve the highest possible mark, which enables them to apply for a place through a
computerized system (SISU) provided by MEC, which compares students' ENEM scores and
assigns seats in public universities all over the country. In 2013, over seven million students were
enrolled in ENEM, competing for places in free public universities. In addition to SISU, students
can compete for over 170,000 scholarships in private universities (PROUNI), which can cover as
much as 100% of the tuition fees, given some conditions related to students' socioeconomic status.
According to official MEC information, about 110,000 students enrolled in the first version of the
then national test in 1998, and no one could have believed that seven million people would be
enrolled in the same test fifteen years later (2013 exam), competing for about 170,000 places in
public universities throughout Brazil (January 2014).
ENEM is known for avoiding traditional questions, which rely heavily on the recollection of
factual knowledge. Since it was launched, it has been presented as a new strategy to assess
students' competencies directly, defined by an official document as "structural modalities of
intelligence" (Franco and Bonamino, 1999:29). The new test was warmly welcomed by the
Brazilian press and broadly marketed in grey literature, which is difficult to quote. Apparently it
was taken as a strategy not only for a new assessment-based educational reform, but also for
social reform, as it would help poor students pursue a path to higher education and, in addition,
was aimed explicitly at reaching the job market. MEC presented the test as an opportunity for
youngsters to plan their futures with a clear idea of their personal and professional potential, as
the test would allow "assessing their potential in order to plan future choices" (Zákia and
Oliveira, 2003: 884). Even today, the Novo Enem (New ENEM) is officially presented by MEC
as a tool for the democratization of access to public institutions of higher education, which are
free, to promote academic mobility, and to induce changes in high school curricula (MEC, 2013).
ENEM was originally based on five competencies and 21 abilities, aiming at an interdisciplinary
approach, with no mention of specific school disciplines or subjects. The major reform that took
place in 2009 created the Novo ENEM (New ENEM): the number of competencies and abilities
under assessment increased substantially, references to conceptual disciplinary knowledge were
introduced, and the total number of questions grew dramatically. The original 63 multiple-choice
questions (plus an optional written composition), taken in one afternoon, became in the new
version 180 questions (plus a compulsory written composition) spread over two days, with a
tight time schedule allowing three minutes per item. Items are taken as unidimensional, since
Item Response Theory (IRT) is now applied to establish final scores. However, the major features
of item construction seem to be essentially the same: some visual and written context is given,
followed by a stem and five options. Recollection of facts and concepts should rarely be
necessary, at least in the form of conceptual definitions; the essential information to find the right
option is supposedly part of the context given.
Previous research carried out with PISA items, which are also based on a stimulus that tells a
story to which the test items relate more or less directly, categorized items according to their
level of contextualization (Nentwig et al., 2009). Items with a high level of contextualization had
stimulus content that was essential for information extraction and processing, whereas items with
a low level of contextualization had a stimulus that was not essential for answering the question.
In that piece of research both stimulus Content and Relevance were rated on a threefold scale:
items could have substantial information relevant to the item's solution (score 2), or could have
some text or information whose stimulus content was not relevant to the solution (score 1). Items
could also provide little or no information as stimulus (score 0).
The authors provided examples of PISA 2006 items, coded 1, in which the question can be
answered, and exclusively so, by recollecting factual knowledge not related to the stimulus. Their
objective was to carry out further performance studies of selected questions, comparing students
from different countries, in order to understand how well German students could extract and
process information, rather than find the right answer by recollecting factual knowledge.
Data are presented here testing the hypothesis that the stimulus in a group of selected ENEM
questions was actually relevant for student performance in biology. Instead of simply having
judges rate questions on the basis of stimulus Content and Relevance, as done in the cited article,
an additional step was added. Low contextualization questions, corresponding to score 1 of
Nentwig et al. (2009), were selected and presented to students in two forms: full length, with the
original stimulus, and an abridged version, in which the stimulus was removed, leaving just the
stem and options. Scores of the two groups of students are presented, and we discuss methods for
identifying possibly flawed multiple-choice items.
METHODS
A sample of seven questions with a low level of contextualization and clearly related to biology
was selected from the 2009 and 2010 tests (Novo ENEM) and presented in 2011 to two
randomized groups of high school students. One group (n0=233) was asked to answer the original
questions (Full ENEM) in a six-page questionnaire; another group of similar students (n1=200)
was asked to answer the same questions with the written stimulus entirely removed, leaving the
stem and the very same options, in a three-page questionnaire (Abridged ENEM). Another three
questions (standard questions) were included in both sets of questionnaires, focusing on biology
subjects, with exactly the same brief stem and five options, for comparison purposes. As survey
participants were not selected by randomized procedures, these questions would test the general
biology knowledge of the two groups, ascertaining that their proficiency in the field (biology)
was equivalent and that the sample could therefore be considered reliable for the sole purpose of
comparing items. According to quota sampling techniques, the choice of quota controls would
"challenge the quota sampler's ingenuity", as quota variables should be "strongly related to the
survey variables (…) thereby becoming substantially homogeneous". As Leslie Kish states, quota
sampling "is not a standardized scientific method; rather, each one seems an artistic production"
(Kish, 1965: 563); an overview is provided below.
Each research assistant received one set of questionnaires, either short or long, and was responsible
for submitting it to students of one public high school in the city of São Paulo (SP, Brazil).
Fourteen schools were chosen according to the assistants' convenience, as access to schools is
quite difficult, and the test was taken by students within a specific week in mid-September.
Research assistants were not aware of the differences between the two sets of questionnaires. The
invitation letter required by the Ethics Commission of our institution (FEUSP) was part of every
questionnaire and stated that students were invited to collaborate in research about assessment;
they would not be identified on the answer sheet, and the several participating schools would not
be identified or ranked.
School validation relied on a two-level process. Reports of how the questionnaire was presented to
students and answered were analyzed prior to answer processing. Any reported situation that was
not exactly ideal led to the school's exclusion. For instance, when different research assistants
went to the same school, it was excluded from the sample, as students could have become aware
of the different lengths of the questions. We could validate fourteen schools at this level. At
another level of scrutiny, as part of the statistical analysis, school results were studied and a
search for outliers was carried out (see below). One case was found in the group of schools where
abridged questions were presented, and the report of that specific school was reconsidered. The
school has a long record of good performance in large-scale evaluations, but its students now had
very low scores compared with the average of other schools. The score on the standard questions
was 11%, which is surprisingly low for items with five options. The conclusion was that this
specific school is close to the university campus and students were not motivated to perform the
test, as they are quite used to similar university experiments. As they could not recognize the
items as ENEM questions, the task was probably seen as a waste of time. Therefore, that school
was considered an outlier, and the number of students answering abridged questions was
corrected to n=127 (889 items analyzed). The number of students who answered full ENEM
questions was n=233 (1,631 items analyzed), for a total sample of 360 students belonging to 13
schools, with 2,520 ENEM items and 861 standard items analyzed.
ITEM EXAMPLES
The following examples show the two forms of presentation of the selected items. In the full
version, items were reproduced from the beginning, where the question number appears for the
first time. In the abridged version, the stimulus was removed, and the version presented to
students began where the question number appears for the second time in the examples below.
Colors will be discussed below.
7 (full) - The biogeochemical carbon cycle comprises various compartments, including the
Earth, the atmosphere and the oceans, and various processes allowing the transfer of
compounds between these reservoirs. Carbon stocks stored in the form of non-renewable
resources, such as oil, are limited, so it is of great importance to realize the need to replace
fossil fuels with renewable fuels.
7 (abridged) - The use of fossil fuels affects the carbon cycle, as it causes:
a) increase in the percentage of carbon on Earth.
b) reduction in the rate of photosynthesis of higher plants.
c) increased production of carbohydrates produced by plants.
d) increase in the amount of the atmosphere's carbon.
e) reduction of the overall amount of carbon stored in the oceans.
8 (full) - A new method for producing artificial insulin using recombinant DNA technology was
developed by researchers at the Department of Cell Biology, University of Brasília (UnB), in
partnership with the private sector. Researchers genetically modified Escherichia coli
bacteria, which became able to synthesize the hormone. The process allowed the manufacture of
insulin in larger quantities and in only 30 days, one third of the time required to obtain it by the
traditional method, which consists in extracting the hormone from the pancreas of slaughtered
animals.
Ciência Hoje, 24 April 2001. Available at: http://cienciahoje.uol.com.br (adapted).
8 (abridged) - The production of insulin by recombinant DNA technique has, as a consequence:
a) improvement of the process of extracting insulin from porcine pancreas.
b) the selection of antibiotic-resistant microorganisms.
c) progress in the technique of chemical synthesis of hormones.
d) favorable impact on the health of diabetics.
e) creation of transgenic animals.
Distractors' keywords appear in color, associated with related terms in the stimulus. As
Thissen et al. (1989) argued, distractors play an important role in item planning and improve
the options' plausibility. As we will argue later, a long but irrelevant stimulus may improve
the effectiveness of distractors to the point of flawing the whole item.
RESULTS
The total number of answers to the national test questions was 2,520 (Table 1); another 861
answers to standard questions were included in order to test sample homogeneity (Table 2), for a
total of 3,381 questions answered and processed.
Statistical analysis included parametric tests and a search for outliers. One school (EEI1FB,
n=73) fell into this category, as previously mentioned, and was excluded from the sample.
Fisher's Exact Test for the ENEM questions (Table 1) reported no statistically significant
differences between the groups on four questions (Q1, p-value = 0.906; Q5, p-value = 0.077; Q9,
p-value = 0.901; Q10, p-value = 0.152), and statistically significant differences on three
questions, in favor of the abridged versions (Q3, p-value = 0.006; Q7, p-value < 0.001; and Q8,
p-value < 0.001). Results of the same statistical analyses for the three standard questions (Table
2) confirmed the homogeneity of the two groups (Q2, p-value = 0.787; Q4, p-value = 0.116; and
Q6, p-value = 0.140).
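For reference, the group comparison behind these p-values can be reproduced with a self-contained sketch of Fisher's Exact Test on a 2×2 table of right/wrong counts per group. The function below is a minimal illustration, not the software used in the study:

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher Exact Test p-value for a 2x2 table [[a, b], [c, d]],
    e.g. [[right_full, wrong_full], [right_abridged, wrong_abridged]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1, n = a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # P(top-left cell = x) under the hypergeometric distribution
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # the small tolerance guards against floating-point ties
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))
```

Applied, for instance, to the Q1 counts in Table 1 (108 right out of 233 in the full group versus 62 out of 127 in the abridged group), it returns a clearly non-significant p-value, in line with the 0.906 reported above.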
Table 01
Right answers per school on ENEM questions Q1, Q8, Q9 and Q10 (F.E.T. = Fisher Exact Test)

Full ENEM group:
N   School    n0    Q1    Q8    Q9    Q10
1   EEA0HD    15     1     2     3     3
2   EEB0NL    52    18    18    10    12
3   EEC0SB    32    15    16    10    21
4   EED0XS    13     7     6     7     8
5   EME0EA    32    12     4    17     7
6   EEF0PM    34    21    19     3    17
7   EEG0GC    13     7     3     9    11
8   EEH0BT    42    27     9    21    20
    Total    233   108    77    80    99
                   46%   30%   34%   42%

Abridged ENEM group:
N   School    n1    Q1    Q8    Q9    Q10
9   EEK1BM    43    14    29    13    15
10  ETL1HV    25    10     7    11    18
11  EEM1BC    19    14     5    14     6
12  EEN1MS    18     4    10     3    11
13  EEO1HF    22    20    18     4    14
    Total    127    62    69    45    64
                   49%   54%   35%   50%
F.E.T. p-value   0.906  <0.001  0.901  0.152
Table 02
Right answers per school on standard question Q6 (F.E.T. = Fisher Exact Test)

Full ENEM group:
N      Q6
1       2
2       4
3       7
4       5
5       7
6       7
7       4
8      10
Total  46 (20%)

Abridged ENEM group:
N      Q6
9       4
10     10
11      6
12      1
13     12
Total  33 (26%)
F.E.T. p-value 0.140
Table 1 presents the results of the two groups on the experimental questions. Considering this
group of low contextualization items, the hypothesis that the stimulus is relevant to student
performance found no support, confirming the previous categorization. An even more surprising
result was found: comparing the answers of the two groups to the ENEM questions, it is possible
to state that questions Q3, Q7 and Q8 allowed a statistically significantly higher student
performance when they brought no stimulus, showing a phenomenon we named reverse induced
performance (rip). In other words, skipping the stimulus gave students, for this group of
questions, either the same or an even better probability of a good performance.
A further analysis was performed with linguistic tools, looking for causal explanations of these
surprising results. The group of students who answered items with no stimulus went directly to
the stem line and was not influenced by the text presented to the other group. These texts had
keywords, such as oil and insulin, which were also inadvertently referred to by their superordinate
words (fossil fuels and hormones), demanding previous knowledge for full understanding.
In the item examples given, question 7 brings a text with poor information on the topic of the
carbon cycle and lacks cohesion, also touching on global warming issues. The item stem explores
students' previous knowledge of a specific topic (the effect of fossil fuels on the atmosphere).
Without previous knowledge, students under pressure from the tight time schedule would read
the options directly, looking for similarities between keywords found there and in the text. Three
carbon reservoirs are mentioned in the text, and they appear in three different options. The
stimulus would drive students' attention to these three options, whereas without it students face a
different situation and these become weak distractors. Fuel is a keyword in the stem, which easily
connects to the idea of combustion and smoke; the closest keyword is atmosphere, which is found
in the right answer. Therefore, the text's lack of cohesion could lead students to skip the stimulus
and concentrate on the stem, raising the probability of success, including for reasons other than
those originally intended. This trajectory could explain the observed rip.
The other example is even clearer, as question 8 was presented above with its keywords colored,
as were their related terms in the options and item stem. Apparently, students have to apply
information given in the text, as the stem is full of keywords such as insulin. The stimulus brings
keywords which appear (or have correlated ideas) in four distractors. The only option with no
connection to the stimulus, as it mentions diabetics, is the right one. Students should recollect
facts about hormones and insulin and know something about the related diseases, as the stimulus
evidently lacks cohesion with respect to the context of the right answer, related to the treatment
of diabetics. Students who read the stimulus would be bound to focus their attention on the four
distractors. There is a clear lack of cohesion, as the stimulus does not mention any disease; we
called this strategy of diverting students' attention by changing the subject, making the stimulus
irrelevant to the answer, bafflement, and it tends to amplify rip. In fact, this was the question
with the greatest difference between the two groups (Table 1). In real action, students could skip
the stimulus and would not be misled into concentrating their attention on the wrong options;
with previous knowledge about insulin and the related disease, the answer would be easily found.
Therefore, it is possible to understand the observed rip as a consequence of this bafflement
strategy.
The four items in which no statistical difference between the abridged and full questions was
found also deserve analysis, as students who skipped the stimulus and went directly to the options
were as successful as those who did all the reading. However, as they have only three minutes for
each question, skipping students would have saved precious time for other questions, raising the
probability of a higher final score in real action.
These results show that low contextualization ENEM questions focusing on biology do rely on
students' previous knowledge and are not objective indicators of the alleged structural modalities
of intelligence. Moreover, such items actually favor students with good reading and time
management skills rather than a balanced amount of biological knowledge and thinking skills.
DISCUSSION
Results show that low contextualization items (Nentwig et al., 2009) deserve more attention in
future research. If presented with a stimulus that demands a considerable length of time to be
read and understood (score 1 in the cited article), they are actually context deficient (cont-def).
These items allow at least two different paths to the right answer, which creates a considerable
problem for determining their degree of difficulty, with serious implications for Item Response
Theory. Contrary to direct items with no context (score 0 in the cited article), or items with
actually relevant information in the stimulus (score 2 in the cited article), cont-def questions not
only allow a similar probability of success with or without the stimulus, as seen in questions 1, 5,
9 and 10 (Table 1), but may also make the question even more difficult. As seen in questions 3, 7
and 8 (Table 1), scores of students who received the stimulus were significantly lower, showing a
new phenomenon, which we called reverse induced performance (rip).
This new phenomenon should be examined carefully during item pre-testing, as it has a profound
effect on the determination of the degree of difficulty. The validity of Item Response Theory
requires unidimensional items; therefore, items must be rip-free. This piece of research offers a
practical approach to performing such testing, with two randomized groups of students, one of
them receiving abridged items suspected of being cont-def.
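As an illustration only, the pre-testing screen just described can be sketched as follows. The one-sided two-proportion z-test is our simplification for the sketch (the study itself used Fisher's Exact Test), and the counts in the comments come from Table 1:

```python
from math import erf, sqrt

def shows_rip(right_full, n_full, right_abridged, n_abridged, alpha=0.05):
    """Flag 'reverse induced performance': the abridged (no-stimulus) group
    scores significantly HIGHER than the full-stimulus group.

    Screening approximation based on a one-sided two-proportion z-test.
    """
    p1 = right_full / n_full              # success rate with stimulus
    p2 = right_abridged / n_abridged      # success rate without stimulus
    pooled = (right_full + right_abridged) / (n_full + n_abridged)
    se = sqrt(pooled * (1 - pooled) * (1 / n_full + 1 / n_abridged))
    z = (p2 - p1) / se
    p_value = 0.5 * (1 - erf(z / sqrt(2)))   # one-sided upper-tail p-value
    return p_value < alpha

# Q8 (Table 1): 77/233 with stimulus vs 69/127 without -> rip flagged
# Q9 (Table 1): 80/233 with stimulus vs 45/127 without -> no rip
```

An item flagged by such a screen would then warrant closer linguistic analysis of its stimulus before being treated as unidimensional in an IRT calibration.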
This research sheds new light on a long-known fact related to the commercially successful,
privately owned ENEM training courses, which have been active at least since 2003 (Zákia and
Oliveira, 2003: 885). There was a suspicion that they were useless, as students would receive all,
or almost all, the information needed in the items' context stimulus; therefore, training would be
of no help in raising students' performance in ENEM. All seven experimental questions produced
results that can explain the need for specific student training to get higher scores in that exam.
Within a very tight time schedule, students may be trained not only to extract and process
information given in the stimulus, but also, and mainly, to select and discard information which is
not relevant to identifying the right option, or even to lower the distractors' efficiency.
The 2009 reform turned the Novo ENEM not only into an instrument to select students for public
universities, but also into a tool aimed at monitoring education quality on a nationwide basis. The
democratization of higher education access provided by ENEM (if any) may be due to sudden
changes and would tend to disappear, as time management skills are apprehended differently by
students across the socioeconomic spectrum.
The proposal to turn ENEM into a compulsory state exam is currently under discussion. Our
results suggest that assessment-based educational reform and education quality monitoring based
on this instrument should be considered with caution. Further research is necessary, encompassing
other content areas, in order to get a clearer idea of the real impact of low contextualization items
in large-scale exams such as ENEM.
ACKNOWLEDGEMENTS
The authors want to express their gratitude to the following persons: Alessandra Lupi, Alessandra
Stranieri, Alessandra Ramin, Andréia Vieira, Ariana Carmona, Bianca Dazzani, Bruna Lourenço,
Bruno Vieira, Carolina Bueno, Cristiano J. da Silva, Débora Brandt, Fernando Sábio, Giselle
Armando, Guilherme Antar, Guilherme Stagni, Helenadja Mota, Henrique Neves, João Ferreira,
Karina Tisovec, Laisa Lorenti, Mariana Rosim, Marina Medeiros, Natacha Lodo, Pedro Machado,
Priscylla Arruda, Rafael Ogawa, Renato Rego, Rodrigo Gonçalves, Rodrigo Dioz, Samara Moreira,
Talita Oliveira, Thaís de Melo, Thales Hurtado, Thiago Madrigrano, Vitor Lee. The following
institutions provided funds for the several parts of the research: CNPq, FAPERGS, FAPESP, FEUSP
and Pró-Reitoria de Pesquisa da USP.
REFERENCES
Franco, C. & Bonamino, A. (1999). O ENEM no contexto das políticas para o ensino médio.
Química Nova na Escola, 10, 26-31.
Kish. L. (1965). Survey Sampling. New York: Wiley & Sons, Inc.
Ministrio da Educao (1999). Exame Nacional do Ensino Mdio ENEM: documento bsico
2000. MEC/INEP.
___________(2013). Sobre o ENEM. Available at http://portal.inep.gov.br/web/enem/sobre-oenem [access on Dic 15 2013].
Nentwig, P., Roennebeck, S., Schoeps, K., Rumann, S. and Carstensen, C. (2009). Performance
and levels of contextualization in a selection of OECD countries in PISA 2006. Journal of
Research in Science Teaching, 46 (8), 897-908.
Orlandi, E. P. (2012). Discurso e leitura. So Paulo: Cortz.
Thiessen, D. L. Steinbeck and A.R. Fitzpatrick (1989). Multiple-choice items: the distractors are
also part of the item. Journal of Educational Measurement 26 (2),161-176.
Zkia, S.; R. P. Oliveira (2003). Polticas de avaliao da educao e quase mercado no Brasil.
Educao e Sociedade 24 (84), 873-895.
INTRODUCTION
The dropout rate of German bachelor students in chemistry amounts to 43 % (Heublein, Richter,
Schmelzer & Sommer, 2012). The main reasons for leaving university are performance difficulties
and a lack of motivation (Heublein, Hutzsch, Schreiber, Sommer & Besuch, 2010), which are in
particular caused by false expectations about the studies (Heublein, Spangenberg & Sommer, 2003).
The highest dropout rates in the natural sciences occur during the first semesters at university
(Heublein et al., 2003), which can be attributed to the large number of challenges freshmen are
confronted with (Gerdes & Mallinckrodt, 1994).
Other European countries also show excessively high dropout rates: Ulriksen, Møller Madsen and
Holmegaard (2010) state that around one third of students abandon their studies before the
scheduled time. Low student success is an important issue at American universities as well, where
only approximately 70 % of freshmen in chemistry pass the final exam in general chemistry at
the end of the first semester (Legg, Legg & Greenbowe, 2001; McFate & Olmsted, 1999).
In order to gain a better understanding of student success, it is necessary to find factors that can
predict success right from the start of the studies. Many attempts have been made to determine the
factors leading to student success, dating back as far as 1921 (Powers, 1921), and several predictive
variables have been identified. Only few studies, however, deal with the prediction of student
success in chemistry at German universities. This project aims at filling that gap. Whereas many
studies have tested a great variety of cognitive and non-cognitive variables, only few have also
taken the interactions between the predictors into account. The great advantage of doing so is a
deeper insight into the connections between the variables that predict student success, and thus a
better understanding of how success can be achieved. Identifying the variables leading to success,
and the way those variables interact, creates a starting point for increasing student success and
decreasing dropout.
Here, student success is defined in two ways: once as the score in the final chemistry examination
and once as the score in a chemistry knowledge test, both of which took place at the end of the
first semester (the examination two to three weeks after the test). Whereas the chemistry exam
represents a criterion that is highly relevant for the students, the chemistry test had no relevance
for them at all; it was, however, the same for all participating students and is therefore an
objective measure of student success.
For the prediction, a regression model was built on the basis of Schiefele, Krapp and Winteler
(1992), who state that three components are usually used in predicting academic achievement:
general cognitive factors, general motivational factors, and interest.
The cognitive factors were split into two parts. As the first predictor, prior knowledge was
included in the model, because prior knowledge is the precondition for accumulating further
knowledge (Schneider, Körkel & Weinert, 1990) and therefore also the basis for the learning
growth that is finally measured at the end of the first semester. Prior knowledge is the
domain-specific part of the cognitive factors, and its predictive strength increases with rising
consistency of test and study content (Heine, Briedis, Didi, Haase & Trost, 2006). Then, as the
rather domain-unspecific part of the cognitive factors, two variables were added to the regression
model: the grade from the secondary school graduation certificate (Abitur) and the ability in
deductive thinking. The highest predictive strength as a single predictor has been attributed to the
Abitur grade. Abilities in deductive thinking are seen as a central factor in measuring cognitive
abilities, which show medium or satisfactory predictive strength (Heine et al., 2006). As
motivational factor, the predictor desired subject is used. This variable contains one item that asks
the students whether they would rather study a different subject than the chosen one. It has been
shown that students who are satisfied with their subject achieve better exam results than students
who are unsatisfied with it (Ohlsen, 1985). The positive effect of student satisfaction can be
attributed to the fact that satisfied students are more likely to initiate learning activities, learn
more successfully, and can maintain self-regulated learning, which is essential for (self-)study
(Voss, 2007). The next part of the model is subject interest. Here, the evidence is ambiguous:
some studies show no predictive strength (Gold & Souvignier, 2005), whereas others find it
meaningful for student success (Fellenberg & Hannover, 2006; Giesen, Gold, Hummer & Jansen,
1986). Possibly, the positive effect of interest on student success evolves only after a longer time
of studying (Krapp, 1997). At this point, the model according to Schiefele, Krapp and Winteler
(1992) is complete; but since students from different universities and different courses of study
were surveyed, study conditions are included in the model as a further variable. It is well known
that different study structures and requirements also influence student success (Krempkow, 2008;
Preuss-Lausitz & Sommerkorn, 1968).
STUDY DESIGN
In the winter semester 2011/12, freshmen in chemistry from different German universities were
surveyed at the very beginning of the semester (pre-test) and at the end of the semester (post-test)
on the knowledge, abilities, and attitudes they bring to university.
Participants by university and course of study:

University     Chemistry   Chem. Educ.   Total
HU Berlin          61           21          82
Uni DuE            36           26          62
LMU Munich         80           12          92
Total             177           59         236
Female students (40 %) are slightly underrepresented in the sample compared to male students
(60 %). There are also some differences concerning the students' year of receiving their secondary
school graduation certificate (Abitur): 68 % took their final exam just before starting their studies
in 2011, 20 % one year earlier in 2010, and 11 % earlier than that.
chemical topics and matters is important to me." For responding to the questionnaire on subject
interest, a four-point Likert scale is given for specifying one's level of agreement. For the
variable desired subject, the following item was used: "I would rather study a different subject."
Students could answer with yes or no. The grade from the secondary school graduation certificate
(Abitur), the course of study, and the university were each asked for with one item. For the
operationalization of the study conditions, six dummy variables were created: chemistry students
from Berlin, education students from Berlin, chemistry students from Essen, education students
from Essen, chemistry students from Munich, and education students from Munich. The value 1
was given to the respective student cohort, whereas 0 stands for all other students. Chemistry
students from Munich were taken as the reference group in the regression analyses since they
represent the largest group. Validity and reliability of all questionnaires and tests were established
in a pilot study conducted one year earlier, in the winter semester 2010/11. Additionally, the
scores in the final chemistry exam at the end of the first semester were gathered. Since the
students from the different courses of study and universities wrote different exams, the scores
were z-standardized for the regression analyses. The predictors were added blockwise, using the
enter method, to the regression model. Moderation analyses were used to identify interactions
between the above-mentioned predictors. To keep the variables equally weighted in the
interaction terms, all variables were centered, meaning that the mean was set to zero while the
variance was kept (Cohen, Cohen, West & Aiken, 2003).
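The centering and z-standardization steps just described can be sketched as follows. This is a minimal illustration with made-up numbers, not the study's data; the variable names are assumptions:

```python
import numpy as np

def center(x):
    """Mean-center a variable: subtract the mean; the variance is kept."""
    return x - x.mean()

def z_standardize(x):
    """z-standardize a variable: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

# Illustrative values for two predictors (hypothetical, five students).
prior_knowledge = np.array([12.0, 15.0, 9.0, 20.0, 14.0])
abitur_grade = np.array([1.3, 2.0, 2.7, 1.0, 1.7])

pk_c = center(prior_knowledge)
grade_c = center(abitur_grade)

# Interaction term built from the centered variables, so that both predictors
# enter the product equally weighted (Cohen et al., 2003).
interaction = pk_c * grade_c
```

The z-standardization corresponds to what was done with the exam scores, which came from different exams and are therefore comparable only after standardization.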
Regression results for the knowledge test and the final exam (β, t, p):

                             Knowledge test                Final exam
                             β        t        p           β        t        p
Prior knowledge              .288     4.713    <.001       .169     2.633    .009
Cognitive abilities
  Grade                     -.250    -4.170    <.001      -.398    -6.307    <.001
  Deductive thinking          ---      ---      ---        .148     2.385    .018
Desired subject             -.111    -1.980    .049       -.173    -2.942    .004
Subject interest              ---      ---      ---         ---      ---      ---
Study conditions¹
  HU Berlin Chemistry       -.138    -2.087    .038         ---      ---      ---
  HU Berlin Chem. Educ.     -.136    -2.730    .007         ---      ---      ---
  Uni DuE Chemistry           ---      ---      ---         ---      ---      ---
  Uni DuE Chem. Educ.       -.300    -4.985    <.001        ---      ---      ---
  LMU München Chemistry     -.188    -3.272    .001         ---      ---      ---
R²/% [incl. study cond.]    25.6 % [34.9 %]                26.4 % [27.8 %]

¹ Dummy-coded
It can be seen that prior knowledge has a stronger effect on the score in the knowledge test
(β = .288) than on the outcome in the exam (β = .169). This can be attributed to the fact that the
chemistry knowledge test was identical in pre- and post-test, whereas the exam asked for a
broader field of knowledge. On the other hand, the cognitive abilities show a stronger effect on
the exam score (grade: β(exam) = -.398 vs. β(test) = -.250; deductive thinking: β(exam) = .148 vs.
β(test) = n. s.). This finding could be due to the fact that the tasks in the exam were more complex
than those in the knowledge test and therefore made higher cognitive demands on the students.
Furthermore, the students were possibly more engaged in solving the exam than the test, since the
exam was much more relevant for them. This could also be a reason why the variable desired
subject shows a stronger influence on the exam (β = -.173) than on the knowledge test
(β = -.111). An experimental study confirmed that the relevance of a test score and the students'
motivation when working on it correlate positively with the students' score in the test (Liu,
Bridgeman & Adler, 2012). Subject interest shows an effect neither on the test score nor on the
exam score.
With the aforementioned predictors, each of the two regression models explains approximately
one fourth of the variance of the score in the exam and the test, respectively. When study
conditions are added, the amount of explained variance rises to 35 % for the chemistry test and
stays approximately the same for the exam. Study conditions thus show a substantial effect only
on achievement in the knowledge test, whereas there is no significant effect at all on the exam
score. This result indicates that study conditions do have an influence on the students' outcome.
The absence of an effect of study conditions on success in the final exam shows that the exams
the students wrote at the three universities differ from each other and presumably already
incorporate the study conditions. Additional analyses (data not shown here) reveal that it is not
the content that makes the exams different; rather, they differ in difficulty. Which variables cause
the different levels of difficulty is not clear at the moment.
Concerning study conditions, similar results emerge from the moderation analyses. Whereas for
the exam score no interactions between the predictors can be found that are independent of course
of study and university, several interactions appear when predicting achievement in the
knowledge test (see Figure 1). This finding also shows that the exams are too different to yield
interactions that are valid for all students, independent of their university and course of study.
The left part of Figure 1 shows that the influence of prior knowledge on achievement is
significantly higher for students with a poor Abitur grade; that is, the better the grade, the less
important prior knowledge is for student success in the chemistry knowledge test. The right part
of the figure shows that the positive influence of prior knowledge on achievement is significantly
higher for students with a high ability in deductive thinking. This could be due to the fact that
such students are better able to apply their prior knowledge.
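A moderation analysis of this kind can be probed with ordinary least squares and simple slopes. The sketch below uses simulated data for illustration only; the variable names and effect sizes are assumptions, not the study's results:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated, centered predictor and moderator (illustration only).
prior = rng.normal(size=n)    # prior knowledge
deduct = rng.normal(size=n)   # deductive-thinking ability
# Outcome in which prior knowledge matters more for strong deductive thinkers.
score = (0.3 * prior + 0.1 * deduct + 0.2 * prior * deduct
         + rng.normal(scale=0.5, size=n))

# Fit score = b0 + b1*prior + b2*deduct + b3*(prior*deduct) by OLS.
X = np.column_stack([np.ones(n), prior, deduct, prior * deduct])
b, *_ = np.linalg.lstsq(X, score, rcond=None)

# Simple slopes of prior knowledge at +/- 1 SD of the moderator: a positive
# b3 means prior knowledge pays off more at high deductive-thinking ability.
sd = deduct.std()
slope_high = b[1] + b[3] * sd
slope_low = b[1] - b[3] * sd
```

Comparing the two simple slopes is one common way to interpret an interaction such as the one between prior knowledge and deductive thinking described above.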
SUMMARY
In the frame of this study, a theory-based regression model for the prediction of student success
in chemistry was applied. The results show that prior knowledge, cognitive abilities, and the
desired subject are significant predictors of performance at the end of the first semester in
chemistry, whereas subject interest plays only a minor role. Student success was measured in two
ways: once through the score in the chemistry exam (a criterion highly relevant for the students)
and once through the score in an objective chemistry test which, in contrast to the exam, was the
same for all participating students.

All in all, the regression model explains about one fourth of the variance of the students'
performance at the end of the first semester. Additionally, study conditions, operationalized
through the students' affiliation to their course of study and university, show a clear influence on
the test outcome and raise the proportion of explained variance to 35 %, whereas this effect is
missing for the exam. This result indicates that the exam is itself part of the study conditions and
that the exams at the different universities and in the different courses of study must differ as
well. Further analyses show that it is not the content but the different difficulty of the exams that
contributes to this finding.
REFERENCES
Cohen, J., Cohen, P., West, S. G. & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Fellenberg, F. & Hannover, B. (2006). Kaum begonnen, schon zerronnen? Psychologische
Ursachenfaktoren für die Neigung von Studienanfängern, das Studium abzubrechen oder das
Fach zu wechseln [Easy come, easy go? Psychological causes of students dropping out of
university or changing the subject at the beginning of their studies]. Zeitschrift zu Theorie und
Praxis erziehungswissenschaftlicher Forschung, 20, 381-399.
Gerdes, H. & Mallinckrodt, B. (1994). Emotional, social and academic adjustment of college
students: A longitudinal study of retention. Journal of Counseling & Development, 72, 281-288.
Giesen, H., Gold, A., Hummer, A. & Jansen, R. (1986). Prognose des Studienerfolgs: Ergebnisse
aus Längsschnittuntersuchungen [Prognosis of student success: Results from longitudinal
studies]. Frankfurt a. M.: Institut für Pädagogische Psychologie.
Gold, A. & Souvignier, E. (2005). Prognose der Studierfähigkeit: Ergebnisse aus
Längsschnittanalysen [Prognosis of college outcomes: Results from longitudinal studies].
Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 37, 214-222.
Heine, C., Briedis, K., Didi, H.-J., Haase, K. & Trost, G. (2006). Auswahl- und
Eignungsfeststellungsverfahren beim Hochschulzugang in Deutschland und ausgewählten
Ländern: Eine Bestandsaufnahme [Placement tests in Germany and other selected countries:
A review]. Hannover: HIS.
Heublein, U., Hutzsch, C., Schreiber, J., Sommer, D. & Besuch, G. (2010). Ursachen des
Studienabbruchs in Bachelor- und herkömmlichen Studiengängen [Reasons for dropout in
Bachelor and conventional courses of study]. Hannover: HIS.
Heublein, U., Richter, J., Schmelzer, R. & Sommer, D. (2012). Die Entwicklung der Schwund-
und Studienabbruchquoten an den deutschen Hochschulen [The development of dropout rates
at German universities]. Hannover: HIS.
Heublein, U., Spangenberg, H. & Sommer, D. (2003). Ursachen des Studienabbruchs [Reasons
for dropout]. Hannover: HIS.
Krapp, A. (1997). Interesse und Studium [Interest and studies]. In: H. Gruber & A. Renkl (eds.), Wege
zum Können: Determinanten des Kompetenzerwerbs (pp. 45-58). Bern: Verlag Hans Huber.
Krempkow, R. (2008). Studienerfolg, Studienqualität und Studierfähigkeit: Eine Analyse zu
Determinanten des Studienerfolgs in 150 sächsischen Studiengängen [Student success, study
quality and study ability: An analysis of determinants of student success in 150 courses of
study in Saxony]. Die Hochschule, 91-107.
Legg, M. J., Legg, J. C. & Greenbowe, T. J. (2001). Analysis of success in general chemistry
based on diagnostic testing using logistic regression. Journal of Chemical Education, 78,
1117-1121.
Liu, O. L., Bridgeman, B. & Adler, R. M. (2012). Measuring learning outcomes in higher
education: motivation matters. Educational Researcher, 41, 352-362.
McFate, C. & Olmsted, J. (1999). Assessing student preparation through placement tests. Journal
of Chemical Education, 76, 562-565.
Ohlsen, U. (1985). Eine empirische Untersuchung der Einflußgrößen des Examenserfolgs für
Absolventen wirtschaftswissenschaftlicher Studiengänge der Universität Münster [An
empirical survey of determining factors of examination success for graduates of economics
courses of study at Münster University]. Frankfurt a. M.: Lang.
Powers, S. R. (1921). The achievement of high school and freshmen college students in
chemistry. School Science and Mathematics, 21, 366-377.
Preuss-Lausitz, U. & Sommerkorn, I. N. (1968). Zur Situation von Studienanfängern [On the
situation of freshmen]. Zeitschrift für Erziehung und Gesellschaft, 8, 434-453.
Schiefele, U., Krapp, A., Wild, K. & Winteler, A. (1993). Der Fragebogen zum
Studieninteresse (FSI) [The questionnaire on study interest (FSI)]. Diagnostica, 39, 335-351.
Schiefele, U., Krapp, A. & Winteler, A. (1992). Interest as a Predictor of Academic
Achievement: A Meta-Analysis of Research. In: K. A. Renninger, S. Hidi & A. Krapp (eds.),
The Role of Interest in Learning and Development (pp. 183-212). Hillsdale: LEA.
Schneider, W., Körkel, J. & Weinert, F. E. (1990). Expert knowledge, general abilities, and
text processing. In: W. Schneider & F. E. Weinert (eds.), Interactions among aptitudes,
strategies, and knowledge in cognitive performance (pp. 235-251). New York: Springer.
Ulriksen, L., Møller Madsen, L. & Holmegaard, H. (2010). What do we know about the
explanations for drop out/opt out among young people from STM higher education
programmes? Studies in Science Education, 46, 209-244.
Voss, R. (2007). Studienzufriedenheit: Analyse der Erwartungen von Studierenden [Student
satisfaction: analysis of student expectations]. Reihe: Wissenschafts- und
Hochschulmanagement, Band 9. Lohmar: Eul Verlag.
Wilhelm, O., Schroeders, U. & Schipolowski, S. (2009). BEFKI. Berliner Test zur Erfassung
fluider und kristalliner Intelligenz [Berlin test of fluid and crystallized intelligence].
Unpublished.
THEORETICAL BACKGROUND
Research findings suggest that there are three categories of student difficulties in basic
electricity: inability to apply formal concepts to electric circuits, inability to use and
interpret formal representations of an electric circuit, and inability to argue qualitatively
about the behavior of an electric circuit (McDermott & Shaffer, 1992). Misconceptions
are strongly held and stable cognitive structures, which differ from expert conceptions and
affect how students understand scientific explanations (Hammer, 1996). Students may
hold various preconceived misconceptions about electricity, which stand in the way of
learning. The two most resistant obstacles seem to be viewing the battery as a source of
constant current and failing to consider a circuit as a system (Dupin & Johsua, 1987).
Closset (1983) introduced the term sequential reasoning, which appears to be widespread
among students (Shipstone, 1984). There is some evidence that sequential reasoning is at
least partially developed at school (Shipstone, 1988) and reinforced by the teacher
(Sebastia, 1993). Using the metaphor of a fluid in motion (Pesman & Eryilmaz, 2010),
and highlighting that electricity leaves the battery at one terminal and turns on the
different components in the circuit successively, does not support students in viewing a
circuit as a system (Brna, 1988). On the contrary, this linear and temporal processing
prevents students from making functional connections between the elements of a circuit
and from viewing the circuit structure as a unified system (Heller & Finley, 1992).
Surprisingly, research findings do not indicate a different development of sequential
reasoning according to age and teaching level (Riley et al., 1981). Similar conceptions are
also held by adults and some teachers (Bilal & Erol, 2009).
Therefore, there is a need for diagnostic instruments that provide information about
students' preconceptions and also allow evaluating the physics classroom. Different
approaches have been taken to identify and measure students' misconceptions about
electricity. In contrast to interviews, diagnostic multiple-choice tests can be scored
immediately and applied to a large number of subjects. Pesman and Eryilmaz (2010) used
the three-tier test methodology to develop the SECDT (Simple Electric Circuits
Diagnostic Test). As ordinary one-tier multiple-choice tests were strongly criticized for
overestimating students' right as well as wrong answers, two- and three-tier tests were
developed. Starting from an ordinary multiple-choice question in the first tier, students
are asked about their reasoning in the second tier and estimate their confidence about
their answers in the third tier. In view of the lack of instruments for testing the electricity
concepts of students at grade 7 that are suitable for the Austrian physics curriculum, the
author had already developed a diagnostic instrument with some two-tier items for
assessing students' conceptual understanding, as well as its potential use in evaluating
curricula and innovative approaches in physics education (Urban-Woldron & Hopf, 2012).
circuit. In consequence, they often demonstrate local reasoning by focusing their attention
only on one specific point in the circuit and ignoring what happens elsewhere.
Additionally, students show sequential reasoning, believing that when a dynamic change
takes place in a circuit, only elements coming after the point of change are affected. The
focus of the study is on revealing students' ideas and explanations of how electricity
moves in a simple circuit containing lamps and resistors connected in series, and on the
way they justify their answers.

To gain a correct picture of student understanding, it is crucial to learn what students
actually do not know and what kinds of alternative conceptions they hold. Therefore, the
students' wrong answers and the associated explanations are much more interesting and
usable for the researcher than the correct answers. Consequently, the context of this study
is an extension of an already existing instrument for testing the electricity concepts of
grade-7 students in two specific directions: first, to develop items for detecting sequential
reasoning, and second, to distinguish between misconceptions and lack of knowledge.
The following two broad research questions were addressed:
1. Do misconceptions related to understanding an electric circuit as a system depend on
gender, level of education (respectively age), and/or the teacher?
2. Can a three-tier multiple-choice test be developed that is reliable and valid and
uncovers students' misconceptions related to grasping an electric circuit as a system?
METHOD
In order to develop a reliable tool to identify students' misconceptions related to
understanding an electric circuit as a system, the author first conducted interviews based
on a literature review, both more structured ones and interviews with open-ended
questions. In an initial stage, a questionnaire with 10 two-tier items (a question plus a
follow-up question; an example is provided in Figure 2) was developed.
In a first round of evaluation with 10 teachers and 113 students (grade 8; 58 female,
55 male), the questionnaire was reduced to 7 items, which were extended by a third tier
asking for the students' confidence when answering each question. After a test run with
339 students from grade 7 to grade 12 from secondary schools across Austria after formal
instruction (183 female, 156 male; mean age 14.7 years, standard deviation 1.7 years),
the results were evaluated with the software packages SPSS and AMOS. In a polishing
round, additional interviews were used to optimize the test items. To obtain the score for
a two-tier item, a value of 1 was assigned when both responses were correct.
Furthermore, by examining specific combinations of answers, other relevant variables
were calculated to address students' misconceptions.
Data Sources
Firstly, based on preliminary results gained from interviews with open-ended questions, a
questionnaire with ten two-tier items was developed and piloted with 113 students,
accompanied by clarifying interviews to gain deeper insight into the students' cognitive
structures and reasoning. Four of those ten items finally constituted the test instrument
used in the present study, assessing students' understanding of the systemic character of a
simple electric circuit with three-tier items. In the following, the author presents a
three-tier item (see Figure 2) asking questions about very simple electric circuits; as we
will see, there is ample space for misconceptions despite their simplicity.
It has to be added here that the provided answers were not thought up by the researcher
but are based both on the literature and on actual experiences with students. The first tier
is a conventional multiple-choice question, and the second tier presents a reason for the
answer given in the first tier. Additionally, to distinguish between misconceptions and
lack of knowledge, a third tier is implemented to examine how confident students are
about their answers.
(Figure 2: sample three-tier item; the third tier offers the options highly certain / rather
certain / rather uncertain / highly uncertain.)
Data Analysis
Descriptive analyses, analyses of variance, confirmatory factor analyses, and regression
analyses were conducted using the software packages SPSS and AMOS.
RESULTS
Obviously, the correct answer for item A (see Figure 2) is a1 combined with b3. 108
students, i.e. 33.4%, provided a correct answer to the first two tiers of item A. A closer
look at the numbers in Table 1 shows that 51.7%, or 167 students, answered the first tier
correctly, but 59 of these 167 students (35.3%) provided a wrong reason. Consequently,
more than one third of the students responding correctly on the first tier can be counted
as so-called false positives. On the other hand, 153 students chose the right option for the
explanation, but only 70.6% of these students also gave a correct answer on the first tier.
Therefore, we critically overestimate students' knowledge if we look at only one tier.
Overall, 30 students are highly certain, 105 rather certain, 88 rather uncertain, and 100
highly uncertain about their answers. 37% of the highly certain students and 26% of the
rather certain ones give the correct answer for the first and the second tier, whereas only
8% of the highly uncertain students answer this item correctly. Table 1 gives an overview
of the three answer options a1, a2, and a3 and the three associated alternatives b1, b2, and
b3 for the reasoning.
Table 1
Distribution of answers and reasons for item A
are rather or highly uncertain about their answers may point to the assumption that they
simply guessed their answers.
Table 2
Possible combinations a_x b_y
Table 2 shows all nine possible combinations of the answers a1, a2, or a3 with the
explanations b1, b2, or b3 within the red-framed rectangle. Four of these nine
combinations can be attributed to specific physics concepts: a1b3 indicates a correct
answer with correct reasoning, a1b2 stands for a correct answer with sequential
reasoning, a2b2 hints at sequential reasoning, and a2b1 indicates that the battery is
viewed as a source of constant current.
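The mapping from answer combinations to concepts, together with the third-tier confidence, can be sketched as a small scoring routine. The function and label names are illustrative, not part of the original instrument:

```python
# Combinations a_x b_y that can be attributed to specific concepts (item A).
CONCEPTS = {
    ("a1", "b3"): "correct answer and correct reasoning",
    ("a1", "b2"): "correct answer but sequential reasoning",
    ("a2", "b2"): "sequential reasoning",
    ("a2", "b1"): "battery viewed as source of constant current",
}

CERTAIN = {"highly certain", "rather certain"}

def classify(answer, reason, confidence):
    """Score one three-tier item: 1 point only when both of the first two
    tiers are correct; an uncertain third tier hints at guessing (lack of
    knowledge) rather than a stable misconception."""
    score = 1 if (answer, reason) == ("a1", "b3") else 0
    concept = CONCEPTS.get((answer, reason), "other")
    stable = confidence in CERTAIN
    return score, concept, stable

# A highly certain a2/b2 response signals a stable sequential-reasoning
# misconception, not a lack of knowledge.
print(classify("a2", "b2", "highly certain"))  # prints (0, 'sequential reasoning', True)
```

Combinations outside the four listed ones fall into the "other" category, matching the treatment of unattributable answer patterns described below.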
For item A (see Figure 2), 17.7% think sequentially and 11.8% think that the battery is a
source of constant current and that the resistor therefore has no influence on the amount
of current in the circuit. All other possible combinations are subsumed under "other",
depicted in yellow in Figure 3. The yellow bars indicate that students tend to choose a
combination which cannot easily be attributed to a specific concept when they are not
certain about their thinking and probably simply guess.
Construct validity was evaluated through factor analysis. A confirmatory factor analysis
with AMOS, using the maximum-likelihood method and including specific combinations
of answers from the first and second tiers of four different test items, resulted in a
χ² value of 5.805, which was not significant (p = .221). Therefore, a latent variable
sequential reasoning could be established (see Figure 5). This variable explains up to
52% of the variance of the single items.
A resistor and two lamps are connected to a battery.
a) What will happen to the brightness of the lamps if R is increased?
L1 remains constant, L2 decreases.
L1 decreases, L2 remains constant.
The brightness of both lamps increases.
The brightness of both lamps decreases.
The brightness of both lamps remains constant.
b) How would you explain your reasoning?
A change of the resistor only influences the brightness of the lamp if the lamp is behind the resistor.
Any change of the resistor influences the brightness of both lamps.
It is the same battery. Therefore, the same current is delivered.
Both lamps have a direct connection to the battery. Therefore, the resistor has no effect on the
lamps.
c) Are you sure about your answer to the previous two questions?
highly certain
rather certain
rather uncertain
highly uncertain
Figure 6: Item D
Furthermore, findings from the ANOVA reveal, for the correct answers to all four items
A to D, a main effect of the particular school and, respectively, of the particular teacher.
Surprisingly, students' conceptions depend neither on their gender nor on their age.
Finally, a regression analysis in which items A to C were used to predict sequential
reasoning on item D (see Figure 6) suggests that those three factors together explain 31%
of the variance for item D (F(3, 338) = 49.89, p < .0001) and are significant individual
predictors of students' sequential reasoning on item D.
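As a rough plausibility check, the reported F value can be reproduced from R² alone via the standard relation F = (R²/k) / ((1 - R²)/df_error); the small deviation from the reported 49.89 comes from R² being rounded to .31:

```python
def f_from_r2(r2, k, df_error):
    """F statistic of a regression with k predictors, computed from its R^2."""
    return (r2 / k) / ((1 - r2) / df_error)

# Reported: R^2 = .31 with F(3, 338) = 49.89; the rounded R^2 recomputes
# to an F close to 50, consistent with the reported value.
f = f_from_r2(0.31, 3, 338)
```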
REFERENCES
Bilal, E. and Erol, M. (2009). Investigating students' conceptions of some electricity
concepts. Latin American Journal of Physics Education, 3(2), 193-201.
Brna, P. (1988). Confronting misconceptions in the domain of simple electric circuits.
Instructional Science, 17, 29-55.
Closset, J.-L. (1983). Sequential reasoning in electricity. In: Proceedings of the
International Workshop on Research in Physics Education, La Londe les Maures,
Paris: Editions du CNRS.
Dupin, J.-J. and Johsua, S. (1987). Conceptions of French pupils concerning electric
circuits: Structure and evolution. Journal of Research in Science Teaching, 24(9),
791-806.
Hammer, D. (1996). More than misconceptions: Multiple perspectives on student.
knowledge and reasoning, and an appropriate role for educational research. American
Journal of Physics, 64(10), 1316-1325
Heller, P. M. and Finley, F. N. (1992). Variable uses of alternative conceptions: A case
study in current electricity. Journal of Research in Science Teaching, 29(3), 259-275.
McDermott, L.C. & Shaffer, P.S. (1992). Research as a guide for curriculum
development: An example form introductory electricity. Part I: investigation of
student understanding. American Journal of Physics, 60(11), 994-1003.
Pesman, H. and Eryilmaz, A. (2010). Development of a three-tier test to assess
misconceptions about simple electric circuits. Journal of Education Research,
103:3,208-222.
Riley M. S., Bee, N. V. and Mokwa, J. J. (1981). Representations in early learning: the
acquisition of problem-solving strategies in basic electricity and electronics. In:
Proc. Int. Workshop on Problems Concerning Students Representations of Physics
and Chemistry Knowledge, Ludwigsburg (Ludwigsburg, West Germany:
Paedagogische Hochschule) 107-173.
Rosencwajg, P. (1992). Analysis of problem solving strategies on electricity problems in
12 to 13 year olds. European Journal of Psychology of Education, VII,1, 5-22.
Sebastia, J. M. (1993). Cognitive mediators and interpretations of electric circuits. In:
The Proceedings of the Third International Seminar on Misconceptions and
Educational Strategies in Science and Mathematics, Misconceptions Trust: Ithaca,
NY (1993).
Shipstone, D. M. (1988). Pupils understanding of simple electric circuits: Some
implications for instruction. Phys. Educ. 23, 92-96.
Shipstone, D. M. (1984). A study of childrens understanding of electricity in simple DC
circuits. Eur. J . Sci. Educ. 6, 185-198.
Urban-Woldron, H. & Hopf, M. (2012). Developing a multiple choice test for
understanding basic electricity. ZfDN, 18, 201-227.
Methods
Tasks
For the comparison of the product- and process-based analysis we developed two experimental tasks for the domain of electric circuits in secondary school curricula (Schreiber et al., 2012). The first task is: 'Here are three bulbs. Find out the one with the highest power at 6 V.' In the second task the students get a set of wires and have to find the best conductor among three metals. Students have a set of apparatus and a pre-structured lab sheet at their disposal. The lab sheet is structured along our model of experimental skills, requiring students to plan a suitable experiment, assemble the experimental setup, perform measurements, evaluate the data, and draw conclusions. Both tasks are open-ended, and the students have to structure their paths towards the solutions on their own. They are only assisted by written information on the necessary physics content knowledge.
Design
Table 1 shows the design of the study. It was embedded in a more extensive study concerning the
comparison of different assessment tools for experimental skills (Schreiber 2012).
Table 1
Design of the study
pre-test: cognitive skills, content knowledge, self-concept (45 min)
training session (20 min)
hands-on tests: group 1, task 1 ('highest power') (30 min)
138 upper secondary students, aged about 16 to 17, took part in this study. In a pre-test we measured personal variables that are supposed to have an influence on students' test performance: cognitive skills, self-concept concerning physics and experimenting in physics, and content knowledge in the field of electricity. Established tests and questionnaires were adapted for this pre-test (Heller & Perleth 2000, Engelhardt & Beichner 2004, von Rhöneck 1988, Brell 2008). In the hands-on test the students worked on one of the two tasks described above. The use of two different tasks was due to the design of the more extensive project into which this study was embedded. The students were assigned to the two groups based on their pre-test results in such a way that a sufficient and similar variance of the personal variables was realized in both groups. In a training session the students were introduced to the hands-on test (structure of the tasks and handling of the devices). The training task was also taken from the domain of electric circuits (measuring the current-voltage characteristic of a bulb).
In the hands-on test, students worked with a set of electric devices and a pre-structured lab sheet (Figure 2, task 1). In the situation shown in Figure 2, the student documents his (inadequate) setup with two multimeters, a battery and a bulb in the lab sheet. The pre-structured lab sheet demands that students clarify the question, document the setup, perform measurements and interpret the results. Students can choose when and in which order they fill in the sheet. The lab sheet does not specify a particular solution or approach. Students' actions were videotaped and the lab sheets were collected.
Process-oriented analysis
The videos and the lab sheets were analysed according to the components of experimenting shown in Figure 1. The process-oriented analysis leads to a quality index for each student in each of these assessment categories. In a first step, students' actions in the videotape are assigned to one of the six components. A second step of analysis codes the qualities of intermediate stages (e.g. whether an experimental setup is correct, imperfect or wrong) and their development (e.g. whether an imperfect setup is detected and improved) (cf. Schreiber, Theyßen & Schecker 2012; Theyßen et al. 2013). The flow chart in Figure 3 illustrates an example of how the rating decisions are made. The result is a quality index on an ordinal scale with five levels. To secure validity and reliability of this analysis, several studies with high-inference expert ratings and interviews were conducted (details: Dickmann 2009; Holländer 2009; Fichtner 2011; Dickmann, Schreiber & Theyßen 2012; Schreiber 2012). The evaluation of double coding yields a high objectivity of the ratings (Cohen's Kappa .67).
Figure 3: Formal analysis scheme of the sequence analysis specified for setup skills.
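The double-coding objectivity check could in principle be reproduced as in the following minimal sketch (not the authors' code; the two raters' level assignments are hypothetical):

```python
# Sketch, not the authors' code: Cohen's kappa for two raters who each
# assigned a set of coded episodes to one of the five ordinal quality levels.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    # observed agreement: proportion of identical codes
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # chance agreement from each rater's marginal code frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical double coding of ten episodes on levels 1-5
r1 = [1, 2, 2, 3, 3, 4, 4, 5, 5, 1]
r2 = [1, 2, 2, 3, 4, 4, 4, 5, 5, 2]
print(round(cohens_kappa(r1, r2), 2))  # 0.75
```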
Product-oriented analysis
For the product-oriented analysis, only the students' documentations in the lab sheets were analysed with regard to the same six model components (skills in Figure 1). Each entry in the lab sheet is directly associated with an assessment category. The single criterion is the correctness of the entry. A development cannot be assessed, since in most cases only one result is documented in the sheets. Thus, using the formal analysis scheme (Figure 3), only the levels 1, 2, and 5 can be scored in the product-oriented analysis. Again, the objectivity in each assessment category is satisfactory (Cohen's Kappa > .62).
RESULTS
To test the hypotheses, rank correlations (Kendall-Tau b, τ) between the quality parameters from the product-oriented and the process-oriented analysis were calculated for each category (Table 2). In all four assessment categories that can be assigned to the preparation and the evaluation dimensions, the correlations are high. For components of the performance dimension, we found only medium or low correlations. Thus, both hypotheses can be confirmed.
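The per-category rank correlations could be computed as in this sketch (not the authors' code; the two score vectors are hypothetical, not the study's data):

```python
# Sketch, not the authors' code: Kendall's tau-b between product-based and
# process-based quality scores for one assessment category, with the usual
# tie correction for ordinal data.
from itertools import combinations
from math import sqrt
from collections import Counter

def kendall_tau_b(x, y):
    conc = disc = 0
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            conc += 1       # concordant pair
        elif s < 0:
            disc += 1       # discordant pair
    n = len(x)
    n0 = n * (n - 1) / 2
    n1 = sum(t * (t - 1) / 2 for t in Counter(x).values())  # ties in x
    n2 = sum(t * (t - 1) / 2 for t in Counter(y).values())  # ties in y
    return (conc - disc) / sqrt((n0 - n1) * (n0 - n2))

# Hypothetical scores of six students on the five-level ordinal scale
product = [1, 2, 2, 3, 4, 5]
process = [1, 2, 3, 3, 5, 5]
print(round(kendall_tau_b(product, process), 2))  # 0.89
```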
The high correlations in the planning and data evaluation dimensions can be explained by the data basis: in these dimensions, the process-oriented analysis also refers mainly to the documentations in the lab sheets. Only in a few cases did the videos provide further information concerning developments. Thus, regardless of the method of analysis, the scores 1, 2 and 5 dominate in the dimensions of planning and data evaluation.
Table 2
Correlations (Kendall-Tau b, τ) between the product-oriented and the process-oriented analysis, per assessment category. The correlations are highly significant (**) or significant (*). The assessment categories are assigned to the three experimental dimensions: preparation, performance and data evaluation. n: sample size.
A further result can be derived from Table 2: the sample size per category decreases over the course of the experiment. Whereas 138 students clarified the question and created an experimental design at the beginning, only 117 students interpreted the results at the end, a noticeable dropout of about 15%. The reason is the use of an open task format (Fig. 2). The students had to structure the approach on their own and without any assistance. Students who, for example, did not complete the setup were subsequently not able to measure and document data.
CONCLUSIONS
We draw two conclusions from our results:
1. Comparison of a process-oriented and a product-oriented analysis
A product-oriented analysis seems to be sufficient to analyse students' skills in preparing an experiment and evaluating data. In order to assess performance skills adequately, however, hands-on tests with a process-oriented analysis of students' actions seem to be necessary. These findings should be considered in the development of more valid assessment procedures.
2. Open task format and sample size
The use of an open task format in testing experimental skills ('Find out ...') causes a noticeable dropout of students during the test. For assessing the full range of experimental skills, we suggest a guided test with non-interdependent sub-tasks. Each sub-task should refer to a specific experimental skill. To allow for a non-interdependent assessment, each item should present a sample solution of the preceding step; e.g. a measurement item should provide a complete experimental setup. We have started to work on such a test format.
REFERENCES
American Association for the Advancement of Science (AAAS) (Ed.) (1993). Benchmarks for Science Literacy. New York: Oxford University Press.
Brell, C. (2008). Lernmedien und Lernerfolg - reale und virtuelle Materialien im Physikunterricht. Empirische Untersuchungen in achten Klassen an Gymnasien zum Laboreinsatz mit Simulationen und IBE. In H. Niedderer, H. Fischler & E. Sumfleth (Eds.), Studien zum Physik- und Chemielernen, Vol. 74. Berlin: Logos.
Department for Education and Employment (DEE) (Ed.) (1999). Science - The National Curriculum for England. London: Department for Education and Employment.
Dickmann, M. (2009). Validierung eines computergestützten Experimentaltests zur Diagnostik experimenteller Kompetenz (unpublished bachelor thesis). Dortmund: Technische Universität Dortmund.
Dickmann, M., Schreiber, N. & Theyßen, H. (2012). Vergleich prozessorientierter Auswertungsverfahren für Experimentaltests. In S. Bernholt (Ed.), Konzepte fachdidaktischer Strukturierung für den Unterricht (pp. 449-451). Münster: LIT.
Emden, M. & Sumfleth, E. (2012). Prozessorientierte Leistungsbewertung des experimentellen Arbeitens. Zur Eignung einer Protokollmethode zur Bewertung von Experimentierprozessen. Der mathematische und naturwissenschaftliche Unterricht (MNU), 65(2), 68-75.
Engelhardt, P. V. & Beichner, R. J. (2004). Students' understanding of direct current resistive electrical circuits. American Journal of Physics, 72(1), 98-115.
Fichtner, A. (2011). Validierung eines schriftlichen Tests zur Experimentierfähigkeit von Schülern (unpublished master thesis). Bremen: Universität Bremen.
Garden, R. (1999). Development of TIMSS Performance Assessment Tasks. Studies in Educational Evaluation, 25(3), 217-241.
Gut, C. (2012). Modellierung und Messung experimenteller Kompetenz. Analyse eines large-scale Experimentiertests. In H. Niedderer, H. Fischler & E. Sumfleth (Eds.), Studien zum Physik- und Chemielernen, Vol. 134. Berlin: Logos.
Hammann, M. (2004). Kompetenzentwicklungsmodelle: Merkmale und ihre Bedeutung dargestellt anhand von Kompetenzen beim Experimentieren. Der mathematische und naturwissenschaftliche Unterricht, 57(4), 196-203.
Heller, K. A. & Perleth, C. (2000). Kognitiver Fähigkeitstest für 4.-12. Klassen, Revision (KFT 4-12+ R). Göttingen: Hogrefe.
Holländer, L. K. (2009). Validierung eines Experimentaltests mit Realexperimenten zur Diagnostik experimenteller Kompetenz (unpublished bachelor thesis). Dortmund: Technische Universität Dortmund.
Klahr, D. & Dunbar, K. (1988). Dual Space Search During Scientific Reasoning. Cognitive Science, 12, 1-48.
Neumann, K. (2004). Didaktische Rekonstruktion eines physikalischen Praktikums für Physiker. In H. Niedderer, H. Fischler & E. Sumfleth (Eds.), Studien zum Physik- und Chemielernen, Vol. 38. Berlin: Logos.
INTRODUCTION
In Swiss schools, science is usually taught as one subject at the lower secondary level. Accordingly, the national education standards do not differentiate between biology, chemistry, and physics. Within the Swiss project HarmoS (Labudde et al., 2012), an interdisciplinary structure model of scientific competence was developed. The validation of the model by a large-scale hands-on test showed that the progression of experimental competence could not be explained post hoc (Gut, 2012). Based on these results, we developed a new interdisciplinary normative progression model of experimental competence for practical assessments. In this paper, the model and first results of its validation by pilot assessments are presented.
RATIONALE
According to Schecker & Parchmann (2006), normative competence models serve different purposes: such models should help to define competence by determining structure and progression. They should be practical for formulating adequate standards for experimenting in school science and useful for teacher education. They should also provide an appropriate basis for the development of valid and reliable assessments of students' performance (e.g. Kauertz et al., 2012; Lunetta et al., 2007). In order to attain these goals, four kinds of a priori decisions have to be made when experimental competence is modelled. First, one has to define which scientific problems are to be standardised and assessed. Second, one has to decide how competence may be decomposed into sub-dimensions. In the often-used process approach, problem solving is conceived as a linear chain of processes (Murphy & Gott, 1984), such as formulating a hypothesis, planning and carrying out experiments, and analysing data (Emden & Sumfleth, 2011). Alternatively, one can differentiate between types of problems. The solution of each type demands specific knowledge and skills (Millar et al., 1996). Therefore, each solution is scored by typical scoring schemes and criteria (Ruiz-Primo & Shavelson, 1996). Third, it has to be decided whether the progression of competence is modelled in terms of task complexity (e.g. Wellnitz et al., 2012), in terms of the quality criteria with which standardised problems are solved (Millar et al., 1996), or in terms of both simultaneously. The fourth decision concerns the kind of assessment (hands-on, simulation, or paper-and-pencil test) by which the competence is to be measured. Of course, these four decisions are not independent of each other. The process approach leads to a restricted view of experimental activities, excluding engineering tasks for instance. Also, considering the variety of problem types, the complexity cannot be empirically explained based on one single progression model (Gut, 2012).
Figure 1: Progression model for the problem type 'measurement with a given scale' (quality standards: single measurement, documentation of results, measurement repetition, instrument selection, sources of measurement error; levels I-IV); the hierarchy of quality standards is set a priori.
METHOD
For the first internal validation, six tasks were developed: three tasks for the problem type 'categories conducted observation' and three tasks for 'measurement with a given scale', within selected topics in biology, chemistry, and physics. An example for the problem type 'measurement with a given scale' in physics is the task 'thread', where students should find out at what force a thread breaks. They get a thread, scissors, two different spring scales (A and B) and a calculator. Students are asked to think about how they could answer the question, with which instrument they can do it best, and how many measurements would be necessary. After finding the result they have to draw or write down how they did the measurement. In addition, they are explicitly asked to state with which instrument they did the measurement. Afterwards they are asked to estimate how exact the measurement is and how they could improve it. At the end, there are some control questions, such as which spring scale they used or whether they calculated a mean and how they did it.
The pilot assessment was administered to 250 students of different grades (7, 8, and 9) and levels of lower secondary school, including low achievers in particular. In the following, 'low level' indicates the lowest achievement level, whereas 'high level' indicates the highest level of the so-called Sekundarschule (in Switzerland, Gymnasium is even higher). The students had to solve three tasks, each in 20 minutes. They worked on their own with printed test sheets on which they were asked to write down their answers and a brief report. Each task was coded by at least two persons and as one single item, i.e. all answers were evaluated as a whole: several dichotomous quality criteria could be achieved or not. These quality criteria were clustered into the quality standards (see Figure 1), which could be achieved or not. To achieve a quality standard, 50% or more of its quality criteria had to be achieved.
The item score can be determined in two ways: on the one hand, one can sum all achieved quality standards (unconditional level score, uLev). On the other hand, one can use the hierarchy of quality standards and set the item score to the highest level up to which all lower quality standards are achieved (conditional level score, cLev). For example, if a student achieves quality standards 1, 2, 3 and 5 in a task, his unconditional level score would be uLev = 4 and his conditional level score would be cLev = 3. The advantage of conditional levelling is that it recovers information which is lost by summing up scores.
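The 50% clustering rule and the two scoring variants can be sketched as follows (not the authors' code; the worked example reproduces the uLev = 4, cLev = 3 case from the text):

```python
# Sketch, not the authors' code: scoring one task from dichotomous
# quality criteria, following the rules described in the text.
def quality_standards(criteria_by_standard):
    """A standard is achieved when >= 50% of its criteria are achieved.
    criteria_by_standard: dict {standard_level: [0/1, 0/1, ...]}"""
    return {
        level: sum(c) >= 0.5 * len(c)
        for level, c in criteria_by_standard.items()
    }

def u_lev(achieved):
    # unconditional level score: number of achieved standards
    return sum(achieved.values())

def c_lev(achieved):
    # conditional level score: highest level up to which all lower
    # standards (and that standard itself) are achieved
    score = 0
    for level in sorted(achieved):
        if achieved[level]:
            score = level
        else:
            break
    return score

# Worked example from the text: standards 1, 2, 3 and 5 achieved
achieved = {1: True, 2: True, 3: True, 4: False, 5: True}
print(u_lev(achieved))  # 4
print(c_lev(achieved))  # 3
```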
RESULTS
The results of the first pilot assessment affirm that the progression model for the two problem types 'categories conducted observation' and 'measurement with a given scale' can be applied reasonably in all three science subjects. One-dimensional Rasch analyses (with the program Winsteps) for each problem type show good item fits. For the conditional level score, we found .96 < infit < 1.13 and .82 < outfit < 1.24 for the three 'categories conducted observation' tasks, and .81 < infit < 1.22 and .76 < outfit < 1.32 for the three 'measurement with a given scale' tasks. At least for 'measurement with a given scale', a sufficiently high reliability (> .6) is achieved.
Figure 2: Frequencies of achieved quality standards (single measurement, documentation of results, multiple measurement, instrument selection, sources of measurement error) for the task 'thread', differentiated into four groups (low and high levels of grades 7 and 9 of the Sekundarschule).
Figure 2 shows some misfits of the model: first, the frequencies for every group should decrease the higher (i.e. the further right on the x-axis) a quality standard is. Second, for every quality standard, the frequencies of high-level groups should exceed those of low-level groups of the same grade, and those of higher-grade groups should exceed those of lower-grade groups of the same level. Not every task shows the same misfits as in Figure 2, but generally the differentiation and the hierarchy of level 3 (multiple measurement) and level 4 (instrument selection) are critical. Further tests have to show whether these two levels have to be changed or merged.
Validation of tasks
The tasks can be standardised well with respect to the structure of the test sheet, the question formats, and the textual demands. All tasks were coded by at least two persons. On the level of quality criteria as well as of quality standards, a high interrater correlation (> .8) can be achieved. Nevertheless, 'categories conducted observation' tasks require more rater training than 'measurement with a given scale' tasks.
Separate one-dimensional Rasch analyses with unconditional and with conditional levels show high correlations between the two scoring alternatives ('categories conducted observation': .940**; 'measurement with a given scale': .847**). Therefore, in order to gain more information, it seems reasonable to work with the unconditional levels instead of the conditional ones.
Students' performance
To make the results comparable, we transformed the item parameters to skill points. Following PISA and other large-scale assessments, we set the mean of all students of grade 9 to 500 with a standard deviation of 100. Table 1 presents the results for the tasks of the problem types 'categories conducted observation' and 'measurement with a given scale'. For 'categories conducted observation', the increases from grade 7 to 9 (high level) and from low to high level in grade 9 are significant, with a value of almost a standard deviation. For 'measurement with a given scale', all increases from grade 7 to 9 (both levels) and from low to high level (in grades 7 and 9) are significant, with a value of almost a standard deviation.
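This linear rescaling can be sketched as follows (not the authors' code; the illustrative person parameters are hypothetical):

```python
# Sketch, not the authors' code: rescaling Rasch person parameters to
# "skill points" so that the grade-9 subsample has mean 500 and SD 100.
import statistics

def to_skill_points(thetas, grade9_thetas):
    m = statistics.mean(grade9_thetas)
    s = statistics.pstdev(grade9_thetas)
    return [500 + 100 * (t - m) / s for t in thetas]

# Hypothetical person parameters; the reference group fixes the scale
print(to_skill_points([1, 0, 2, -1], [0, 2]))  # [500.0, 400.0, 600.0, 300.0]
```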
Table 1
Students' performance for tasks of the problem types 'categories conducted observation' (left) and 'measurement with a given scale' (right). In each case the mean and the standard deviation (in parentheses) are given. U stands for Mann-Whitney U test, t for t-test and n.s. for not significant.

categories conducted observation    low level    high level
grade 7                             437 (79)     452 (93)
grade 9                             459 (86)     537 (99)

measurement with a given scale      low level    high level
grade 7                             365 (101)    450 (107)
grade 9                             458 (83)     538 (101)

Reported test statistics for the grade and level comparisons: n.s., U: .000, U: .006, U: .008, t: .011, t: .009.
With these results we can assign the students to the levels shown in Table 2.
Table 2
Allocation of students' performance to quality standard levels for tasks of the problem types 'categories conducted observation' (left) and 'measurement with a given scale' (right).

categories conducted observation    low level    high level
grade 7                             I            I
grade 9                             I            II

measurement with a given scale      low level    high level
grade 7                             I            II
grade 9                             II           IV
For 'categories conducted observation', students in grade 7 and in the low level of grade 9 achieve level I, i.e. they observe a phenomenon correctly and completely. Only high-level students of grade 9 achieve level II, i.e. they also identify differences between two observations. For 'measurement with a given scale', low-level students in grade 7 achieve level I, i.e. they measure correctly. High-level students of grade 7 and low-level students of grade 9 achieve level II, i.e. they also prepare a proper documentation of results. Finally, high-level students of grade 9 attain level IV, i.e. in addition they repeat measurements and select the right instruments.
REFERENCES
Emden, M. & Sumfleth, E. (2012). Prozessorientierte Leistungsbewertung. Zur Eignung einer Protokollmethode für die Bewertung von Experimentierprozessen. Der mathematisch-naturwissenschaftliche Unterricht, 65(2), 68-75.
Gut, C. (2012). Modellierung und Messung experimenteller Kompetenz. Analyse eines large-scale Experimentiertests. Berlin: Logos.
Kauertz, A., Neumann, K. & Härtig, H. (2012). Competence in Science Education. In B. J. Fraser, K. Tobin & C. J. McRobbie (Eds.), Second international handbook of science education (pp. 711-721). Berlin: Springer.
Labudde, P., Nidegger, C., Aamina, M. & Gingins, F. (2012). The Development, Validation, and Implementation of Standards in Science Education: Chances and Difficulties in the Swiss Project HarmoS. In S. Bernholt, K. Neumann & P. Nentwig (Eds.), Making it tangible: Learning outcomes in science education (pp. 235-259). Münster: Waxmann.
Lunetta, V. N., Hofstein, A. & Clough, M. P. (2007). Learning and Teaching in the School Science Laboratory: An Analysis of Research, Theory, and Practice. In S. K. Abell & N. G. Lederman (Eds.), Handbook of research on science education (pp. 393-441). Mahwah: Erlbaum.
Millar, R. & Driver, R. (1987). Beyond processes. Studies in Science Education, 14, 33-62.
INTRODUCTION
Individualised teaching has been featured in most education acts of Germany since 2005. The implementation of individualised teaching seems to be problematic: firstly, there is no consensus in the literature about the definition of individualised teaching; secondly, there is insufficient learning material, especially in the field of science education. Further problems are caused by the general framework of the German school system: teachers do not have time to create new learning material, and even less so in an individualised way for each student. Beyond that, research on individualised teaching reveals that those methods increase the learning outcomes of high-achieving students only. To sum up, more research is needed concerning the effectiveness of individualised teaching methods and ways to support the learning process of every student. In addition, learning material for individualised teaching has to be developed and evaluated. This study aimed at two aspects: firstly, it developed and evaluated a self-evaluation sheet as a diagnostic instrument and a newly constructed learning unit for individualised chemistry education in which the work with self-evaluation is embedded; secondly, it explored the effects of this diagnostic tool on learning achievements. The second aspect will be dealt with in more depth.
RATIONALE
At the beginning of the 20th century, individualised teaching was called for by representatives of progressive education (e.g. Isaacs, 2010). After the discussion had fallen silent for some time, it was reopened by the implementation of individualised teaching in the education acts of Germany in 2005. This has initiated the debate about teaching methods and their effects on academic achievement in the context of individualised teaching.
There are various definitions of individualised teaching in the literature. This paper uses the following definition according to Kunze (2009) and Trautmann and Wischer (2008): individualised teaching consists of two aspects. The first aspect is the diagnosis of students' needs and knowledge concerning a specific topic. The second aspect is, based on the results of the diagnosis, the implementation of differentiating learning environments. The aim of individualised teaching is to support each student's skills specifically. This can range from closing knowledge gaps to promoting individual strengths.
Research on individualised teaching methods and differentiating learning environments seems to demonstrate that only high achievers profit from these methods and that there is no influence on low and average achieving students (e.g. Bode, 1996; Baumert, Roeder, Sang, & Schmitz, 1986; Helmke, 1988). In addition, research indicates that individualised instruction has only a small effect on students' achievements (Hattie, 2012). Some researchers argue that the implementation of individualised teaching is highly time-consuming (e.g. Gruehn, 2000) and that, as a consequence, there is less time for work on task. Furthermore, the concept of developing individualised learning environments for each student in class is not realistic. In conclusion, more research concerning the effectiveness of individualised teaching and ways of implementing it is needed.
One aim of this study was to develop an individualised learning unit which can be adapted to chemistry education in an efficient way. For this purpose, the theory of individualised teaching and the theory of self-regulated learning were combined. According to Zimmerman (2002), self-regulated learners are active participants in their own learning process. They are aware of their strengths as well as of gaps in their knowledge, and they are self-motivated to achieve desired goals (Zimmerman, 1990). Zimmerman's model, which relies on the social-cognitive perspective, defines self-regulated learning as a cyclical process consisting of three important phases: forethought, performance, and reflection. These phases are divided into different categories. In the forethought phase, students plan their learning process, set the goals they want to achieve, and motivate themselves. This procedure is influenced by students' self-efficacy beliefs and experiences. The performance phase is characterised by self-monitoring processes in order to regulate the actual learning behaviour. After that, the learning process is evaluated and reflected upon in the reflection phase. Students judge their achievements and find causal attributions. According to this reflection, self-regulated learners set new goals and proceed with the forethought phase (Zimmerman, 2002). To sum up, self-regulated learning is context-specific and depends on students' metacognitive, motivational and cognitive abilities (Zimmerman, 2005). There is evidence that self-regulated learning can lead to higher learning outcomes (Perels, Dignath, & Schmitz, 2009; Zimmerman & Martinez-Pons, 1988). From the social-cognitive perspective, self-regulated learning is not a static trait. Research indicates that it can be trained and needs to be adapted to many different situations (Perels, Gürtler, & Schmitz, 2005). Some empirical studies focus their trainings for self-regulated learning on selected categories of the three phases, for example self-monitoring. In this context, results reveal that consistent self-monitoring leads to higher self-efficacy beliefs and better academic performance (Schunk, 1982/1983). Besides, the accuracy of self-evaluation and self-assessment seems to be a helpful predictor of learning success (Chen, 2003; Kostons, van Gog, & Paas, 2012).
This study used a self-evaluation sheet to combine the theory of individualised teaching with the theory of self-regulated learning. The work with the self-evaluation sheet was embedded in an individualised learning unit and focused on the following aspects of the theories: firstly, students evaluate their own performance and use the self-evaluation sheet as a diagnostic tool. Secondly, it enables students to plan, monitor and document their own learning process. Thirdly, learners autonomously record their learning behaviour on the sheet, which is important for a well-founded self-reflection.
RESEARCH QUESTIONS
Based on the theoretical background, this study addressed the following research
questions:
Question 1: Does self-evaluation have an effect on learning outcomes?
Question 2: Does self-evaluation have a long-term effect on learning outcomes?
Question 3: Does self-evaluation have an effect on students' feelings towards the individualised teaching unit?
METHODS
Materials
The materials used in the learning unit were constructed concerning the topic
chemical reaction. They consist of the self-evaluation sheet, problem sheets,
information texts, information cards and model answers. The self-evaluation sheet is
structured in tabular form. Nine statements about the students abilities are listed,
written in the first person singular. Each statement covers one subtopic of the topic
chemical reaction. Students assess themselves on a four-point Likert scale going from
I am very confident to I am not confident at all. On the self-evaluation sheet, there
are direct links to the problem sheets and information texts, which deal with the
particular subtopic. Besides, there is space to document what material has been used.
Students are encouraged to record their learning progress by evaluating themselves
again after a while, this time using a pen in a different colour. For those students who
are very confident in every subtopic, challenging exercises are provided. Students
work autonomously with the learning material. They decide in what order they use it.
Feedback towards the correctness can be obtained through model answers.
A multiple-choice test was developed to assess students' chemistry performance
regarding the chemical reaction (35 items with five alternatives each, Cronbach's
alpha = .85). Each statement on the self-evaluation sheet was covered by three or four
items of the multiple-choice test. Pre-, post- and follow-up tests consisted of identical
items but differed in item order. Students' feedback on the learning unit was assessed
with a feedback questionnaire (17 items with a five-point Likert scale ranging from 1,
"completely agree", to 5, "completely disagree"; Cronbach's alpha = .89). Some of
these items have been adapted from the
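The reliability coefficients reported for both instruments (alpha = .85 and .89) follow the standard formula alpha = k/(k-1) * (1 - sum of the item variances / variance of the total score). A minimal NumPy sketch on illustrative 0/1-scored data, not the study's:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Illustrative data: 50 students, 6 dichotomously scored items driven by one ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))
scores = (ability + rng.normal(scale=0.8, size=(50, 6)) > 0).astype(float)
alpha = cronbach_alpha(scores)
```

Perfectly redundant items give alpha = 1; uncorrelated items push it towards 0, which is why the coefficient serves as an internal-consistency check.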
Participants
Students attending the seventh grade of upper secondary schools in Germany
participated in this study (N = 234). Data sets of 218 students could be used in the
pre/post analyses. Because of additional drop-out, 207 data sets were included in the
pre/follow-up analyses. The mean age of the sample was 13.23 years (SD = 0.43),
and 57.3 % of the students were male.
Design
To answer the research questions, two experimental groups were created: the
self-evaluation group worked with the self-evaluation sheet and the learning material,
whereas the control group worked with the same learning material but without the
self-evaluation sheet. To help this group understand the structure of the material, it
received a short overview of the subtopics the material covers.
One week before the unit, chemistry performance, intellectual performance and
academic self-concept were assessed (ca. 60 minutes). Based on matched pairs,
students were assigned to the two learning conditions: 108 students worked in the
control group and 110 in the self-evaluation group. Both groups worked
simultaneously with the learning material in two different rooms, each guided by one
experimenter. The learning unit took 90 minutes and began with a 10-minute
introduction. In order to analyse the learning behaviour, 72 randomly chosen students
were videotaped during the lessons, and the learning material of all students who
worked in the unit was scanned (n = 230). One week after the unit, chemistry
performance was measured again. Additionally, differential self-concept and
feedback on the unit were assessed (ca. 45 minutes). Four weeks after the post-test,
data on chemistry performance was collected for the third time (follow-up test, ca. 30
minutes).
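The matched-pairs assignment can be sketched as follows: order students by pre-test score, pair neighbours, and split each pair at random between the two conditions, so both groups start at a comparable performance level. This is a hedged reconstruction; the function and names are hypothetical, and the study may also have matched on intellectual performance and self-concept:

```python
import random

def matched_pairs_assign(pretest, seed=0):
    """pretest: mapping student_id -> pre-test score.
    Returns a dict with the two condition groups."""
    rng = random.Random(seed)
    ordered = sorted(pretest, key=pretest.get)  # ascending by pre-test score
    groups = {"self_evaluation": [], "control": []}
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]     # two students with similar scores
        rng.shuffle(pair)                       # random split within the pair
        groups["self_evaluation"].append(pair[0])
        groups["control"].append(pair[1])
    if len(ordered) % 2:                        # odd leftover student
        groups[rng.choice(sorted(groups))].append(ordered[-1])
    return groups
```

Matching before randomising is what makes the later residual comparisons interpretable, since the conditions are balanced on baseline performance.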
RESULTS
For the group comparisons, a residual analysis was conducted. In line with the results
of the pilot study, it was expected that the work with the self-evaluation sheet would
lead to higher learning outcomes directly after the unit and one month later. The
analysis of the residuals of the pre- and post-tests revealed a significant difference
between the groups, t(216) = 2.53, p = .012, d = 0.35. The self-evaluation group
scored significantly higher (M = 0.17, SD = 0.91) than the control group (M = -0.17,
SD = 1.06). Regarding a long-term effect, the residual analysis of the pre- and
follow-up tests indicated similar findings, t(205) = 2.14, p = .033, d = 0.29, with
better learning outcomes for the self-evaluation group (M = 0.14, SD = 1.00) than
the control group (M = -0.15, SD = 0.98). It was also expected that students' feelings
towards the unit would be more positive in the self-evaluation group. The analysis of
the feedback questionnaire revealed a significant difference, t(214) = 3.13, p = .002,
d = 0.50. Students in the self-evaluation group had more positive feelings towards the
lesson (M = 1.75, SD = 0.56) than students in the control group (M = 2.00, SD =
0.65); note that lower values indicate stronger agreement.
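The residual analysis used for these comparisons can be reproduced in outline: regress the post-test score on the pre-test score, take each student's residual (how much better or worse they did than the pre-test predicts), and compare the groups' residuals with an independent-samples t-test plus Cohen's d. A SciPy sketch on simulated data; all numbers below are illustrative, not the study's:

```python
import numpy as np
from scipy import stats

def residual_group_test(pre, post, group):
    """t-test and Cohen's d on the residuals of post regressed on pre.
    group: 1 = self-evaluation, 0 = control (the coding is our assumption)."""
    slope, intercept, *_ = stats.linregress(pre, post)
    resid = post - (intercept + slope * pre)     # performance beyond prediction
    a, b = resid[group == 1], resid[group == 0]
    t, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return t, p, (a.mean() - b.mean()) / pooled_sd

# Simulated cohort in which the treatment adds a constant gain on top of the pre-test.
rng = np.random.default_rng(1)
pre = rng.normal(20, 5, size=200)
group = np.tile([1, 0], 100)
post = 0.8 * pre + 3 * group + rng.normal(0, 2, size=200)
t, p, d = residual_group_test(pre, post, group)
```

Working on residuals rather than raw gain scores removes the part of the post-test that is already explained by baseline performance.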
With regard to the findings of empirical research, interaction effects between group
and cognitive performance level were analysed. The ANOVA on the pre-post
residuals revealed a significant main effect of cognitive performance level, F(2, 215)
= 4.52, p = .012, with higher learning outcomes for the high-achieving students (M =
0.29, SD = 0.77) than for the average (M = -0.15, SD = 0.98) and low achievers (M =
-0.16, SD = 1.14). As the t-test already indicated, the main effect of learning
condition was also significant, F(1, 216) = 5.23, p = .023, with students in the
self-evaluation group scoring higher than students in the control group. The
interaction of group and cognitive performance level was not significant, F(2, 215) =
0.37, p = .693.
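The main effect of cognitive performance level corresponds to an F-test across the three achiever groups' residuals. A minimal sketch with SciPy's one-way test, using simulated residuals that only loosely echo the reported means and SDs (the group sizes are invented); testing the reported interaction as well would require a factorial, two-way model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated pre-post residuals per cognitive performance level; the (M, SD)
# values mimic the reported descriptives, the group sizes are assumptions.
low = rng.normal(-0.16, 1.14, size=70)
average = rng.normal(-0.15, 0.98, size=73)
high = rng.normal(0.29, 0.77, size=75)
F, p = stats.f_oneway(low, average, high)
```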
Future research might focus on how to assess self-regulated learning in this context
and on whether working with self-evaluation sheets can empower students to become
more self-regulated.
It can be concluded that working with the self-evaluation sheet could be an effective
way to implement individualised teaching methods in chemistry education. The focus
on students' self-responsibility might be one way to develop and implement such
methods. It could be shown that the self-evaluation sheet supports every student
regardless of their cognitive performance level.
REFERENCES
Baumert, J., Roeder, P. M., Sang, F., & Schmitz, B. (1986). Leistungsentwicklung
und Ausgleich von Leistungsunterschieden in Gymnasialklassen [Development of
achievement and compensation of differences in achievements of upper secondary
students]. Zeitschrift für Pädagogik, 32(5), 639–660.
Bode, R. K. (1996). Is it ability grouping or the tailoring of instruction that makes a
difference in student achievement? Paper presented at the annual meeting of the
American Educational Research Association, New York.
Gruehn, S. (2000). Unterricht und schulisches Lernen: Schüler als Quellen der
Unterrichtsbeschreibung [Education and academic achievement: Students as
sources to describe education in school]. Münster u. a.: Waxmann Verlag.
Hattie, J. (2012). Visible learning for teachers: Maximizing impact on learning.
London and New York: Routledge.
Helmke, A. (1988). Leistungssteigerung und Ausgleich von Leistungsunterschieden
in Schulklassen: unvereinbare Ziele? [Increase of achievement and compensation
of differences in achievements: contradictory aims?] Zeitschrift für
Entwicklungspsychologie und Pädagogische Psychologie, 20(1), 45–76.
Isaacs, B. (2010). Bringing the Montessori Approach to your Early Years Practice.
London and New York: Routledge.
Kunze, I. (2009). Begründungen und Problembereiche individueller Förderung in der
Schule - Vorüberlegungen zu einer empirischen Untersuchung [Reasons and
problems of individualised teaching - Preliminary considerations of an empirical
study]. In I. Kunze & C. Solzbacher (Eds.), Individuelle Förderung in der
Sekundarstufe I und II [Individualised teaching in secondary schools] (pp. 13–26).
Baltmannsweiler: Schneider-Verlag Hohengehren.
Perels, F., Dignath, C., & Schmitz, B. (2009). Is it possible to improve mathematical
achievement by means of self-regulation strategies? Evaluation of an intervention
in regular math classes. European Journal of Psychology of Education, 24(1),
17–31.
Perels, F., Gürtler, T., & Schmitz, B. (2005). Training of self-regulatory and
problem-solving competence. Learning and Instruction, 15(2), 123–139.
Rheinberg, F., Vollmeyer, R., & Burns, B. D. (2001). QCM: A questionnaire to assess
current motivation in learning situations. Retrieved from
http://www.psych.uni-potsdam.de/people/rheinberg/messverfahren/FAMLangfassung.pdf
Rost, D. H., Sparfeldt, J. R., & Schilling, S. R. (2007). DISK-Gitter mit SKSLF-8.
Differentielles Schulisches Selbstkonzept-Gitter mit Skala zur Erfassung des
Selbstkonzepts schulischer Leistungen und Fähigkeiten [Differential academic
self-concept test with one scale for assessing self-concept regarding academic
achievement and academic skills]. Göttingen: Hogrefe.
Schöne, C., Dickhäuser, O., Spinath, B., & Stiensmeier-Pelster, J. (2002). SESSKO.
Skalen zur Erfassung des schulischen Selbstkonzepts [Scales for assessing
academic self-concept]. Göttingen: Hogrefe.
Schunk, D. H. (1982/1983). Progress self-monitoring: Effects on children's
self-efficacy and achievement. Journal of Experimental Education, 51(2), 89–93.
Trautmann, M., & Wischer, B. (2008). Das Konzept der Inneren Differenzierung -
eine vergleichende Analyse der Diskussion der 1970er Jahre mit dem aktuellen
Heterogenitätsdiskurs [The concept of internal differentiation - a comparative
analysis of the 1970s discussion against the current discourse on heterogeneity].
In M. A. Meyer, M. Prenzel, & S. Hellekamps (Eds.), Zeitschrift für
Erziehungswissenschaft: Sonderheft 9. Perspektiven der Didaktik (pp. 159–172).
Wiesbaden: VS Verlag für Sozialwissenschaften.
Weiß, R. H. (1998). Grundintelligenztest Skala 1 (CFT 20) [Standard intelligence test
scale 1 (CFT 20)]. Göttingen: Hogrefe.
Zimmerman, B. J. (1990). Self-regulated learning and academic achievement: An
overview. Educational Psychologist, 25(1), 3–17.
Zimmerman, B. J. (2002). Becoming a self-regulated learner: An overview. Theory
into Practice, 41(2), 64–70.
Zimmerman, B. J. (2005). Attaining self-regulation: A social cognitive perspective.
In M. Boekaerts, P. R. Pintrich, & M. Zeidner (Eds.), Handbook of self-regulation
(pp. 13–39). Burlington, San Diego, London: Elsevier Academic Press.
Zimmerman, B. J., & Martinez-Pons, M. (1988). Construct validation of a strategy
model of student self-regulated learning. Journal of Educational Psychology,
80(3), 284–290.