
THEORY OF MEASUREMENT:

Everything You Wanted To Know About Classroom Assessment But Were Afraid To Ask

Alexander Beaujean and William Shiu


Baylor Psychometric Laboratory
http://homepages.baylor.edu/psychometric_lab

2008 Teaching Colloquy, Department of Religion

TABLE OF CONTENTS
Definitions
Test Design
Test Score Properties: Reliability and Validity
Cognitive Processes
Some Item Types
Developing the Test
Take Home Message


Definitions


DEFINITIONS

Test
(noun) Etymology:

Middle English, vessel in which metals were assayed [analyzed], potsherd, from Anglo-French test, tees pot, Latin testum earthen vessel; akin to Latin testa earthen pot, shell

Definition:
(1): a procedure, reaction, or reagent used to identify or characterize a substance or constituent
(2): something (as a series of questions or exercises) for measuring the skill, knowledge, intelligence, capacities, or aptitudes of an individual or group

(Test. (2008). In Merriam-Webster Online Dictionary. Retrieved September 26, 2008, from http://www.merriam-webster.com/dictionary/test)


DEFINITIONS

(Achievement) Test:

A collection of items or tasks used to measure an underlying construct of interest, the results (i.e., test scores) of which allow for decisions based on the construct's level


DEFINITIONS

Item:

Genesis is the first book of the Bible. T/F

Item stem: the statement ("Genesis is the first book of the Bible.")
Item response: the answer the examinee gives (T or F)


DEFINITIONS

Construct:
A trait/attribute/quality that is not operationally defined; a latent entity whose level and relationships with other objects (either latent or manifest) can only be inferred

Latent:

Extant, but not perceivable by bodily senses

Cronbach & Meehl (1955)


Test Design


TEST DESIGN

Test Philosophy:

What will (and will not) your test measure?

About what construct do you hope to make inferences?

What is required for your test to measure that construct?


TEST DESIGN

Context + Person Ability/Trait (Construct) → Cognitive Process(es) → Item Response


TEST DESIGN

Test Purpose: What information do you want to obtain from this test, and what decision(s) do you need to make from this information?


TEST DESIGN

Examinee Population:

For whom is this test intended?


TEST DESIGN

Constraints
Time to take the test
Platform: paper vs. computer (security/standardization)
Location

Administration
Entire group vs. subgroups vs. individual


Test Score Properties: Reliability and Validity


RELIABILITY

Reliability

Do the test scores measure their construct consistently?
Contributors to inconsistency:
Random error (varies from examinee to examinee)
Systematic error (consistent for all examinees)
Effects can be innocuous or severe, depending on the purpose of the test


RELIABILITY

Estimation
0 < reliability < 1
Published tests: .80-.95
Classroom tests: .50

Methods
Correlation between 2 administrations (of the same test)
Correlation among test items: internal consistency (e.g., coefficient α)

See Frisbie (1988)
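Coefficient α can be computed directly from an examinees-by-items score matrix: α = k/(k−1) · (1 − Σ item variances / total-score variance). A minimal sketch in Python (NumPy assumed available; the function name is ours):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha from an examinees x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                            # number of items
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Example: 4 examinees x 3 dichotomous items
print(cronbach_alpha([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.75
```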


RELIABILITY

Influences on Reliability Estimates

Length (quantified by the Spearman-Brown formula; see the sketch below)
Dimensionality: how many constructs is the test measuring?
Item difficulty
Item discrimination: how much more likely is a correct response from examinees high on the construct than from examinees low on it?
Heterogeneity of the examinees
Student factors (motivation, testwiseness)
Time allotment
Security
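Of these influences, length is the one with a simple closed form: the Spearman-Brown prophecy formula predicts the reliability of a lengthened (or shortened) test from the current estimate. A small sketch (function name ours):

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor,
    assuming the added items are parallel to the existing ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling a .50-reliability classroom test predicts roughly .67:
print(round(spearman_brown(0.50, 2), 2))  # 0.67
```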


VALIDITY

Validity

Are the test scores measuring the intended construct?
Validity is an argument, for which you need multiple strands of evidence, e.g.:
Do the scores appear to measure the intended construct?
Do experts think they measure the intended construct?
Do they relate as expected to other measures, both ones that measure the same thing and ones that measure different things? (illustrated below)
Do they predict outcomes of interest?
Do the test's items have a basis in the curriculum?

See AERA/APA/NCME (1999)
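The convergent/discriminant strand can be checked with simple correlations. A toy sketch with simulated scores (all names and data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
our_test = rng.normal(size=50)
same_construct = our_test + rng.normal(scale=0.5, size=50)   # convergent measure
other_construct = rng.normal(size=50)                        # discriminant measure

print(np.corrcoef(our_test, same_construct)[0, 1])    # should be high (~.9)
print(np.corrcoef(our_test, other_construct)[0, 1])   # should be near 0
```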



Cognitive Processes


ASSESSMENT PROCESS
Good Classroom Assessments Flow From the Class's Instructional Objectives/Learning Outcomes and Allow Inferences About the Construct of Interest


ASSESSMENT PROCESS

Learning Objectives → Cognitive Processes → Item Responses → Test Scores → Inference About the Construct


BLOOMS TAXONOMY

Bloom (1956): six levels, ordered from lowest to highest in development/difficulty:
Knowledge → Comprehension → Application → Analysis → Synthesis → Evaluation


BLOOMS TAXONOMY

Level 1-Knowledge
Recall information
Some item stems: recall, recite, list, label, define, identify, quote, who, what, when, where, tell, describe, relate, locate, write, find, state, name

Examples:
Define consubstantiation.
Who was Constantine?
When were the first Crusades?
List the five points of Calvinism.


BLOOMS TAXONOMY

Level 2-Comprehension

Understand information
Some item stems: demonstrate, explain, describe, interpret, summarize, cause-effect, outline, discuss, distinguish, restate, translate

Example:
Why did Paul write to the church at Philippi?
(a) To address the issue of rivals and uphold his apostleship
(b) To preserve the view of justification by faith
(c) To emphasize that under salvation by Christ, Jews and Gentiles are brought together


BLOOMS TAXONOMY

Level 3-Application
Use information
Some item stems: demonstrate, apply, calculate, illustrate, show, construct, interview, solve, use, complete, examine, classify

Example:

Translate the following into English:



BLOOMS TAXONOMY

Level 4- Analysis

Examine/break apart information
Some item stems: explain, connect, classify, categorize, compare, analyze, distinguish, examine, contrast, investigate, separate

Examples:
Compare Plato's Republic with Lenin's April Theses.
Which of the following names of God is most different from the other three: (a) JEHOVAH (b) ELOHIM (c) KURIOS (d) DESPOTES


BLOOMS TAXONOMY

Level 5- Synthesis
Create with information
Some item stems: combine, integrate, modify, hypothesize, abstract, create, design, invent, compose, predict, plan, imagine, propose, devise, formulate, conjecture

Example:

Conjecture about Stephen's response to Paul (né Saul), were they to have met after Paul's Roman imprisonment.


BLOOMS TAXONOMY

Level 6- Evaluation
Combine previous information and skills to make a judgment
Some item stems: judge, select, choose, decide, justify, debate, verify, argue, recommend, assess, discuss, rate, prioritize, determine

Example:

Appraise Calvin's Institutes in light of Oberman's The Dawn of the Reformation.
Who deserves precedence as the earliest Baptist church in North America: Roger Williams's Providence church or John Clarke's Newport church? Support your answer with scholarly sources.

Some Item Types


ASSESSMENT PROCESS

Learning Objectives → Cognitive Processes → Item Responses → Test Scores → Inference About the Construct


ITEM TYPE #1 TRUE-FALSE


Example: Augustine wrote The Confessions. T/F?

Pros:
Convenient to write
Easy to score
Allows flexibility in content coverage

Cons:
Limited in cognitive processes covered
Guessing
Student response sets


ITEM TYPE #1 TRUE-FALSE


Best Practice:

Make the statements as short and specific as possible
One idea per statement
Avoid trivial information
Use positive statements instead of negative, and always avoid double negatives
Do not use opinion statements unless they are attributed to someone
Length should not differ between true and false statements
Use approximately equal numbers of true and false statements


ITEM TYPE #2 MULTIPLE CHOICE RESPONSE


Example: Who is famous for his 95 Theses?


(a) Pope Leo X; (b) Martin Luther; (c) Johann Eck

Pros:
"Best answer" format is more flexible than unequivocal true/false
Allows different cognitive processes in item response
Guessing is less of a factor than with true/false
Easy to score

Cons:
Writing good distracters (wrong response alternatives) takes a large amount of time
Guessing is still possible


ITEM TYPE #2 MULTIPLE CHOICE RESPONSE


Best Practice:

Item stems should: (a) have autonomous meaning, (b) present as much of the item as possible, and (c) contain no irrelevant material
Avoid negative item stems
All item responses should be grammatically compatible with their stem and of approximately equal length
There should be only one correct/best answer
Distracters should be plausible
Avoid clues in the item stem
Avoid "none of the above" and "all of the above" response options

ITEM TYPE #3 MATCHING


Example: Match the philosopher with their work:

A. Plato        _A_ 1. The Socratic Dialogues
B. Aristotle    _C_ 2. None
C. Socrates     _B_ 3. Organon
D. Euclid       _D_ 4. The Elements
E. Zeno         _E_ 5. Reminiscences of Crates

Pros:
Can cover much material in the content domain
Easy to administer

Cons:
Limited in cognitive processes covered
Difficult to find homogeneous material
Difficult to develop a good, plausible set of responses

ITEM TYPE #3 MATCHING


Best Practice:
Use homogeneous material
Have unequal numbers of stems and responses
Place responses in numerical or alphabetical order
Explicitly state the basis for finding a match
Place all items/responses on the same page


ITEM TYPE #4 FILL IN THE BLANK


Example: Martin Buber edited the _______, a Zionist periodical. (Die Welt)

Pros:
Very minimal guessing
Easy to construct item stems

Cons:
Must be scored by hand, with the possibility of multiple correct responses
Assesses only factual knowledge


ITEM TYPE #4 FILL IN THE BLANK


Best Practice:
Make the item require a short, specific response
Do not take item stems directly from textbooks
Questions are better than incomplete statements
Right- or left-justify the item response blanks, and make them the same size for all items
Only one blank per item


ITEM TYPE #5A SHORT RESPONSE


Example: List the Beatitudes.

Pros:
Can measure complex learning objectives and cognitive processes
Minimizes cheating

Cons:
Scoring can be subjective
Limited sampling of content


ITEM TYPE #5B ESSAY


Example: Explain how Nietzsche's notion of the "will to power" is a response to Schopenhauer's "will to live."
(Your answer should be no longer than 2 pages and should cite scholarly sources. It will be evaluated on your analysis of the cited scholarship and the skill with which the essay is organized.)

Pros:

Can help students connect related ideas


Responding can (possibly) be a learning exercise itself

Can measure complex objectives & processes

Cons:
Relies on both writing skills and content familiarity
Scoring is subjective, which lowers score reliability
Limited sampling of content


ITEM TYPE #5 SHORT ANSWER/ESSAY


Best Practice:

Use only for learning outcomes that require non-objective assessment
Map the questions directly onto learning objectives
Inform respondents of the grading criteria (e.g., content knowledge, thought organization)
Make the examinee's writing task explicit
Estimate the time needed for an appropriate answer
Give all examinees the same (or equivalent) questions; avoid optional questions
Outline the expected answer in advance
Develop a rubric that allocates points in the desired manner before administering the exam

Developing the Test


TEST SPECIFICATIONS

Content Domain

How do topics within the content area relate to each other, and how does knowledge in the area build?

Cognitive Skills/Processes Required to Answer Each Item

Distribution of Content Areas and Cognitive Skill Demands throughout the Test


TEST SPECIFICATIONS

For Classroom Evaluations, You Want Your Tests to Map onto Your Instructional Objectives/Learning Outcomes

Instructional Objective/Learning Outcome:
I. Demonstrates skill in critical thinking
   A. Comprehends relevant antecedents to historical events

Corresponding Test Item:
Name three precipitating events to the First Crusade.


TABLE OF SPECIFICATIONS
Major Content Area (rows) by Instructional Objective (columns); cell entries are numbers of items:

                                               Knowledge  Comprehension  Application  Analysis  Synthesis  Total Items
Early Christian Writers in the West                2            3             2           1         2          10
Luther and the Beginning of the Reformation        3            2             3           3         4          15
Liberal Protestantism in Modernity                 3            2             2           2         1          10
Total Items                                        8            7             7           6         7          35

Row totals give each content area's weight; column totals give each instructional objective's weight.
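The content and objective weights are just the marginal totals of the blueprint divided by the test length. A minimal sketch using the counts from the table above (the dict layout is ours, not from the slides):

```python
blueprint = {  # content area -> items per instructional objective
    "Early Christian Writers in the West":
        {"Knowledge": 2, "Comprehension": 3, "Application": 2, "Analysis": 1, "Synthesis": 2},
    "Luther and the Beginning of the Reformation":
        {"Knowledge": 3, "Comprehension": 2, "Application": 3, "Analysis": 3, "Synthesis": 4},
    "Liberal Protestantism in Modernity":
        {"Knowledge": 3, "Comprehension": 2, "Application": 2, "Analysis": 2, "Synthesis": 1},
}

total = sum(sum(row.values()) for row in blueprint.values())  # 35 items

content_weight = {area: sum(row.values()) / total for area, row in blueprint.items()}
objective_weight = {}
for row in blueprint.values():
    for objective, n in row.items():
        objective_weight[objective] = objective_weight.get(objective, 0) + n / total

print(content_weight)    # Luther unit carries 15/35 ≈ 43% of the test
print(objective_weight)  # Knowledge carries 8/35 ≈ 23%
```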

TEST LENGTH
No correct length
Depends on:
Administration time
Examinees
Scores needed
Content coverage
Item types used
Desired reliability


TEST ORGANIZATION

Directions

Be explicit:
Give the time allowed to take the test
Give directions for responding
Give the point allocation (weighting) if it differs across items

Item Grouping

If there are different item types on the test:
Group items by content area only if needed
Put the same item types together
Within a type, place items in order from simpler to more complex


TEST/ITEM SCORING

Points to Consider
Allow for partial credit?
Should content areas be weighted equally?
Should learning objectives be weighted equally?
If a test is made of multiple subtests, is each graded autonomously or is the test graded as a whole?

e.g., if Jane misses all 10 of the Liberal Protestantism in Modernity questions but gets the other 25 items correct, can she still pass the test? (One possible rule is sketched below.)
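The Jane question comes down to whether you impose a per-area floor in addition to an overall cut score. A hypothetical sketch (the cut scores, names, and rule are invented for illustration, not a recommendation):

```python
def passes(scores_by_area, overall_cut=0.60, area_floor=0.30):
    """scores_by_area maps content area -> (points earned, points possible)."""
    earned = sum(e for e, m in scores_by_area.values())
    possible = sum(m for e, m in scores_by_area.values())
    if earned / possible < overall_cut:
        return False
    # Per-area floor: every content area must clear a minimum proportion.
    return all(e / m >= area_floor for e, m in scores_by_area.values())

jane = {"Early Christian Writers": (10, 10),
        "Luther and the Reformation": (15, 15),
        "Liberal Protestantism": (0, 10)}
print(passes(jane))  # False: 25/35 overall, but one area is at 0%
```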


TEST/ITEM ANALYSIS

A Multiple-Item Test Provides Much Information

Item difficulties (e.g., percent who passed the item)
Do they differ by content area?
Do they differ by instructional objective?

Distracters
Are high scorers endorsing a distracter more than the correct answer?

Discrimination
How well does an item discriminate high scorers from low scorers? (see the sketch below)

Are there omitted items or items not reached?
Is there a pattern in those items?

Reliability Calculations & Validity Evidence
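Difficulty and discrimination are easy to compute from a 0/1 response matrix. A minimal sketch (the discrimination index used here is the corrected item-total correlation; function name ours):

```python
import numpy as np

def item_analysis(responses):
    """responses: examinees x items array of 0/1 scores."""
    responses = np.asarray(responses, dtype=float)
    difficulty = responses.mean(axis=0)       # proportion who passed each item
    totals = responses.sum(axis=1)
    discrimination = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest = totals - responses[:, j]       # total score excluding item j
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

diff, disc = item_analysis([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]])
print(diff)   # [0.75, 0.5, 0.25]
print(disc)   # low or negative values flag items worth reviewing
```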



TEST/ITEM ANALYSIS

For More Information:


EDP 5340: Measurement/Evaluation
Chapter 13 of Hollis-Sawyer, Thornton, Hurd, & Condon (2008)
Chapter 14 of Linn & Miller (2005)
Chapter 6 of Urbina (2004)
LERTAP program [http://www.assess.com/]


Take Home Message


TAKE HOME MESSAGE


Be Mindful in Test Construction
Be Purposeful in Item Selection and Development


Questions?


REFERENCES
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA/APA/NCME]. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Bloom, B. S. (1956). Taxonomy of educational objectives, Handbook I: The cognitive domain. New York: David McKay.

Brennan, R. L. (Ed.). (2006). Educational measurement (4th ed.). Westport, CT: Praeger.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Frisbie, D. A. (1988). Reliability of scores from teacher-made tests. Educational Measurement: Issues and Practice, 7, 25-35. [free: http://www.ncme.org/pubs/items/ITEMS_Mod_3.pdf]


REFERENCES
Hollis-Sawyer, L., Thornton, G. C., Hurd, B., & Condon, M. E. (2008). Exercises in psychological testing (2nd ed.). Boston: Allyn & Bacon.

Linn, R. L., & Miller, M. D. (2005). Measurement and assessment in teaching (9th ed.). Upper Saddle River, NJ: Pearson.

Urbina, S. (2004). Essentials of psychological testing. Hoboken, NJ: John Wiley & Sons.

