Academic Documents
Professional Documents
Culture Documents
Contents
Articles
Accuracy and precision, Activity vector analysis, Adaptive comparative judgement, Anchor test, Assessment centre, Assessment day, Base rate, Bias in Mental Testing, Bipolar spectrum diagnostic scale, Borderline intellectual functioning, Choice set, Citizen survey, Classical test theory, Cluster analysis (in marketing), Cognitive Process Profile, Common-method variance, Computer-Adaptive Sequential Testing, Computerized adaptive testing, Computerized classification test, Congruence coefficient, Conjoint analysis, Correction for attenuation, Counternull, Criterion-referenced test, Cronbach's alpha, Cutscore, Descriptive statistics, Dot cancellation test, Elementary cognitive task, Equating, Factor analysis, Figure rating scale, Fuzzy concept, G factor (psychometrics)

Francis Galton, Group size measures, Guttman scale, High-stakes testing, Historiometry, House-Tree-Person test, Idiographic image, Intelligence quotient, Internal consistency, Intra-rater reliability, IPPQ, Item bank, Item response theory, Jenkins activity survey, Jensen box, Kuder-Richardson Formula 20, Latent variable, Law of comparative judgment, Likert scale, Linear-on-the-fly testing, Frederic M. Lord, Measurement invariance, Mediation (statistics), Mental age, Mental chronometry, Missing completely at random, Moderated mediation, Moderation (statistics), Multidimensional scaling, Multiple mini interview, Multistage testing, Multitrait-multimethod matrix, Neo-Piagetian theories of cognitive development, NOMINATE (scaling method), Non-response bias, Norm-referenced test, Normal curve equivalent, Objective test

Online assessment, Operational definition, Operationalization, Opinion poll, Optimal discriminant analysis, Pairwise comparison, Pathfinder network, Perceptual mapping, Person-fit analysis, Phrase completions, Point-biserial correlation coefficient, Polychoric correlation, Polynomial conjoint measurement, Polytomous Rasch model, Progress testing, Projective test, Prometric, Psychological statistics, Psychometric function, Psychometrics of racism, Quantitative marketing research, Quantitative psychology, Questionnaire construction, Rasch model, Rasch model estimation, Rating scale, Rating scales for depression, Reliability (psychometrics), Repeatability, Reproducibility, Riddle scale, Risk Inclination Formula, Risk Inclination Model, Role-based assessment, Scale (social sciences), Self-report inventory, Semantic differential, Sequential probability ratio test

SESAMO, Situational judgement test, Psychometric software, Spearman-Brown prediction formula, Standard-setting study, Standards for Educational and Psychological Testing, Stanford-Binet Intelligence Scales, Stanine, Statistical hypothesis testing, Statistical inference, Survey methodology, Sten scores, Structural equation modeling, Lewis Terman, Test (assessment), Test score, Theory of conjoint measurement, Thurstone scale, Thurstonian model, Torrance Tests of Creative Thinking, William H. Tucker, Validity (statistics), Values scales, Vestibulo emotional reflex, Visual analogue scale, Youth Outcome Questionnaire, Attribute Hierarchy Method, Differential item functioning, Psychometrics, Vineland Adaptive Behavior Scale
Accuracy and precision

Accuracy indicates the proximity of measurement results to the true value; precision indicates the repeatability, or reproducibility, of the measurement.
A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy. The result would be a consistent yet inaccurate string of results from the flawed experiment. Eliminating the systematic error improves accuracy but does not change precision. A measurement system is designated valid if it is both accurate and precise. Related terms include bias (non-random or directed effects caused by a factor or factors unrelated to the independent variable) and error (random variability). The terminology is also applied to indirect measurements, that is, values obtained by a computational procedure from observed data. In addition to accuracy and precision, measurements may also have a measurement resolution, which is the smallest change in the underlying physical quantity that produces a response in the measurement. In the case of full reproducibility, such as when rounding a number to a representable floating point number, the word precision has a meaning not related to reproducibility. For example, in the IEEE 754-2008 standard it means the number of bits in the significand, so it is used as a measure for the relative accuracy with which an arbitrary number can be represented.
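A minimal simulation can make this point concrete. The sketch below (Python with NumPy; the true value, systematic offset, and noise level are invented for illustration) shows that averaging more readings shrinks the standard error, improving precision, while the bias, and hence the inaccuracy, stays put:

    import numpy as np

    rng = np.random.default_rng(0)

    true_value = 10.0
    systematic_error = 0.5   # hypothetical constant instrument bias
    noise_sd = 2.0           # random, precision-limiting error

    for n in (10, 10_000):
        readings = true_value + systematic_error + rng.normal(0, noise_sd, size=n)
        mean = readings.mean()
        sem = readings.std(ddof=1) / np.sqrt(n)   # precision of the average
        print(f"n={n:6d}  mean={mean:.3f}  std error={sem:.4f}  bias={mean - true_value:+.3f}")

    # Larger n shrinks the standard error (better precision), but the ~0.5
    # bias (worse accuracy) persists until the systematic error is removed.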
The analogy used here to explain the difference between accuracy and precision is the target comparison. In this analogy, repeated measurements are compared to arrows that are shot at a target. Accuracy describes the closeness of arrows to the bullseye at the target center. Arrows that strike closer to the bullseye are considered more accurate. The closer a system's measurements are to the accepted value, the more accurate the system is considered to be.
To continue the analogy, if a large number of arrows are shot, precision would be the size of the arrow cluster. (When only one arrow is shot, precision is the size of the cluster one would expect if this were repeated many times under the same conditions.) When all arrows are grouped tightly together, the cluster is considered precise since they all struck close to the same spot, even if not necessarily near the bullseye. The measurements are precise, though not necessarily accurate. However, it is not possible to reliably achieve accuracy in individual measurements without precision: if the arrows are not grouped close to one another, they cannot all be close to the bullseye. (Their average position might be an accurate estimation of the bullseye, but the individual arrows are inaccurate.) See also circular error probable for application of precision to the science of ballistics.

[Figure: high precision, but low accuracy]
Quantification
Ideally a measurement device is both accurate and precise, with measurements all close to and tightly clustered around the known value. The accuracy and precision of a measurement process is usually established by repeatedly measuring some traceable reference standard. Such standards are defined in the International System of Units (abbreviated SI from French: Système international d'unités) and maintained by national standards organizations such as the National Institute of Standards and Technology in the United States.

This also applies when measurements are repeated and averaged. In that case, the term standard error is properly applied: the precision of the average is equal to the known standard deviation of the process divided by the square root of the number of measurements averaged. Further, the central limit theorem shows that the probability distribution of the averaged measurements will be closer to a normal distribution than that of individual measurements.

With regard to accuracy we can distinguish:
- the difference between the mean of the measurements and the reference value, the bias. Establishing and correcting for bias is necessary for calibration.
- the combined effect of that and precision.

A common convention in science and engineering is to express accuracy and/or precision implicitly by means of significant figures. Here, when not explicitly stated, the margin of error is understood to be one-half the value of the last significant place. For instance, a recording of 843.6 m, or 843.0 m, or 800.0 m would imply a margin of 0.05 m (the last significant place is the tenths place), while a recording of 8,436 m would imply a margin of error of 0.5 m (the last significant digits are the units).

A reading of 8,000 m, with trailing zeroes and no decimal point, is ambiguous; the trailing zeroes may or may not be intended as significant figures. To avoid this ambiguity, the number could be represented in scientific notation: 8.0 × 10³ m indicates that the first zero is significant (hence a margin of 50 m), while 8.000 × 10³ m indicates that all three zeroes are significant, giving a margin of 0.5 m. Similarly, it is possible to use a multiple of the basic measurement unit: 8.0 km is equivalent to 8.0 × 10³ m. In fact, it indicates a margin of 0.05 km (50 m). However, reliance on this convention can lead to false precision errors when accepting data from sources that do not obey it.

Precision is sometimes stratified into:
- Repeatability: the variation arising when all efforts are made to keep conditions constant by using the same instrument and operator, and repeating during a short time period; and
- Reproducibility: the variation arising using the same measurement process among different instruments and operators, and over longer time periods.
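As an illustration of the significant-figures convention, the sketch below uses Python's decimal module to recover the implied margin of error from the written form of a reading. It assumes every digit in the input string is intended to be significant, and the example readings are taken from the paragraph above (thousands separators omitted, since Decimal does not accept them):

    from decimal import Decimal

    def implied_margin(recorded: str) -> float:
        """Half the value of the last significant decimal place, per the
        convention above; assumes all digits in the string are significant."""
        exponent = Decimal(recorded).as_tuple().exponent  # "843.6" -> -1
        return 0.5 * 10 ** exponent

    for reading in ["843.6", "843.0", "8436", "8.0e3", "8.000e3"]:
        print(f"{reading:>8} m  ->  +/- {implied_margin(reading)} m")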
In binary classification
Accuracy is also used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition.
                      Condition (as determined by Gold standard)
                      True               False
Test      Positive    True positive      False positive     Positive predictive value, or Precision
outcome   Negative    False negative     True negative      Negative predictive value
                                                            Accuracy
That is, the accuracy is the proportion of true results (both true positives and true negatives) in the population:

Accuracy = (True positives + True negatives) / Total population

It is a parameter of the test.
On the other hand, precision, or positive predictive value, is defined as the proportion of the true positives against all the positive results (both true positives and false positives):

Precision = True positives / (True positives + False positives)

An accuracy of 100% means that the measured values are exactly the same as the given values. See also Sensitivity and specificity. Accuracy may be determined from Sensitivity and Specificity, provided Prevalence is known, using the equation:

Accuracy = (Sensitivity × Prevalence) + (Specificity × (1 − Prevalence))
The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. It may be better to avoid the accuracy metric in favor of other metrics such as precision and recall.[citation needed] In situations where the minority class is more important, F-measure may be more appropriate, especially in situations with very skewed class imbalance.
Another useful performance measure is the balanced accuracy, which avoids inflated performance estimates on imbalanced datasets. It is defined as the arithmetic mean of sensitivity and specificity, or the average accuracy obtained on either class:

Balanced accuracy = (Sensitivity + Specificity) / 2
If the classifier performs equally well on either class, this term reduces to the conventional accuracy (i.e., the number of correct predictions divided by the total number of predictions). In contrast, if the conventional accuracy is above chance only because the classifier takes advantage of an imbalanced test set, then the balanced accuracy, as appropriate, will drop to chance.[3] A closely related chance corrected measure is:
Informedness = 2 × Balanced accuracy − 1
while a direct approach to debiasing and renormalizing accuracy is Cohen's kappa, whilst Informedness has been shown to be a Kappa-family debiased renormalization of Recall.[4] Informedness and Kappa have the advantage that chance level is defined to be 0, and they have the form of a probability. Informedness has the stronger property that it is the probability that an informed decision is made (rather than a guess) when it is positive. When it is negative, this still holds for the absolute value of Informedness, but the information has been used to force an incorrect response.
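To tie the definitions in this section together, here is a small Python sketch computing accuracy, precision, balanced accuracy, and Informedness directly from the four confusion-matrix cells; the cell counts are invented to show how an imbalanced test set inflates conventional accuracy:

    def classification_metrics(tp, fp, fn, tn):
        """The metrics discussed above, from the four confusion-matrix cells."""
        sensitivity = tp / (tp + fn)                   # true positive rate
        specificity = tn / (tn + fp)                   # true negative rate
        accuracy    = (tp + tn) / (tp + fp + fn + tn)
        precision   = tp / (tp + fp)                   # positive predictive value
        balanced    = (sensitivity + specificity) / 2
        informedness = sensitivity + specificity - 1   # = 2 * balanced - 1
        return dict(accuracy=accuracy, precision=precision,
                    balanced_accuracy=balanced, informedness=informedness)

    # A heavily imbalanced test set: conventional accuracy looks good (0.95),
    # while balanced accuracy (~0.55) and Informedness (~0.09) reveal
    # near-chance performance.
    print(classification_metrics(tp=5, fp=5, fn=45, tn=945))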
In logic simulation
In logic simulation, a common mistake in evaluation of accurate models is to compare a logic simulation model to a transistor circuit simulation model. This is a comparison of differences in precision, not accuracy. Precision is measured with respect to detail and accuracy is measured with respect to reality.[5][6]
In information systems
The concepts of accuracy and precision have also been studied in the context of databases, information systems, and their sociotechnical context. The necessary extension of these two concepts on the basis of theory of science suggests that they (as well as data quality and information quality) should be centered on accuracy, defined as the closeness to the true value, seen as the degree of agreement of readings or of calculated values of one and the same conceived entity, measured or calculated by different methods, in the context of maximum possible disagreement.[7]
References
[1] JCGM 200:2008 International vocabulary of metrology - Basic and general concepts and associated terms (VIM) (http://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2008.pdf)
[2] BS ISO 5725-1: "Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions", p. 1 (1994)
[3] K.H. Brodersen, C.S. Ong, K.E. Stephan, J.M. Buhmann (2010). The balanced accuracy and its posterior distribution (http://www.icpr2010.org/pdfs/icpr2010_WeBCT8.62.pdf). Proceedings of the 20th International Conference on Pattern Recognition, 3121-3124.
[5] John M. Acken, Encyclopedia of Computer Science and Technology, Vol. 36, 1997, pages 281-306
[6] 1990 Workshop on Logic-Level Modelling for ASICS, Mark Glasser, Rob Mathews, and John M. Acken, SIGDA Newsletter, Vol. 20, Number 1, June 1990
[7] Ivanov, K. (1972). "Quality-control of information: On the concept of accuracy of information in data banks and in management information systems" (http://www.informatik.umu.se/~kivanov/diss-avh.html).
External links
BIPM - Guides in metrology (http://www.bipm.org/en/publications/guides/) - Guide to the Expression of Uncertainty in Measurement (GUM) and International Vocabulary of Metrology (VIM)
"Beyond NIST Traceability: What really creates accuracy" (http://img.en25.com/Web/Vaisala/NIST-article.pdf) - Controlled Environments magazine
Precision and Accuracy with Three Psychophysical Methods (http://www.yorku.ca/psycho)
Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results, Appendix D.1: Terminology (http://physics.nist.gov/Pubs/guidelines/appd.1.html)
Accuracy and Precision (http://digipac.ca/chemical/sigfigs/contents.htm)
Accuracy vs Precision (http://www.youtube.com/watch?v=_LL0uiOgh1E&feature=youtube_gdata_player) - a brief, clear video by Matt Parker
Adaptive comparative judgement

Introduction
Traditional exam script marking began in Cambridge in 1792 when, with undergraduate numbers rising, the importance of proper ranking of students was growing. So in 1792 the new Proctor of Examinations, William Farish, introduced marking, a process in which every examiner gives a numerical score to each response by every student, and the overall total mark puts the students in the final rank order. Francis Galton (1869) noted that, in an unidentified year about 1863, the Senior Wrangler scored 7,634 out of a maximum of 17,000, while the Second Wrangler scored 4,123. (The Wooden Spoon scored only 237.) Prior to 1792, a team of Cambridge examiners convened at 5pm on the last day of examining, reviewed the 19 papers each student had sat, and published their rank order at midnight. Marking solved the problems of numbers and prevented unfair personal bias, and its introduction was a step towards modern objective testing, the format it is best suited to. But the technology of testing that followed, with its major emphasis on reliability and the automatisation of marking, has been an uncomfortable partner for some areas of educational achievement: assessing writing or speaking and other kinds of performance needs something more qualitative and judgemental.

The technique of Adaptive Comparative Judgement is an alternative to marking. It returns to the pre-1792 idea of sorting papers according to their quality, but retains the guarantee of reliability and fairness. It is by far the most reliable way known to score essays or more complex performances. It is much simpler than marking, and has been preferred by almost all examiners who have tried it. The real appeal of Adaptive Comparative Judgement lies in how it can re-professionalise the activity of assessment and how it can re-integrate assessment with learning.
History
Thurstone's Law of Comparative Judgement
"There is no such thing as absolute judgement" (Laming, 2004)[1]

The science of comparative judgement began with Louis Leon Thurstone of the University of Chicago. A pioneer of psychophysics, he proposed several ways to construct scales for measuring sensation and other psychological properties. One of these was the Law of comparative judgment (Thurstone, 1927a, 1927b),[2][3] which defined a mathematical way of modeling the chance that one object will beat another in a comparison, given values for the quality of each. This is all that is needed to construct a complete measurement system. A variation on his model (see Pairwise comparison and the BTL model) states that the difference between their quality values is equal to the log of the odds that object A will beat object B:

quality(A) − quality(B) = log [ P(A beats B) / P(B beats A) ]
Before the availability of modern computers, the mathematics needed to calculate the values of each object's quality meant that the method could only be used with small sets of objects, and its application was limited. For Thurstone, the objects were generally sensations, such as intensity, or attitudes, such as the seriousness of crimes, or statements of opinions. Social researchers continued to use the method, as did market researchers, for whom the objects might be different hotel room layouts, or variations on a proposed new biscuit.
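With modern tools the estimation is straightforward. The following Python sketch fits quality values to invented win/loss counts by maximum likelihood, using the classical iterative (Zermelo-style) scheme for the Bradley-Terry-Luce model; it illustrates the model above, not the procedure Thurstone himself used:

    import numpy as np

    # wins[i][j] = number of times object i beat object j (invented counts)
    wins = np.array([[0, 8, 9],
                     [2, 0, 7],
                     [1, 3, 0]], dtype=float)

    n = wins.shape[0]
    v = np.ones(n)                      # strength parameters, exp(quality)
    total = wins + wins.T               # comparisons between each pair
    for _ in range(200):                # Zermelo's fixed-point iteration
        for i in range(n):
            num = wins[i].sum()         # total wins of object i
            den = sum(total[i, j] / (v[i] + v[j]) for j in range(n) if j != i)
            v[i] = num / den
        v /= v.sum()                    # fix the arbitrary scale

    quality = np.log(v)                 # log-odds scale, as in the model above
    print(quality - quality.mean())     # centred quality values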
In the 1970s and 1980s Comparative Judgement appeared, almost for the first time in educational assessment, as a theoretical basis or precursor for the new Latent Trait or Item Response Theories (Andrich, 1978). These models are now standard, especially in item banking and adaptive testing systems.
Re-introduction in education
The first published paper using Comparative Judgement in education was Pollitt & Murray (1994), essentially a research paper concerning the nature of the English proficiency scale assessed in the speaking part of Cambridge's CPE exam. The objects were candidates, represented by 2-minute snippets of video recordings from their test sessions, and the judges were Linguistics post-graduate students with no assessment training. The judges compared pairs of video snippets, simply reporting which they thought the better student, and were then clinically interviewed to elicit the reasons for their decisions. Pollitt then introduced Comparative Judgement to the UK awarding bodies, as a method for comparing the standards of A Levels from different boards. Comparative judgement replaced their existing method, which required direct judgement of a script against the official standard of a different board. For the first two or three years of this Pollitt carried out all of the analyses for all the boards, using a program he had written for the purpose. It immediately became the only experimental method used to investigate exam comparability in the UK; the applications for this purpose from 1996 to 2006 are fully described in Bramley (2007).[4] In 2004 Pollitt presented a paper at the conference of the International Association for Educational Assessment titled "Let's Stop Marking Exams", and another at the same conference in 2009 titled "Abolishing Marksism". In each paper the aim was to convince the assessment community that there were significant advantages to using Comparative Judgement in place of marking for some types of assessment. In 2010 he presented a paper at the Association for Educational Assessment Europe, "How to Assess Writing Reliably and Validly", which presented evidence of the extraordinarily high reliability that has been achieved with Comparative Judgement in assessing primary school pupils' skill in first language English writing.
TAG Developments and Pollitt ran three trials, increasing the sample size from 20 to 249 students, and developing both the judging system and the assessment system. There are three pilots, involving Geography and Science as well as the original in Design & Technology.

Primary school writing

In late 2009 TAG Developments and Pollitt trialled a new version of the system for assessing writing. A total of 1000 primary school scripts were evaluated by a team of 54 judges in a simulated national assessment context. The reliability of the resulting scores after each script had been judged 16 times was 0.96, considerably higher than in any other reported study of similar writing assessment. Further development of the system has shown that a reliability of 0.93 can be reached after about 9 judgements of each script, at which point the system is no more expensive than single marking but still much more reliable. Several projects are underway at present, in England, Scotland, Ireland, Israel, Singapore and Australia. They range from primary school to university in context, and include both formative and summative assessment, from writing to Mathematics. The basic web system is now available on a commercial basis from TAG Developments (http://www.tagdevelopments.com), and can be modified to suit specific needs.
References
[1] Laming, D. R. J. (2004). Human judgment: the eye of the beholder. London: Thomson.
[2] Thurstone, L. L. (1927a). Psychophysical analysis. American Journal of Psychology, 38, 368-389. Chapter 2 in Thurstone, L. L. (1959). The measurement of values. University of Chicago Press, Chicago, Illinois.
[3] Thurstone, L. L. (1927b). The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 21, 384-400. Chapter 7 in Thurstone, L. L. (1959). The measurement of values. University of Chicago Press, Chicago, Illinois.
[4] Bramley, T (2007). Paired comparison methods. In Newton, P, Baird, J, Patrick, H, Goldstein, H, Timms, P and Wood, A (Eds). Techniques for monitoring the comparability of examination standards. London: QCA.
[5] Kimbell, R A and Pollitt, A (2008). Coursework assessment in high stakes examinations: authenticity, creativity, reliability. Third international Rasch measurement conference. Perth, Western Australia, January.
APA, AERA and NCME (1999). Standards for Educational and Psychological Testing.
Galton, F (1869). Hereditary genius: an inquiry into its laws and consequences. London: Macmillan.
Kimbell, R A, Wheeler, A, Miller, S, and Pollitt, A (2007). e-scape portfolio assessment (e-solutions for creative assessment in portfolio environments) phase 2 report. TERU Goldsmiths, University of London. ISBN 978-1-904158-79-0
Pollitt, A (2004). Let's stop marking exams. Annual Conference of the International Association for Educational Assessment, Philadelphia, June. Available at http://www.camexam.co.uk publications.
Pollitt, A (2009). Abolishing Marksism, and rescuing validity. Annual Conference of the International Association for Educational Assessment, Brisbane, September. Available at http://www.camexam.co.uk publications.
Pollitt, A, & Murray, NJ (1993). What raters really pay attention to. Language Testing Research Colloquium, Cambridge. Republished in Milanovic, M & Saville, N (Eds), Studies in Language Testing 3: Performance Testing, Cognition and Assessment. Cambridge University Press, Cambridge.
External links
E-scape
Anchor test
In psychometrics, an anchor test is a common set of test items administered in combination with two or more alternative forms of the test with the aim of establishing the equivalence of the test scores on the alternative forms. The purpose of the anchor test is to provide a baseline for an equating analysis between different forms of a test.[1]
References
[1] Kolen, M.J., & Brennan, R.L. (1995). Test Equating. New York: Springer.
Assessment centre
An assessment centre is a place at which a person, such as a member of staff, is assessed to determine their suitability for particular roles, especially management or military command. The candidates' personality and aptitudes are determined by a variety of techniques including interviews, examinations and psychometric testing.
History
Assessment centres were first created in World War II to select officers. Examples include the Admiralty Interview Board of the Royal Navy and the War Office Selection Board of the British Army.[1] AT&T created a building for recruitment of staff in the 1950s. This was called The Assessment Centre, and it was influential on subsequent personnel methods in other businesses.[2] Other companies use this method to recruit for their graduate programmes by assessing the personality and intellect of potential employees who are fresh out of university and have no work history. The big four accountancy firms conduct assessment centre days to recruit their trainees. 68% of employers in the UK and USA now use some form of assessment centre as part of their recruitment/promotion process.[3]
References
[3] www.assessmentcentrehq.com
Assessment day
An assessment day is usually used in the context of recruitment. On this day, job applicants are invited to an assessment centre, where a combination of objective selection techniques is used to measure suitability for a job. These techniques include exercises such as e-tray, in-tray, presentation, group exercise, attending a conference call, role play, and personality questionnaires. Most large companies now use this method to recruit fresh talent into their graduate programmes. There are many consultancies that focus on preparing candidates for these assessment days; for example, Green Turn is a well-known consultancy that trains applicants for the assessment days of the big 4 accountancy firms.
History
Assessment centres were first created in World War II to select officers. Examples include the Admiralty Interview Board of the Royal Navy and the War Office Selection Board of the British Army.[1] AT&T created a building for recruitment of staff in the 1950s. This was called The Assessment Centre, and it was influential on subsequent personnel methods in other businesses.[2]
References
Base rate
In probability and statistics, base rate generally refers to the (base) class probabilities unconditioned on featural evidence, frequently also known as prior probabilities. In plainer words, if it were the case that 1% of the public were "medical professionals", and 99% of the public were not "medical professionals", then the base rate of medical professionals is simply 1%.

In science, particularly medicine, the base rate is critical for comparison. It may at first seem impressive that 1,000 people beat their winter cold while using 'Treatment X', until we look at the entire 'Treatment X' population and find that the base rate of success is actually only 1/100 (i.e. 100,000 people tried the treatment, but the other 99,000 people never really beat their winter cold). The treatment's effectiveness is clearer when such base rate information (i.e. "1,000 people... out of how many?") is available. Note that controls may likewise offer further information for comparison; maybe the control groups, who were using no treatment at all, had their own base rate success of 5/100. Controls thus indicate that 'Treatment X' actually makes things worse, despite the initial proud claim about 1,000 people.
Overview
Mathematician Keith Devlin provides an illustration of the risks of committing, and the challenges of avoiding, the base rate fallacy. He asks us to imagine that there is a type of cancer that afflicts 1% of all people. A doctor then says there is a test for that cancer which is about 80% reliable. He also says that the test provides a positive result for 100% of people who have the cancer, but it also results in a 'false positive' for 20% of people who do not actually have the cancer. Now, if we test positive, we may be tempted to think it is 80% likely that we have the cancer. Devlin explains that, in fact, our odds are less than 5%. What is missing from the jumble of statistics is the most relevant base rate information. We should ask the doctor "Out of the number of people who test positive at all (this is the base rate group that we care about), how many end up actually having the cancer?".[1] Naturally, in assessing the probability that a given individual is a member of a particular class, we must account for other information besides the base rate. In particular, we must account for featural evidence. For example, when we see a person
wearing a white doctor's coat and stethoscope, and prescribing medication, we have evidence which may allow us to conclude that the probability of this particular individual being a "medical professional" is considerably greater than the category base rate of 1%. The normative method for integrating base rates (prior probabilities) and featural evidence (likelihoods) is given by Bayes' rule. A large number of psychological studies have examined a phenomenon called base-rate neglect, in which category base rates are not integrated with featural evidence in the normative manner.
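A short calculation shows how Bayes' rule combines the base rate with the test characteristics in Devlin's example; this Python sketch is a direct transcription of the numbers quoted above:

    def posterior(base_rate, sensitivity, false_positive_rate):
        """P(condition | positive test), by Bayes' rule."""
        p_positive = (sensitivity * base_rate
                      + false_positive_rate * (1 - base_rate))
        return sensitivity * base_rate / p_positive

    # Devlin's cancer example: 1% base rate, 100% sensitivity,
    # 20% false positive rate.
    print(posterior(0.01, 1.00, 0.20))   # ~0.048, i.e. under 5%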
References
[1] http://www.edge.org/responses/what-scientific-concept-would-improve-everybodys-cognitive-toolkit
Bias in Mental Testing
Bias in Mental Testing is a book by Arthur Jensen about the idea of bias in IQ tests.
Background
In 1969, Arthur Jensen's article "How Much Can We Boost IQ and Scholastic Achievement?" initiated an immense controversy because of its suggestion that the reason for the difference in average IQ between African Americans and White Americans might involve genetic as well as cultural factors. One argument against this idea was that IQ tests are culturally biased against African Americans, and that any observed difference in average IQ must therefore be an artifact of the tests themselves. In the 1970s Jensen began researching the idea of test bias, and soon decided it would be beneficial to write a book reviewing the matter. Although he at first intended the book to be rather short, over the course of writing it he came to realize that the topic deserved a much more in-depth analysis, and the book eventually grew into something much larger.[1]
Summary
The book is based on the fact that the average IQ of African Americans had been consistently found to lie approximately 15 points lower than that of White Americans, and the accusation made by some psychologists that IQ tests are therefore culturally biased against African Americans. The book does not address the question whether the cause of the IQ gap is genetic or environmental, but only whether the tests themselves are valid.[2] The book presents several arguments that IQ tests are not biased. African Americans' lower average performance on IQ tests cannot be because of differences in vocabulary, because African Americans have slightly better performance on verbal tests than on nonverbal tests. The IQ difference also cannot be because the tests depend on White culture, or that Whites inevitably do better on tests designed by Whites. In fact, Blacks perform better on tests that are culturally loaded than they do on tests designed to not include cultural references unfamiliar to Blacks, and Japanese children tend to outscore White children by an average of six points. Nor can the difference be a reflection of socioeconomic status, because when Black and White children are tested who are at the same socioeconomic level, the difference between their average IQs is still twelve points.[2] The book also presents evidence that IQ tests work the same way for all English-speaking Americans born in the United States, regardless of race. One is that IQ tests have been very successful in predicting performance for all Americans in school, work, and the armed forces. Another is that the race and sex of the person administering a test does not significantly affect how African Americans perform on it. The ranking in difficulty of test items on IQ tests is the same for both groups, and so is the overall shape of the graph showing the number of people achieving each score, except that the curve is centered slightly lower for Blacks than it is for Whites.[2] Based on this data, Jensen concludes that tests which show a difference in average IQ between races are showing something real, rather than an artifact of the tests themselves. He argues that in competition for college admission
and jobs, IQ tests have the potential to be more fair than many of the alternatives, because they can judge ability in a way that is colorblind instead of relying on the judgement of an interviewer.[2]
References
[1] This Week's Citation Classic (http://garfield.library.upenn.edu/classics1987/A1987K668400001.pdf). Current Contents, number 46, November 16, 1987.
[2] The Return of Arthur Jensen (http://www.time.com/time/magazine/article/0,9171,947407,00.html). Time magazine, Sept. 24, 1979.
[3] Robert T. Brown, Cecil R. Reynolds, and Jean S. Whitaker. "Bias in Mental Testing since Bias in Mental Testing". School Psychology Quarterly, Vol 14(3), 1999, 208-238.
[4] Book Review: Perspectives on Bias in Mental Testing. Cecil R. Reynolds and Robert T. Brown. Applied Psychological Measurement, March 1985, vol. 9, no. 1, 99-107.
[5] Shephard, Lorie A. "The Case for Bias in Tests of Achievement and Scholastic Aptitude." In Arthur Jensen: Consensus and Controversy, edited by Sohan and Celia Modgil. The Falmer Press, 1987. Page 189.
[6] Brody, Nathan. Intelligence: Second edition. Academic Press, 1992. Page 287.
[7] John R. Graham and Jack A. Naglieri. Handbook of Psychology. John Wiley & Sons, 2003. Page 58.
Borderline intellectual functioning
Borderline intellectual functioning, also called borderline mental retardation, is a categorization of intelligence wherein a person has below average cognitive ability (generally an IQ of 70-85),[1] but the deficit is not as severe as mental retardation (70 or below). It is sometimes called below average IQ (BAIQ). This is technically a cognitive impairment; however, this group is not sufficiently mentally disabled to be eligible for specialized services.[2] Additionally, the DSM-IV-TR codes borderline intellectual functioning as V62.89,[3] which is generally not a billable code, unlike the codes for mental retardation. During school years, individuals with borderline intellectual functioning are often "slow learners."[2] Although a large percentage of this group fails to complete high school and can often achieve only a low socioeconomic status, most adults in this group blend in with the rest of the population.[2] Persons who fall into this categorization have a relatively normal expression of affect for their age, although their ability to think abstractly is rather limited. Reasoning displays a preference for concrete thinking. They are usually able to function day to day without assistance, including holding down a simple job and the basic responsibilities of maintaining a dwelling.
References
[2] The Best Test Preparation for the Advanced Placement Examination in Psychology, Research & Education Association. (2003), p. 99
Further reading
Gillberg, Christopher (1995). Clinical child neuropsychiatry. Cambridge: Cambridge University Press. pp. 47-48. ISBN 0-521-54335-5.
Harris, James C. (2006). Intellectual disability: understanding its development, causes, classification, evaluation, and treatment. New York: Oxford University Press. ISBN 0-19-517885-8.
Choice set
A choice set is one scenario, also known as a treatment, provided for evaluation by respondents in a choice experiment. Responses are collected and used to create a choice model. Respondents are usually provided with a series of differing choice sets for evaluation. The choice set is generated from an experimental design and usually involves two or more alternatives being presented together.
Alternatives
Each choice set presents a number of hypothetical alternatives (Car A and Car B in this example). There may be one or more alternatives, including the 'None' alternative.
Attributes
The attributes of the alternatives ideally are mutually exclusive and independent. When this is not possible, attributes are nested.
Levels
Each attribute has a number of possible levels over which it may range. The specific levels shown are driven by an experimental design. Levels are discrete, even when the attribute is a scalar such as price; in that case, the levels are discretized evenly along the range of allowable values.
Choice task
The respondent is asked to complete a choice task, usually indicating which of the alternatives they prefer. In this example, the choice task is 'forced'; an 'unforced' choice would allow respondents to also select 'Neither'. The response to the choice task is used as the dependent variable in the resulting choice model.
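A rough sketch of how choice sets might be generated and stored follows; the attributes, levels, and the use of a full-factorial design are invented for illustration (real studies typically use a fractional experimental design instead):

    import itertools

    # Hypothetical attributes and their discrete levels.
    attributes = {
        "price":  [15000, 20000, 25000],   # scalar attribute, discretized
        "colour": ["red", "blue"],
    }

    # Full-factorial set of alternative profiles (6 here).
    profiles = [dict(zip(attributes, combo))
                for combo in itertools.product(*attributes.values())]

    # One choice set = two alternatives presented together; the
    # respondent's pick becomes the dependent variable of the model.
    choice_set = {"Car A": profiles[0], "Car B": profiles[4]}
    print(choice_set)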
Citizen survey
A citizen survey is a kind of opinion poll which typically asks the residents of a specific jurisdiction for their perspectives on local issues, such as the quality of life in the community, their level of satisfaction with local government, or their political leanings. Such a survey can be conducted by mail, telephone, Internet, or in person. Citizen surveys were advanced by Harry Hatry[1] of the Urban Institute, who believed resident opinions to be as necessary to the actions of local government managers and elected officials as customer surveys are to business executives. Local government officials use the data from citizen surveys to assist them in allocating resources for maximum community benefit and forming strategic plans for community programs and policies. Many private firms and universities also conduct their own citizen surveys for similar purposes. In 1991, the International City and County Manager's Association (ICMA)[2] published a book by Thomas Miller and Michelle Miller Kobayashi titled Citizen Surveys: How To Do Them, How To Use Them, and What They Mean, that directed local government officials in the basic methods for conducting citizen surveys. The book was revised and republished in 2000. In 2001, ICMA partnered with Miller and Kobayashi's organization National Research Center, Inc.,[3] to bring The National Citizen Survey, a low-cost survey service, to local governments. National Research Center, Inc. maintains a database of over 500 jurisdictions representing more than 40 million Americans, allowing local governments to compare their cities' results with similar communities nearby or across the nation.
References
[1] Selected Research (http://www.urban.org/expert.cfm?ID=HarryPHatry)
[2] ICMA (http://www.icma.org)
[3] National Research Center - Specializing in Performance Measurement and Evaluation (http://www.n-r-c.com)
History
Classical test theory was born only after the following three achievements or ideas had been conceptualized: first, a recognition of the presence of errors in measurements; second, a conception of that error as a random variable; and third, a conception of correlation and how to index it. In 1904, Charles Spearman worked out how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction.[1] Spearman's finding is thought by some to be the beginning of classical test theory (Traub, 1997). Others who had an influence on the classical test theory framework include George Udny Yule, Truman Lee Kelley, those involved in making the Kuder-Richardson Formulas, Louis Guttman, and, most recently, Melvin Novick, not to mention others over the next quarter century after Spearman's initial findings.
Definitions
Classical test theory assumes that each person has a true score, T, that would be obtained if there were no errors in measurement. A person's true score is defined as the expected number-correct score over an infinite number of independent administrations of the test. Unfortunately, test users never observe a person's true score, only an observed score, X. It is assumed that observed score = true score plus some error:

X = T + E

where X is the observed score, T the true score, and E the error. Classical test theory is concerned with the relations between the three variables X, T, and E in the population.
These relations are used to say something about the quality of test scores. In this regard, the most important concept is that of reliability. The reliability of the observed test scores X, denoted ρ²(X,T), is defined as the ratio of true score variance σ²(T) to the observed score variance σ²(X):

ρ²(X,T) = σ²(T) / σ²(X)

Because the variance of the observed scores can be shown to equal the sum of the variance of true scores and the variance of error scores, this is equivalent to

ρ²(X,T) = σ²(T) / (σ²(T) + σ²(E))

This equation, which formulates a signal-to-noise ratio, has intuitive appeal: the reliability of test scores becomes higher as the proportion of error variance in the test scores becomes lower, and vice versa. The reliability is equal to the proportion of the variance in the test scores that we could explain if we knew the true scores. The square root of the reliability is the correlation between true and observed scores.
Reliability cannot be estimated directly, since that would require knowing the true scores. One way to estimate it is with parallel tests: two tests are parallel if, for every individual, they yield the same true score and have equal error variances, that is,

T = T'   and   σ²(E) = σ²(E')

Under these assumptions, it follows that the correlation between parallel test scores is equal to reliability (see Lord & Novick, 1968, Ch. 2, for a proof).
Using parallel tests to estimate reliability is cumbersome because parallel tests are very hard to come by. In practice the method is rarely used. Instead, researchers use a measure of internal consistency known as Cronbach's α. Consider a test consisting of k items X_j, j = 1, ..., k. The total test score for individual i is defined as the sum of the individual item scores:

X_i = X_i1 + X_i2 + ... + X_ik
Cronbach's α can be shown to provide a lower bound for reliability under rather mild assumptions. Thus, the reliability of test scores in a population is always higher than the value of Cronbach's α in that population. This method is empirically feasible and, as a result, it is very popular among researchers. The calculation of Cronbach's α is included in many standard statistical packages such as SPSS and SAS.

As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of reliability. Reliability is supposed to say something about the general quality of the test scores in question. The general idea is that, the higher reliability is, the better. Classical test theory does not say how high reliability is supposed to be. Too high a value for α, say over .9, indicates redundancy of items. Around .8 is recommended for personality research, while .9+ is desirable for individual high-stakes testing.[2] These 'criteria' are not based on formal arguments, but rather are the result of convention and professional practice. The extent to which they can be mapped to formal principles of statistical inference is unclear.
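Since the formula is simple, Cronbach's α is also easy to compute outside SPSS or SAS. The Python sketch below implements α = k/(k−1) × (1 − Σσ²_item / σ²_total) on invented item scores:

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """items: score matrix of shape (persons, k items)."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)     # variance of total score
        return k / (k - 1) * (1 - item_vars / total_var)

    # Invented scores for 5 persons on a 4-item test.
    scores = np.array([[3, 4, 3, 4],
                       [2, 2, 3, 2],
                       [4, 5, 4, 5],
                       [1, 2, 1, 2],
                       [3, 3, 4, 3]])
    print(round(cronbach_alpha(scores), 3))   # ~0.949 for these data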
Alternatives
Classical test theory is an influential theory of test scores in the social sciences. In psychometrics, the theory has been superseded by the more sophisticated models in Item Response Theory (IRT) and Generalizability theory (G-theory). However, IRT is not included in standard statistical packages like SPSS and SAS, whereas these packages routinely provide estimates of Cronbach's α. Specialized psychometric software is necessary for IRT or G-theory. However, general statistical packages often do not provide a complete classical analysis (Cronbach's α is only one of many important statistics), and in many cases, specialized software for classical analysis is also necessary.
A third shortcoming is that classical test theory assumes the standard error of measurement is the same for all examinees. However, as Hambleton explains in his book, scores on any test are unequally precise measures for examinees of different ability, thus making the assumption of equal errors of measurement for all examinees implausible (Hambleton, Swaminathan, Rogers, 1991, p. 4). A fourth, and final, shortcoming of classical test theory is that it is test oriented, rather than item oriented. In other words, classical test theory cannot help us make predictions of how well an individual or even a group of examinees might do on a test item.[4]
Notes
[1] Traub, R. (1997). Classical Test Theory in Historical Perspective. Educational Measurement: Issues and Practice, 16 (4), 8-14. doi:10.1111/j.1745-3992.1997.tb00603.x
[3] Hambleton, R., Swaminathan, H., Rogers, H. (1991). Fundamentals of Item Response Theory. Newbury Park, California: Sage Publications, Inc.
[4] Hambleton, R., Swaminathan, H., Rogers, H. (1991). Fundamentals of Item Response Theory. Newbury Park, California: Sage Publications, Inc.
References
Allen, M.J., & Yen, W. M. (2002). Introduction to Measurement Theory. Long Grove, IL: Waveland Press.
Novick, M.R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, Volume 3, Issue 1, February 1966, Pages 1-18.
Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley Publishing Company.
Further reading
Gregory, Robert J. (2011). Psychological Testing: History, Principles, and Applications (Sixth ed.). Boston: Allyn & Bacon. ISBN 978-0-205-78214-7. Lay summary (http://www.pearsonhighered.com/bookseller/product/Psychological-Testing-History-Principles-and-Applications-6E/9780205782147.page) (7 November 2010).
Hogan, Thomas P.; Brooke Cannon (2007). Psychological Testing: A Practical Introduction (Second ed.). Hoboken (NJ): John Wiley & Sons. ISBN 978-0-471-73807-7. Lay summary (http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP000675.html) (21 November 2010).
External links
International Test Commission article on Classical Test Theory (http://www.intestcom.org/Publications/ ORTA/Classical+test+theory.php)
Cluster analysis (in marketing)
Examples
The diagram below illustrates the results of a survey that studied drinkers' perceptions of spirits (alcohol). Each point represents the results from one respondent. The research indicates there are four clusters in this market. The axes represent two traits of the market; more complex cluster analyses may use more dimensions than this.
[Figure: illustration of clusters]

Another example is the vacation travel market. Recent research has identified three clusters or market segments: 1) the demanders, who want exceptional service and expect to be pampered; 2) the escapists, who want to get away and just relax; 3) the educationalists, who want to see new things, go to museums, go on a safari, or experience new cultures.

Cluster analysis, like factor analysis and multi-dimensional scaling, is an interdependence technique: it makes no distinction between dependent and independent variables. The entire set of interdependent relationships is examined. It is similar to multi-dimensional scaling in that both examine inter-object similarity by examining the complete set of interdependent relationships. The difference is that multi-dimensional scaling identifies underlying dimensions, while cluster analysis identifies clusters. Cluster analysis is the obverse of factor analysis. Whereas factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters.
Procedure
1. Formulate the problem: select the variables to which you wish to apply the clustering technique.
2. Select a distance measure. There are various ways of computing distance (see the sketch after this list):
   - Squared Euclidean distance: the sum of the squared differences in value for each variable
   - Manhattan distance: the sum of the absolute differences in value for any variable
   - Chebyshev distance: the maximum absolute difference in values for any variable
   - Mahalanobis (or correlation) distance: this measure uses the correlation coefficients between the observations and uses that as a measure to cluster them. This is an important measure since it is unit invariant (can figuratively compare apples to oranges).
3. Select a clustering procedure (see below).
4. Decide on the number of clusters.
5. Map and interpret clusters, and draw conclusions; illustrative techniques like perceptual maps, icicle plots, and dendrograms are useful.
6. Assess reliability and validity, by various methods:
   - repeat the analysis using a different distance measure
   - repeat the analysis using a different clustering technique
   - split the data randomly into two halves and analyze each part separately
   - repeat the analysis several times, deleting one variable each time
   - repeat the analysis several times, using a different order each time
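For concreteness, the distance measures in step 2 can be computed as follows; the two observation vectors and the covariance matrix used for the Mahalanobis distance are invented for illustration:

    import numpy as np

    x = np.array([2.0, 7.0, 1.0])
    y = np.array([5.0, 3.0, 2.0])

    sq_euclidean = ((x - y) ** 2).sum()   # squared Euclidean
    manhattan    = np.abs(x - y).sum()    # city-block
    chebyshev    = np.abs(x - y).max()    # maximum coordinate difference

    # Mahalanobis distance needs the covariance structure of the data the
    # points come from; S here is an invented covariance matrix.
    S = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.5, 0.2],
                  [0.1, 0.2, 1.0]])
    d = x - y
    mahalanobis = np.sqrt(d @ np.linalg.inv(S) @ d)

    print(sq_euclidean, manhattan, chebyshev, round(mahalanobis, 3))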
Clustering procedures
There are several types of clustering methods:

Non-hierarchical clustering (also called k-means clustering): first determine a cluster center, then group all objects that are within a certain distance. Examples:
- Sequential Threshold method: first determine a cluster center, then group all objects that are within a predetermined threshold from the center; one cluster is created at a time.
- Parallel Threshold method: several cluster centers are determined simultaneously, then objects that are within a predetermined threshold from the centers are grouped.
- Optimizing Partitioning method: first a non-hierarchical procedure is run, then objects are reassigned so as to optimize an overall criterion.

Hierarchical clustering: objects are organized into a hierarchical structure as part of the procedure. Examples:
- Divisive clustering: start by treating all objects as if they are part of a single large cluster, then divide the cluster into smaller and smaller clusters.
- Agglomerative clustering: start by treating each object as a separate cluster, then group them into bigger and bigger clusters. Examples:
  - Centroid methods: clusters are generated that maximize the distance between the centers of clusters (a centroid is the mean value for all the objects in the cluster).
  - Variance methods: clusters are generated that minimize the within-cluster variance. Example: Ward's Procedure, in which clusters are generated that minimize the squared Euclidean distance to the center mean.
  - Linkage methods: cluster objects based on the distance between them. Examples:
    - Single Linkage method: cluster objects based on the minimum distance between them (also called the nearest neighbour rule).
    - Complete Linkage method: cluster objects based on the maximum distance between them (also called the furthest neighbour rule).
    - Average Linkage method: cluster objects based on the average distance between all pairs of objects (one member of the pair must be from a different cluster).
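Here is a hedged sketch of agglomerative clustering with Ward's procedure, using SciPy's hierarchical-clustering routines on invented survey data; swapping the method argument to 'single', 'complete', or 'average' gives the corresponding linkage variants described above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    # Invented survey data: 20 respondents rated on 3 variables.
    data = rng.normal(size=(20, 3))

    # Agglomerative clustering with Ward's procedure.
    tree = linkage(data, method="ward")

    # Cut the hierarchy into 4 clusters and label each respondent.
    labels = fcluster(tree, t=4, criterion="maxclust")
    print(labels)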
Cognitive Process Profile

Unlike conventional psychometric ability and IQ tests, which primarily measure crystallised ability in specific content domains, the CPP measures information processing tendencies and capabilities. It also measures 'fluid intelligence' and 'learning potential' by tracking information processing in unfamiliar and fuzzy environments. The CPP predicts cognitive performance in complex, dynamic and vague (or VUCA) work contexts such as professional, strategic and executive environments. It was developed by Dr S M Prinsloo, founder of Cognadev, and released in 1994. Since then it has been translated into several languages and applied internationally for the purposes of leadership assessment, succession planning, selection and development, team compilation, as well as personal and team development within the corporate environment.
References
Thompson, D. (2008) Themes of Measurement and Prediction, in Business Psychology in Practice (ed P. Grant), Whurr Publishers Ltd, London, UK. Print ISBN 978-1-86156-476-4 Online ISBN 978-0-470-71328-0
External links
Cognadev, developer of the CPP [1]
Further reading
Jacques, Elliott (1988). Requisite Organisations. Cason Hall & Co, Arlington, VA. ISBN 1-886436-03-7
Beer, Stafford. The Viable System Model: Its Provenance, Development, Methodology and Pathology. The Journal of the Operational Research Society, Vol. 35, No. 1 (Jan. 1984), pp. 7-25.
References
[1] http://www.cognadev.com/products.aspx?pid=1/
Common-method variance
In applied statistics, (e.g., applied to the social sciences and psychometrics), common-method variance (CMV) is the spurious "variance that is attributable to the measurement method rather than to the constructs the measures represent"[] or equivalently as "systematic error variance shared among variables measured with and introduced as a function of the same method and/or source".[] Studies affected by CMV or common-method bias suffer from false correlations and run the risk of reporting incorrect research results.[]
Remedies
Ex-ante remedies
Several ex ante remedies exist that help to avoid or minimize possible common method variance. Important remedies have been collected by Chang et al. (2010).[]
Ex-post remedies
Using simulated data sets, Richardson et al. (2009) investigate three ex post techniques to test for common method variance: the correlational marker technique, the confirmatory factor analysis (CFA) marker technique, and the unmeasured latent method construct (ULMC) technique. Only the CFA marker technique turns out to provide some value.[] A comprehensive example of this technique has been demonstrated by Williams et al. (2010).[]
References
Computerized adaptive testing

See the special issue of Applied Measurement in Education[5] for more information on MST.
Advantages
Adaptive tests can provide uniformly precise scores for most test-takers.[2] In contrast, standard fixed tests almost always provide the best precision for test-takers of medium ability and increasingly poorer precision for test-takers with more extreme test scores. An adaptive test can typically be shortened by 50% and still maintain a higher level of precision than a fixed version.[1] This translates into a time savings for the test-taker. Test-takers do not waste their time attempting items that are too hard or trivially easy. Additionally, the testing organization benefits from the time savings; the cost of examinee seat time is substantially reduced. However, because the development of a CAT involves much more expense than a standard fixed-form test, a large population is necessary for a CAT testing program to be financially fruitful. Like any computer-based test, adaptive tests may show results immediately after testing. Adaptive testing, depending on the item selection algorithm, may reduce exposure of some items because examinees typically receive different sets of items rather than the whole population being administered a single set. However, it may increase the exposure of others (namely the medium or medium/easy items presented to most examinees at the beginning of the test).[2]
Disadvantages
The first issue encountered in CAT is the calibration of the item pool. In order to model the characteristics of the items (e.g., to pick the optimal item), all the items of the test must be pre-administered to a sizable sample and then analyzed. To achieve this, new items must be mixed into the operational items of an exam (the responses are recorded but do not contribute to the test-takers' scores), a practice called "pilot testing," "pre-testing," or "seeding."[2] This presents logistical, ethical, and security issues. For example, it is impossible to field an operational adaptive test with brand-new, unseen items;[6] all items must be pretested with a large enough sample to obtain stable item statistics. This sample may be required to be as large as 1,000 examinees.[6] Each program must decide what percentage of the test can reasonably be composed of unscored pilot test items.

Although adaptive tests have exposure control algorithms to prevent overuse of a few items,[2] the exposure conditioned upon ability is often not controlled and can easily become close to 1. That is, it is common for some items to become very common on tests for people of the same ability. This is a serious security concern because groups sharing items may well have a similar functional ability level. In fact, a completely randomized exam is the most secure (but also least efficient).

Review of past items is generally disallowed. Adaptive tests tend to administer easier items after a person answers incorrectly, so supposedly an astute test-taker could use such clues to detect incorrect answers and correct them. Alternatively, test-takers could be coached to deliberately pick wrong answers, leading to an increasingly easier test. After tricking the adaptive test into building a maximally easy exam, they could then review the items and answer them correctly, possibly achieving a very high score. Test-takers frequently complain about the inability to review.[7]

Because of its sophistication, the development of a CAT has a number of prerequisites.[8] The large sample sizes (typically hundreds of examinees) required by IRT calibrations must be present. Items must be scorable in real time if a new item is to be selected instantaneously. Psychometricians experienced with IRT calibrations and CAT simulation research are necessary to provide validity documentation. Finally, a software system capable of true IRT-based CAT must be available.
CAT components
There are five technical components in building a CAT (the following is adapted from Weiss & Kingsbury, 1984[1]). This list does not include practical issues, such as item pretesting or live field release.
1. Calibrated item pool
2. Starting point or entry level
3. Item selection algorithm
4. Scoring procedure
5. Termination criterion
Starting Point
In CAT, items are selected based on the examinee's performance up to a given point in the test. However, the CAT is obviously not able to make any specific estimate of examinee ability when no items have been administered, so some other initial estimate of examinee ability is necessary. If some previous information regarding the examinee is known, it can be used,[1] but often the CAT simply assumes that the examinee is of average ability, which is why the first item is often of medium difficulty.
Scoring Procedure
After an item is administered, the CAT updates its estimate of the examinee's ability level. If the examinee answered the item correctly, the CAT will likely estimate their ability to be somewhat higher, and vice versa. This is done by using the item response function from item response theory to obtain a likelihood function of the examinee's ability. Two methods for this are called maximum likelihood estimation and Bayesian estimation. The latter assumes an a priori distribution of examinee ability, and has two commonly used estimators: expectation a posteriori and maximum a posteriori. Maximum likelihood is equivalent to a Bayes maximum a posteriori estimate if a uniform (f(x)=1) prior is assumed.[6] Maximum likelihood is asymptotically unbiased, but cannot provide a theta estimate for a nonmixed (all correct or incorrect) response vector, in which case a Bayesian method may have to be used temporarily.[1]
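As a concrete illustration of these estimators, the following is a minimal sketch (not from the original article) of EAP scoring under a two-parameter logistic (2PL) item response function with a standard normal prior; the item parameters and responses are made-up values.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: probability of answering correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(responses, a, b, n_points=81):
    """Expectation a posteriori (EAP) ability estimate: the posterior mean
    of theta over a quadrature grid, assuming a standard normal prior."""
    theta = np.linspace(-4.0, 4.0, n_points)
    prior = np.exp(-0.5 * theta ** 2)              # N(0,1) density up to a constant
    likelihood = np.ones_like(theta)
    for u, ai, bi in zip(responses, a, b):
        p = p_correct(theta, ai, bi)
        likelihood *= p if u == 1 else (1.0 - p)   # multiply in each response
    posterior = prior * likelihood
    posterior /= posterior.sum()
    est = (theta * posterior).sum()                # posterior mean
    se = np.sqrt(((theta - est) ** 2 * posterior).sum())  # posterior SD as SE
    return est, se

# one correct and one incorrect answer on two illustrative items
est, se = eap_estimate([1, 0], a=[1.2, 0.8], b=[0.0, 0.5])
```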
Termination Criterion
The CAT algorithm is designed to repeatedly administer items and update the estimate of examinee ability. This will continue until the item pool is exhausted unless a termination criterion is incorporated into the CAT. Often, the test is terminated when the examinee's standard error of measurement falls below a certain user-specified value, hence the statement above that an advantage is that examinee scores will be uniformly precise or "equiprecise."[1] Other termination criteria exist for different purposes of the test, such as if the test is designed only to determine if the examinee should "Pass" or "Fail" the test, rather than obtaining a precise estimate of their ability.[1][9]
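A minimal sketch of this administer-update-stop loop, reusing the eap_estimate function from the scoring sketch above; the answer_fn callback (which returns the examinee's response to an item) and the 0.30 standard-error target are illustrative assumptions, not values prescribed by the cited literature.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def run_cat(bank_a, bank_b, answer_fn, se_target=0.30, max_items=40):
    """Administer maximum-information items and re-estimate ability until
    the standard error falls below se_target (or max_items is reached)."""
    administered, responses = [], []
    theta_hat, se = 0.0, 1.0                       # start at average ability
    for _ in range(max_items):
        remaining = [i for i in range(len(bank_a)) if i not in administered]
        best = max(remaining,                      # most informative item now
                   key=lambda i: info_2pl(theta_hat, bank_a[i], bank_b[i]))
        administered.append(best)
        responses.append(answer_fn(best))
        theta_hat, se = eap_estimate(responses,
                                     [bank_a[i] for i in administered],
                                     [bank_b[i] for i in administered])
        if se < se_target:                         # equiprecision termination
            break
    return theta_hat, se, administered
```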
Other issues
Pass-Fail CAT
In many situations, the purpose of the test is to classify examinees into two or more mutually exclusive and exhaustive categories. This includes the common "mastery test," where the two classifications are "pass" and "fail," but also includes situations where there are three or more classifications, such as "Insufficient," "Basic," and "Advanced" levels of knowledge or competency. The kind of "item-level adaptive" CAT described in this article is most appropriate for tests that are not "pass/fail" or for pass/fail tests where providing good feedback is extremely important.

Some modifications are necessary for a pass/fail CAT, also known as a computerized classification test (CCT).[9] For example, a new termination criterion and scoring algorithm must be applied that classifies the examinee into a category rather than providing a point estimate of ability. For examinees with true scores very close to the passing score, a CCT will result in long tests, while those with true scores far above or below the passing score will have the shortest exams.

There are two primary methodologies available for this. The more prominent of the two is the sequential probability ratio test (SPRT).[10][11] This formulates the examinee classification problem as a hypothesis test that the examinee's ability is equal to either some specified point above the cutscore or another specified point below the cutscore. Note that this is a point hypothesis formulation rather than the composite hypothesis formulation[12] that is more conceptually appropriate: a composite formulation would be that the examinee's ability is in the region above the cutscore or the region below the cutscore.

A confidence interval approach is also used: after each item is administered, the algorithm determines the probability that the examinee's true score is above or below the passing score.[13][14] For example, the algorithm may continue until the 95% confidence interval for the true score no longer contains the passing score. At that point, no further items are needed because the pass-fail decision is already 95% accurate, assuming that the psychometric models underlying the adaptive testing fit the examinee and test. This approach was originally called "adaptive mastery testing,"[13] but it can be applied to non-adaptive item selection and to classification situations with two or more cutscores (the typical mastery test has a single cutscore).[14]

As a practical matter, the algorithm is generally programmed to have a minimum and a maximum test length (or a minimum and maximum administration time). Otherwise, it would be possible for an examinee with ability very close to the cutscore to be administered every item in the bank without the algorithm making a decision.

The item selection algorithm utilized depends on the termination criterion. Maximizing information at the cutscore is more appropriate for the SPRT because it maximizes the difference in the probabilities used in the likelihood ratio.[15] Maximizing information at the ability estimate is more appropriate for the confidence interval approach because it minimizes the conditional standard error of measurement, which decreases the width of the confidence interval needed to make a classification.[14]
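For illustration, here is a minimal sketch of one Wald SPRT classification step under the same made-up 2PL assumptions as the earlier sketches; delta, alpha, and beta are arbitrary example settings, not values prescribed by the cited literature.

```python
import numpy as np

def sprt_classify(responses, a, b, cutscore, delta=0.5, alpha=0.05, beta=0.05):
    """Wald's SPRT for pass/fail: compare the likelihood of the responses at
    a point above the cutscore against a point below it."""
    t_hi, t_lo = cutscore + delta, cutscore - delta
    log_lr = 0.0
    for u, ai, bi in zip(responses, a, b):
        p_hi = 1.0 / (1.0 + np.exp(-ai * (t_hi - bi)))
        p_lo = 1.0 / (1.0 + np.exp(-ai * (t_lo - bi)))
        log_lr += np.log(p_hi / p_lo) if u == 1 else np.log((1 - p_hi) / (1 - p_lo))
    if log_lr >= np.log((1 - beta) / alpha):   # strong evidence ability is above
        return "pass"
    if log_lr <= np.log(beta / (1 - alpha)):   # strong evidence ability is below
        return "fail"
    return "continue testing"                  # administer another item
```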
References
[1] Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.
[2] Thissen, D., & Mislevy, R.J. (2000). Testing algorithms. In Wainer, H. (Ed.), Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates.
[3] Green, B.F. (2000). System design and operation. In Wainer, H. (Ed.), Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates.
[4] http://www.iacat.org/
[5] http://www.leaonline.com/toc/ame/19/3
[6] Wainer, H., & Mislevy, R.J. (2000). Item response theory, calibration, and estimation. In Wainer, H. (Ed.), Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates.
[7] http://edres.org/scripts/cat/catdemo.htm
[8] http://www.fasttestweb.com/ftw-docs/CAT_Requirements.pdf
[9] Lin, C.-J., & Spray, J.A. (2000). Effects of item-selection criteria on classification testing with the sequential probability ratio test (Research Report 2000-8). Iowa City, IA: ACT, Inc.
[10] Wald, A. (1947). Sequential analysis. New York: Wiley.
[11] Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.
[12] Weitzman, R. A. (1982). Sequential testing for selection. Applied Psychological Measurement, 6, 337-351.
[13] Kingsbury, G.G., & Weiss, D.J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.
[14] Eggen, T. J. H. M., & Straetmans, G. J. J. M. (2000). Computerized adaptive testing for classifying examinees into three categories. Educational and Psychological Measurement, 60, 713-734.
[15] Spray, J. A., & Reckase, M. D. (1994). The selection of test items for decision making with a computerized adaptive test. Paper presented at the Annual Meeting of the National Council for Measurement in Education (New Orleans, LA, April 5-7, 1994).
[16] Sympson, B.J., & Hetter, R.D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the annual conference of the Military Testing Association, San Diego.
[17] For example: van der Linden, W. J., & Veldkamp, B. P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273-291.
Additional sources
Drasgow, F., & Olson-Buchanan, J. B. (Eds.). (1999). Innovations in computerized assessment. Hillsdale, NJ: Erlbaum.
Van der Linden, W. J., & Glas, C.A.W. (Eds.). (2000). Computerized adaptive testing: Theory and practice. Boston, MA: Kluwer.
Wainer, H. (Ed.). (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Weiss, D.J. (Ed.). (1983). New horizons in testing: Latent trait theory and computerized adaptive testing. New York: Academic Press.
Further reading
"First Adaptive Test: Binet's IQ Test" (http://iacat.org/node/442), International Association for Computerized Adaptive Testing (IACAT) Sands, William A. (Ed); Waters, Brian K. (Ed); McBride, James R. (Ed), Computerized adaptive testing: From inquiry to operation (http://psycnet.apa.org/books/10244/), Washington, DC, US: American Psychological Association. (1997). xvii 292 pp. doi: 10.1037/10244-000 Zara, Anthony R., "Using Computerized Adaptive Testing to Evaluate Nurse Competence for Licensure: Some History and Forward Look" (http://www.springerlink.com/content/mh6p73432451g446/), Advances in Health Sciences Education, Volume 4, Number 1 (1999), 39-48, DOI: 10.1023/A:1009866321381
External links
International Association for Computerized Adaptive Testing (http://www.iacat.org)
Concerto: Open-source CAT Platform (http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm)
CAT Central (http://www.psych.umn.edu/psylabs/catcentral/) by David J. Weiss
Frequently Asked Questions about Computer-Adaptive Testing (CAT) (http://www.carla.umn.edu/assessment/CATfaq.html). Retrieved April 15, 2005.
An On-line, Interactive, Computer Adaptive Testing Tutorial (http://edres.org/scripts/cat/catdemo.htm) by Lawrence L. Rudner. November 1998. Retrieved April 15, 2005.
Special issue: An introduction to multistage testing (http://www.leaonline.com/toc/ame/19/3). Applied Measurement in Education, 19(3).
Computerized Adaptive Tests (http://www.ericdigests.org/pre-9213/tests.htm) - from the Education Resources Information Center Clearinghouse on Tests Measurement and Evaluation, Washington, DC
Computerized classification test
Psychometric Model
Two approaches are available for the psychometric model of a CCT: classical test theory (CTT) and item response theory (IRT). Classical test theory assumes a state model because it is applied by determining item parameters for a sample of examinees determined to be in each category. For instance, several hundred "masters" and several hundred "nonmasters" might be sampled to determine the difficulty and discrimination of each item, but doing so requires that a distinct set of people in each group can be readily identified. IRT, on the other hand, assumes a trait model; the knowledge or ability measured by the test is a continuum. The classification groups will need to be more or less arbitrarily defined along the continuum, such as the use of a cutscore to demarcate masters and nonmasters, but the specification of item parameters assumes a trait model.

There are advantages and disadvantages to each. CTT offers greater conceptual simplicity. More importantly, CTT requires fewer examinees in the sample for calibration of the item parameters eventually used in the design of the CCT, making it useful for smaller testing programs. See Frick (1992)[3] for a description of a CTT-based CCT. Most CCTs, however, utilize IRT. IRT offers greater specificity, but the most important reason may be that the design of a CCT (and a CAT) is expensive, and is therefore more likely to be done by a large testing program with extensive resources. Such a program would likely use IRT.
Starting point
A CCT must have a specified starting point to enable certain algorithms. If the sequential probability ratio test is used as the termination criterion, it implicitly assumes a starting ratio of 1.0 (equal probability of the examinee being a master or nonmaster). If the termination criterion is a confidence interval approach, a starting point on theta must be specified. Usually, this is 0.0, the center of the distribution, but it could also be drawn randomly from a certain distribution if the parameters of the examinee distribution are known. Also, previous information regarding an individual examinee, such as their score the last time they took the test (if re-taking), may be used.
Item Selection
In a CCT, items are selected for administration throughout the test, unlike the traditional method of administering a fixed set of items to all examinees. While this is usually done by individual item, it can also be done in groups of items known as testlets (Luecht & Nungester, 1998;[4] Vos & Glas, 2000[5]).

Methods of item selection fall into two categories: cutscore-based and estimate-based. Cutscore-based methods (also known as sequential selection) maximize the information provided by the item at the cutscore, or cutscores if there is more than one, regardless of the ability of the examinee. Estimate-based methods (also known as adaptive selection) maximize information at the current estimate of examinee ability, regardless of the location of the cutscore. Both work efficiently, but the efficiency depends in part on the termination criterion employed. Because the sequential probability ratio test only evaluates probabilities near the cutscore, cutscore-based item selection is more appropriate. Because the confidence interval termination criterion is centered around the examinee's ability estimate, estimate-based item selection is more appropriate: the test will make a classification when the confidence interval is small enough to be completely above or below the cutscore (see below). The confidence interval will be smaller when the standard error of measurement is smaller, and the standard error of measurement will be smaller when there is more information at the theta level of the examinee. The sketch below illustrates the difference between the two selection targets.
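A minimal sketch of the two selection targets, with made-up item parameters; only the theta at which Fisher information is maximized differs between the two approaches.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def select_item(remaining, a, b, target_theta):
    """Return the unadministered item with maximum information at target_theta."""
    return max(remaining, key=lambda i: info_2pl(target_theta, a[i], b[i]))

a = [1.4, 0.9, 1.1, 1.6]            # illustrative discriminations
b = [-0.5, 0.2, 0.8, 1.5]           # illustrative difficulties
remaining = [0, 1, 2, 3]
cutscore, theta_hat = 0.7, -0.3     # made-up cutscore and current estimate

seq_item = select_item(remaining, a, b, target_theta=cutscore)   # cutscore-based
ada_item = select_item(remaining, a, b, target_theta=theta_hat)  # estimate-based
```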
Termination criterion
Three termination criteria are commonly used for CCTs: Bayesian decision theory methods, the confidence interval approach, and the sequential probability ratio test. Bayesian decision theory methods offer great flexibility by presenting an infinite choice of loss/utility structures and evaluation considerations, but also introduce greater arbitrariness. A confidence interval approach calculates a confidence interval around the examinee's current theta estimate at each point in the test, and classifies the examinee when the interval falls completely within a region of theta that defines a classification. This was originally known as adaptive mastery testing (Kingsbury & Weiss, 1983), but it does not necessarily require adaptive item selection, nor is it limited to the two-classification mastery testing situation. The sequential probability ratio test (Reckase, 1983) defines the classification problem as a hypothesis test that the examinee's theta is equal to a specified point above the cutscore or a specified point below the cutscore.
References
[1] Thompson, N. A. (2007). A practitioner's guide for variable-length computerized classification testing. Practical Assessment Research & Evaluation, 12(1). (http://pareonline.net/getvn.asp?v=12&n=1)
[2] Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2006). Practical considerations in computer-based testing. New York: Springer.
[3] Frick, T. (1992). Computerized adaptive mastery tests as expert systems. Journal of Educational Computing Research, 8(2), 187-213.
[4] Luecht, R. M., & Nungester, R. J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35, 229-249.
[5] Vos, H.J., & Glas, C.A.W. (2000). Testlet-based adaptive mastery testing. In van der Linden, W.J., & Glas, C.A.W. (Eds.), Computerized Adaptive Testing: Theory and Practice.
Kingsbury, G.G., & Weiss, D.J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.
Lau, C. A. (1996). Robustness of a unidimensional computerized testing mastery procedure with multidimensional testing data. Unpublished doctoral dissertation, University of Iowa, Iowa City, IA.
Lau, C. A., & Wang, T. (1998). Comparing and combining dichotomous and polytomous items with SPRT procedure in computerized classification testing. Paper presented at the annual meeting of the American Educational Research Association, San Diego.
Lau, C. A., & Wang, T. (1999). Computerized classification testing under practical constraints with a polytomous model. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
Lau, C. A., & Wang, T. (2000). A new item selection procedure for mixed item type in computerized classification testing. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, Louisiana.
Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement, 14, 367-386.
Lin, C.-J., & Spray, J.A. (2000). Effects of item-selection criteria on classification testing with the sequential probability ratio test (Research Report 2000-8). Iowa City, IA: ACT, Inc.
Linn, R. L., Rock, D. A., & Cleary, T. A. (1972). Sequential testing for dichotomous decisions. Educational & Psychological Measurement, 32, 85-95.
Luecht, R. M. (1996). Multidimensional computerized adaptive testing in a certification or licensure context. Applied Psychological Measurement, 20, 389-404.
Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait theory and computerized adaptive testing (pp. 237-254). New York: Academic Press.
Rudner, L. M. (2002). An examination of decision-theory adaptive testing procedures. Paper presented at the annual meeting of the American Educational Research Association, April 1-5, 2002, New Orleans, LA.
Sheehan, K., & Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement, 16, 65-76.
Spray, J. A. (1993). Multiple-category classification using a sequential probability ratio test (Research Report 93-7). Iowa City, IA: ACT, Inc.
Spray, J. A., Abdel-fattah, A. A., Huang, C., & Lau, C. A. (1997). Unidimensional approximations for a computerized test when the item pool and latent space are multidimensional (Research Report 97-5). Iowa City, IA: ACT, Inc.
Spray, J. A., & Reckase, M. D. (1987). The effect of item parameter estimation error on decisions made using the sequential probability ratio test (Research Report 87-17). Iowa City, IA: ACT, Inc.
Spray, J. A., & Reckase, M. D. (1994). The selection of test items for decision making with a computerized adaptive test. Paper presented at the Annual Meeting of the National Council for Measurement in Education (New Orleans, LA, April 5-7, 1994).
Spray, J. A., & Reckase, M. D. (1996). Comparison of SPRT and sequential Bayes procedures for classifying examinees into two categories using a computerized test. Journal of Educational & Behavioral Statistics, 21, 405-414.
Thompson, N.A. (2006). Variable-length computerized classification testing with item response theory. CLEAR Exam Review, 17(2).
Vos, H. J. (1998). Optimal sequential rules for computer-based instruction. Journal of Educational Computing Research, 19, 133-154.
Vos, H. J. (1999). Applications of Bayesian decision theory to sequential mastery testing. Journal of Educational and Behavioral Statistics, 24, 271-292.
Wald, A. (1947). Sequential analysis. New York: Wiley.
Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.
Weissman, A. (2004). Mutual information item selection in multiple-category classification CAT. Paper presented at the Annual Meeting of the National Council for Measurement in Education, San Diego, CA.
Weitzman, R. A. (1982a). Sequential testing for selection. Applied Psychological Measurement, 6, 337-351.
Weitzman, R. A. (1982b). Use of sequential testing to prescreen prospective entrants into military service. In D. J. Weiss (Ed.), Proceedings of the 1982 Computerized Adaptive Testing Conference. Minneapolis, MN: University of Minnesota, Department of Psychology, Psychometric Methods Program.
External links
Measurement Decision Theory (http://edres.org/mdt/) by Lawrence Rudner
CAT Central (http://www.psych.umn.edu/psylabs/catcentral/) by David J. Weiss
Congruence coefficient
In multivariate statistics, the congruence coefficient is an index of the similarity between factors that have been derived in a factor analysis. It was introduced in 1948 by Cyril Burt, who referred to it as unadjusted correlation. It is also called Tucker's congruence coefficient after Ledyard Tucker, who popularized the technique. Its values range between -1 and +1. It can be used to study the similarity of extracted factors across different samples of, for example, test takers who have taken the same test.[1][2][3] Generally, a congruence coefficient of 0.90 is interpreted as indicating a high degree of factor similarity, while a coefficient of 0.95 or higher indicates that the factors are virtually identical. Alternatively, a value in the range 0.85-0.94 has been seen as corresponding to a fair similarity, with values higher than 0.95 indicating that the factors can be considered to be equal.[1][2]
Definition
Let X and Y be column vectors of factor loadings for two different samples. The formula for the congruence coefficient, or $r_c$, is then[2]

$$ r_c = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}. $$
The congruence coefficient can also be defined as the cosine of the angle between factor axes based on the same set of variables (e.g., tests) obtained for two samples (see Cosine similarity). For example, with perfect congruence the angle between the factor axes is 0 degrees, and the cosine of 0 is 1.[2]
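Since the coefficient is simply the cosine between the two loading vectors, it is straightforward to compute; the loading values below are made up for illustration.

```python
import numpy as np

def congruence(x, y):
    """Tucker's congruence coefficient: the cosine of the angle between two
    factor-loading vectors (no mean-centering, unlike Pearson's r)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

# loadings of the "same" factor extracted in two samples (illustrative)
rc = congruence([0.71, 0.62, 0.55, 0.48], [0.68, 0.66, 0.50, 0.45])
```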
References
[1] Lorenzo-Seva, U., & ten Berge, J.M.F. (2006). Tucker's congruence coefficient as a meaningful index of factor similarity. Methodology, 2, 57-64.
[2] Jensen, A.R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger, pp. 99-100.
[3] Abdi, H. (2007). RV coefficient and congruence coefficient. (http://wwwpub.utdallas.edu/~herve/Abdi-RV2007-pretty.pdf) In Neil Salkind (Ed.), Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage.
Conjoint analysis
See also: Conjoint analysis (in marketing), Conjoint analysis (in healthcare), IDDEA, Rule Developing Experimentation, Value based pricing. Conjoint analysis, also called multi-attribute compositional models or stated preference analysis, is a statistical technique that originated in mathematical psychology. Today it is used in many of the social sciences and applied sciences including marketing, product management, and operations research. It is not to be confused with the theory of conjoint measurement.
Methodology
Conjoint analysis requires research participants to make a series of trade-offs. Analysis of these trade-offs will reveal the relative importance of component attributes. To improve the predictive ability of this analysis, research participants should be grouped into similar segments based on objectives, values and/or other factors.

The exercise can be administered to survey respondents in a number of different ways. Traditionally it is administered as a ranking exercise and sometimes as a rating exercise (where the respondent awards each trade-off scenario a score indicating appeal). In more recent years it has become common practice to present the trade-offs as a choice exercise (where the respondent simply chooses the most preferred alternative from a selection of competing alternatives; this is particularly common when simulating consumer choices) or as a constant sum allocation exercise (particularly common in pharmaceutical market research, where physicians indicate likely shares of prescribing, and each alternative in the trade-off is the description of a real or hypothetical therapy).

Analysis is traditionally carried out with some form of multiple regression, but more recently the use of hierarchical Bayesian analysis has become widespread, enabling fairly robust statistical models of individual respondent decision behaviour to be developed.

When there are many attributes, experiments with conjoint analysis include problems of information overload that affect the validity of such experiments. The impact of these problems can be avoided or reduced by using Hierarchical Information Integration.[1]
Example
A real estate developer is interested in building a high-rise apartment complex near an urban Ivy League university. To ensure the success of the project, a market research firm is hired to conduct focus groups with current students. Students are segmented by academic year (freshman, upper classmen, graduate studies) and amount of financial aid received.

Study participants are given a series of index cards. Each card has 6 attributes to describe the potential building project (proximity to campus, cost, telecommunication packages, laundry options, floor plans, and security features offered). The estimated cost to construct the building described on each card is equivalent. Participants are asked to order the cards from least to most appealing. This forced ranking exercise will indirectly reveal the participants' priorities and preferences. Multivariate regression analysis may be used to determine the strength of preferences across target market segments, as in the sketch below.
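As a minimal sketch of that last step (not from the original article), part-worth utilities can be estimated by least squares on dummy-coded attributes; the three binary attributes, the profiles, and the preference scores below are invented for illustration.

```python
import numpy as np

# Columns: intercept, close to campus, in-unit laundry, 24-hour security.
# Each row is one apartment profile shown to a respondent.
X = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
])
# Preference scores derived from the ranking exercise (higher = preferred).
y = np.array([5.0, 3.0, 2.5, 2.0, 1.0])

# Least-squares part-worths: each coefficient estimates how much the
# presence of an attribute raises a profile's appeal.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, partworths = beta[0], beta[1:]
```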
Correction for attenuation
Background
Correlations between parameters are diluted or weakened by measurement error. Disattenuation provides a more accurate estimate of the correlation between the parameters by accounting for this effect.

Let $\beta$ and $\theta$ be the true values of two attributes, and let $\hat{\beta} = \beta + \epsilon_\beta$ and $\hat{\theta} = \theta + \epsilon_\theta$ be estimates of them, where $\epsilon_\beta$ and $\epsilon_\theta$ are measurement errors. The estimated correlation between the two sets of estimates is

$$ \operatorname{corr}(\hat{\beta}, \hat{\theta}) = \frac{\operatorname{cov}(\beta + \epsilon_\beta,\; \theta + \epsilon_\theta)}{\sqrt{\operatorname{var}(\beta + \epsilon_\beta)\,\operatorname{var}(\theta + \epsilon_\theta)}} $$

which, assuming the errors are uncorrelated with each other and with the true values, gives

$$ \operatorname{corr}(\hat{\beta}, \hat{\theta}) = \operatorname{corr}(\beta, \theta)\sqrt{R_\beta R_\theta} $$

where $R_\beta$ is the separation index of the set of estimates of $\beta$, defined as follows:

$$ R_\beta = \frac{\operatorname{var}(\beta)}{\operatorname{var}(\hat{\beta})} = \frac{\operatorname{var}(\beta)}{\operatorname{var}(\beta) + \operatorname{var}(\epsilon_\beta)}, $$

where the mean squared standard error of person estimate gives an estimate of the variance of the errors; the standard errors are normally produced as a by-product of the estimation process (see Rasch model estimation). The disattenuated estimate of the correlation between the two sets of parameters or measures is therefore

$$ \operatorname{corr}(\beta, \theta) = \frac{\operatorname{corr}(\hat{\beta}, \hat{\theta})}{\sqrt{R_\beta R_\theta}}. $$

That is, the disattenuated correlation is obtained by dividing the correlation between the estimates by the square root of the product of the separation indices of the two sets of estimates. Expressed in terms of classical test theory, the correlation is divided by the square root of the product of the reliability coefficients of the two tests: given two random variables $X$ and $Y$ with observed correlation $r_{xy}$ and a known reliability for each variable, $r_{xx}$ and $r_{yy}$, the correlation corrected for attenuation is

$$ r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}}. $$

How well the variables are measured affects the correlation of X and Y. The correction for attenuation tells you what the correlation would be if you could measure X and Y with perfect reliability. If $X$ and $Y$ are taken to be imperfect measurements of underlying variables $X'$ and $Y'$ with independent errors, then $r_{x'y'}$ measures the true correlation between $X'$ and $Y'$.
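A minimal numeric sketch of the classical-test-theory form; the correlation and reliability values are invented for illustration.

```python
import numpy as np

def disattenuate(r_xy, r_xx, r_yy):
    """Correct an observed correlation for attenuation due to measurement
    error, given the reliabilities of the two measures."""
    return r_xy / np.sqrt(r_xx * r_yy)

# an observed r of .40 between tests with reliabilities .80 and .70
r_true = disattenuate(0.40, 0.80, 0.70)   # ~0.53
```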
References
Jensen, A.R. (1998). The g Factor: The Science of Mental Ability. Westport, CT: Praeger. ISBN 0-275-96103-6
Spearman, C. (1904). "The Proof and Measurement of Association between Two Things". The American Journal of Psychology, 15(1), 72-101. JSTOR 1412159 [1]
External links
Disattenuating correlations [2] Disattenuation of correlation and regression coefficients: Jason W. Osborne [3]
References
[1] http://www.jstor.org/stable/1412159
[2] http://www.rasch.org/rmt/rmt101g.htm
[3] http://pareonline.net/getvn.asp?v=8&n=11
Counternull
In statistics, and especially in the statistical analysis of psychological data, the counternull is a statistic used to aid the understanding and presentation of research results. It revolves around the effect size, which is the mean magnitude of some effect divided by the standard deviation.[1] The counternull value is the effect size that is just as well supported by the data as the null hypothesis.[2] In particular, when results are drawn from a distribution that is symmetrical about its mean, the counternull value is exactly twice the observed effect size.

The null hypothesis is a hypothesis set up to be tested against an alternative. Thus the counternull is an alternative hypothesis that, when used to replace the null hypothesis, generates the same p-value as had the original null hypothesis of no difference.[3] The sketch below demonstrates this equivalence for a symmetric (normal) sampling distribution.

Some researchers contend that reporting the counternull, in addition to the p-value, serves to counter two common errors of judgment: assuming that failure to reject the null hypothesis at the chosen level of statistical significance means that the observed size of the "effect" is zero; and assuming that rejection of the null hypothesis at a particular p-value means that the measured "effect" is not only statistically significant, but also scientifically important. These arbitrary statistical thresholds create a discontinuity, causing unnecessary confusion and artificial controversy.[4] Other researchers prefer confidence intervals as a means of countering these common errors.[5]
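A minimal sketch of that claim, assuming a normally distributed effect-size estimate; the observed effect size and its standard error are made-up values, and SciPy is used for the normal tail probability.

```python
from scipy.stats import norm

d_obs, se = 0.30, 0.20           # made-up observed effect size and its SE
counternull = 2 * d_obs          # counternull for a symmetric distribution

# Two-sided p-values: the data are exactly as consistent with an effect of
# zero (the null) as with an effect of 2 * d_obs (the counternull).
p_null = 2 * norm.sf(abs(d_obs - 0.0) / se)
p_counternull = 2 * norm.sf(abs(d_obs - counternull) / se)
assert abs(p_null - p_counternull) < 1e-12
```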
References
[4] Pashler (2002), p. 348: "The reject/fail-to-reject dichotomy keeps the field awash in confusion and artificial controversy."
Further reading
Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods, 1, 331-340
Criterion-referenced test
A criterion-referenced test is one that provides for translating test scores into a statement about the behavior to be expected of a person with that score or their relationship to a specified subject matter. Most tests and quizzes that are written by school teachers can be considered criterion-referenced tests. The objective is simply to see whether the student has learned the material. Criterion-referenced assessment can be contrasted with norm-referenced assessment and ipsative assessment. Criterion-referenced testing was a major focus of psychometric research in the 1970s.[1]
Definition of criterion
A common misunderstanding regarding the term is the meaning of criterion. Many, if not most, criterion-referenced tests involve a cutscore, where the examinee passes if their score exceeds the cutscore and fails if it does not (often called a mastery test). The criterion is not the cutscore; the criterion is the domain of subject matter that the test is designed to assess. For example, the criterion may be "Students should be able to correctly add two single-digit numbers," and the cutscore may be that students should correctly answer a minimum of 80% of the questions to pass.

The criterion-referenced interpretation of a test score identifies its relationship to the subject matter. In the case of a mastery test, this does mean identifying whether the examinee has "mastered" a specified level of the subject matter by comparing their score to the cutscore. However, not all criterion-referenced tests have a cutscore, and the score can simply refer to a person's standing on the subject domain.[2] The ACT is an example of this: there is no cutscore; it is simply an assessment of the student's knowledge of high-school-level subject matter.

Because of this common misunderstanding, criterion-referenced tests have also been called standards-based assessments by some education agencies,[3] as students are assessed with regard to standards that define what they "should" know, as defined by the state.[4]
Sample scoring for the question "What caused World War II?":

Student #1: "WWII was caused by Hitler and Germany invading Poland." - This answer is worse than Student #2's answer, but better than Student #3's answer.
Student #2: "WWII was caused by multiple factors, including the Great Depression and the general economic situation, the rise of nationalism, fascism, and imperialist expansionism, and unresolved resentments related to WWI. The war in Europe began with the German invasion of Poland." - This answer is better than Student #1's and Student #3's answers.
Student #3: "WWII was caused by the assassination of Archduke Ferdinand." - This answer is worse than Student #1's and Student #2's answers.
Both terms criterion-referenced and norm-referenced were originally coined by Robert Glaser.[5] Unlike a criterion-referenced test, a norm-referenced test indicates whether the test-taker did better or worse than other people who took the test. For example, if the criterion is "Students should be able to correctly add two single-digit numbers," then reasonable test questions might look like "2 + 3 = ?" or "9 + 8 = ?". A criterion-referenced test would report the student's performance strictly according to whether the individual student correctly answered these questions.
A norm-referenced test would report primarily whether this student correctly answered more questions compared to other students in the group.

Even when testing similar topics, a test which is designed to accurately assess mastery may use different questions than one which is intended to show relative ranking. This is because some questions are better at reflecting actual achievement of students, and some test questions are better at differentiating between the best students and the worst students. (Many questions will do both.) A criterion-referenced test will use questions which were correctly answered by students who know the specific material. A norm-referenced test will use questions which were correctly answered by the "best" students and not correctly answered by the "worst" students (e.g., Cambridge University's pre-entry 'S' paper).

Some tests can provide useful information about both actual achievement and relative ranking. The ACT provides both a ranking and an indication of what level is considered necessary for likely success in college.[6] Some argue that the term "criterion-referenced test" is a misnomer, since it can refer to the interpretation of the score as well as the test itself.[7] In the previous example, the same score on the ACT can be interpreted in a norm-referenced or criterion-referenced manner.
Examples
Driving tests are criterion-referenced tests, because their goal is to see whether the test taker is skilled enough to be granted a driver's license, not to see whether one test taker is more skilled than another test taker. Citizenship tests are usually criterion-referenced tests, because their goal is to see whether the test taker is sufficiently familiar with the new country's history and government, not to see whether one test taker is more knowledgeable than another test taker.
References
[2] QuestionMark Glossary (http://www.questionmark.com/us/glossary.htm)
[3] Assessing the Assessment of Outcomes Based Education (http://www.apapdc.edu.au/archive/ASPA/conference2000/papers/art_3_9.htm) by Dr Malcolm Venter. Cape Town, South Africa. "OBE advocates a criterion-based system, which means getting rid of the bell curve, phasing out grade point averages and comparative grading."
[4] Homeschool World (http://www.home-school.com/exclusive/standards.html): "The Education Standards Movement Spells Trouble for Private and Home Schools"
[6] Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York: Harper & Row.
Cronbach's alpha
In statistics, Cronbach's α (alpha) is a coefficient of internal consistency. It is commonly used as an estimate of the reliability of a psychometric test for a sample of examinees. It was first named alpha by Lee Cronbach in 1951, as he had intended to continue with further coefficients. The measure can be viewed as an extension of the Kuder-Richardson Formula 20 (KR-20), which is an equivalent measure for dichotomous items. Alpha is not robust against missing data. Several other Greek letters have been used by later researchers to designate other measures used in a similar context.[1] Somewhat related is the average variance extracted (AVE).

This article discusses the use of α in psychology, but Cronbach's alpha statistic is widely used in the social sciences, business, nursing, and other disciplines. The term item is used throughout this article, but items could be anything (questions, raters, indicators) of which one might ask to what extent they "measure the same thing." Items that are manipulated are commonly referred to as variables.
Definition
Suppose that we measure a quantity which is a sum of $K$ components ($K$ items or testlets): $X = Y_1 + Y_2 + \cdots + Y_K$. Cronbach's $\alpha$ is defined as

$$ \alpha = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} \sigma^2_{Y_i}}{\sigma^2_X}\right) $$

where $\sigma^2_X$ is the variance of the observed total test scores and $\sigma^2_{Y_i}$ is the variance of component $i$ for the current sample of persons.

Cronbach's $\alpha$ can also be defined as

$$ \alpha = \frac{K\bar{c}}{\bar{v} + (K-1)\bar{c}} $$

where $K$ is as above, $\bar{v}$ is the average variance of each component (item), and $\bar{c}$ is the average of all covariances between the components across the current sample of persons (that is, without including the variances of each component).

The standardized Cronbach's alpha can be defined as

$$ \alpha_{\text{standardized}} = \frac{K\bar{r}}{1 + (K-1)\bar{r}} $$

where $K$ is as above and $\bar{r}$ is the mean of the $K(K-1)/2$ non-redundant correlation coefficients (i.e., the mean of an upper triangular, or lower triangular, correlation matrix).

Cronbach's $\alpha$ is related conceptually to the Spearman-Brown prediction formula. Both arise from the basic classical test theory result that the reliability of test scores can be expressed as the ratio of the true-score and total-score (error plus true score) variances:

$$ \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}. $$
The theoretical value of alpha varies from zero to 1, since it is the ratio of two variances. However, depending on the estimation procedure used, estimates of alpha can take on any value less than or equal to 1, including negative values, although only positive values make sense.[3] Higher values of alpha are more desirable. Some professionals,[4] as a rule of thumb, require a reliability of 0.70 or higher (obtained on a substantial sample) before they will use an instrument. Obviously, this rule should be applied with caution when α has been computed from items that systematically violate its assumptions. Furthermore, the appropriate degree of reliability depends upon the use of the instrument. For example, an instrument designed to be used as part of a battery of tests may be intentionally designed to be as short as possible, and therefore somewhat less reliable. Other situations may require extremely precise measures with very high reliabilities. In the extreme case of a two-item test, the Spearman-Brown prediction formula is more appropriate than Cronbach's alpha.[5] This has resulted in a wide variance of test reliability. In the case of psychometric tests, most fall within the range of 0.75 to 0.83, with at least one claiming a Cronbach's alpha above 0.90 (Nunnally 1978, pages 245-246).
Internal consistency
Cronbach's alpha will generally increase as the intercorrelations among test items increase, and is thus known as an internal consistency estimate of reliability of test scores. Because intercorrelations among test items are maximized when all items measure the same construct, Cronbach's alpha is widely believed to indirectly indicate the degree to which a set of items measures a single unidimensional latent construct. However, the average intercorrelation among test items is affected by skew just like any other average. Thus, whereas the modal intercorrelation among test items will equal zero when the set of items measures several unrelated latent constructs, the average intercorrelation among test items will be greater than zero in this case. Indeed, several investigators have shown that alpha can take on quite high values even when the set of items measures several unrelated latent constructs.[6][7][8][9][10] As a result, alpha is most appropriately used when the items measure different substantive areas within a single construct. When the set of items measures more than one construct, coefficient omega_hierarchical is more appropriate.

Alpha treats any covariance among items as true-score variance, even if items covary for spurious reasons. For example, alpha can be artificially inflated by making scales which consist of superficial changes to the wording within a set of items or by analyzing speeded tests.

A commonly accepted rule of thumb for describing internal consistency using Cronbach's alpha is as follows;[11][12] however, a greater number of items in the test can artificially inflate the value of alpha,[6] so this rule of thumb should be used with caution:
Cronbach's alpha    Internal consistency
α ≥ 0.9             Excellent
0.8 ≤ α < 0.9       Good
0.7 ≤ α < 0.8       Acceptable
0.6 ≤ α < 0.7       Questionable
0.5 ≤ α < 0.6       Poor
α < 0.5             Unacceptable
Generalizability theory
Cronbach and others generalized some basic assumptions of classical test theory in their generalizability theory. If this theory is applied to test construction, then it is assumed that the items that constitute the test are a random sample from a larger universe of items. The expected score of a person in the universe is called the universe score, analogous to a true score. The generalizability is defined analogously as the variance of the universe scores divided by the variance of the observable scores, analogous to the concept of reliability in classical test theory. In this theory, Cronbach's alpha is an unbiased estimate of the generalizability. For this to be true the assumptions of essential τ-equivalence or parallelness are not needed. Consequently, Cronbach's alpha can be viewed as a measure of how well the sum score on the selected items captures the expected score in the entire domain, even if that domain is heterogeneous.
Intra-class correlation
Cronbach's alpha is said to be equal to the stepped-up consistency version of the intra-class correlation coefficient, which is commonly used in observational studies. But this is only conditionally true. In terms of variance components, this condition is, for item sampling: if and only if the value of the item (rater, in the case of rating) variance component equals zero. If this variance component is negative, alpha will underestimate the stepped-up intra-class correlation coefficient; if this variance component is positive, alpha will overestimate this stepped-up intra-class correlation coefficient.
Factor analysis
Cronbach's alpha also has a theoretical relation with factor analysis. As shown by Zinbarg, Revelle, Yovel and Li,[] alpha may be expressed as a function of the parameters of the hierarchical factor analysis model which allows for a general factor that is common to all of the items of a measure in addition to group factors that are common to some but not all of the items of a measure. Alpha may be seen to be quite complexly determined from this perspective. That is, alpha is sensitive not only to general factor saturation in a scale but also to group factor saturation and even to variance in the scale scores arising from variability in the factor loadings. Coefficient omega_hierarchical[][] has a much more straightforward interpretation as the proportion of observed variance in the scale scores that is due to the general factor common to all of the items comprising the scale.
Notes
[3] Ritter, N. (2010). "Understanding a widely misunderstood statistic: Cronbach's alpha". Paper presented at Southwestern Educational Research Association (SERA) Conference 2010, New Orleans, LA (ED526237).
[6] Cortina, J.M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.
[11] George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference. 11.0 update (4th ed.). Boston: Allyn & Bacon.
[12] Kline, P. (1999). The handbook of psychological testing (2nd ed.). London: Routledge.
Further reading
Allen, M.J., & Yen, W. M. (2002). Introduction to Measurement Theory. Long Grove, IL: Waveland Press.
Bland, J.M., & Altman, D.G. (1997). Statistics notes: Cronbach's alpha (http://www.bmj.com/cgi/content/full/314/7080/572). BMJ, 314:572.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64(3), 391-418. doi:10.1177/0013164404266386 (http://dx.doi.org/10.1177/0013164404266386)
Cutscore
A cutscore, also known as a passing score or passing point, is a single point on a score continuum that differentiates between classifications along the continuum. The most common cutscore, that many are familiar with, is a score that differentiates between the classifications of "pass" and "fail" on a professional or educational test.
Setting a cutscore
Many tests with low stakes set cutscores arbitrarily; for example, an elementary school teacher may require students to correctly answer 60% of the items on a test to pass. However, for a high-stakes test with a cutscore to be legally defensible and meet the Standards for Educational and Psychological Testing, the cutscore must be set with a formal standard-setting study or equated to another form of the test.
Descriptive statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data.[1] Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory.[2] Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities. Descriptive statistics is also a set of brief descriptive coefficients that summarizes a given data set that represents either the entire population or a sample. The measures that describe the data set are measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum variables, kurtosis and skewness.[3]
In any analysis, one should always consider the expectations of future events.[3]
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its central tendency (including the mean, median, and mode) and dispersion (including the range and quantiles of the data-set, and measures of spread such as the variance and standard deviation). The shape of the distribution may also be described via indices such as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in graphical or tabular format, including histograms and stem-and-leaf display.
Bivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to describe the relationship between pairs of variables. In this case, descriptive statistics include:
Cross-tabulations and contingency tables
Graphical representation via scatterplots
Quantitative measures of dependence
Descriptions of conditional distributions
The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not only simple descriptive analysis; it also describes the relationship between two different variables.[4] Quantitative measures of dependence include correlation (such as Pearson's r when both variables are continuous, or Spearman's rho if one or both are not) and covariance (which reflects the scale the variables are measured on). The slope, in regression analysis, also reflects the relationship between variables. The unstandardised slope indicates the unit change in the criterion variable for a one-unit change in the predictor. The standardised slope indicates this change in standardised (z-score) units. Analysts should also ensure that the sample used is a good representative of the whole population; with highly skewed data, this is often addressed by applying a logarithmic transformation, which makes the distribution more symmetrical and closer to normal. Such transformations are widely used, for example, when analyzing data in molecular biology.[5]
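As a small illustration of these univariate and bivariate summaries (with made-up data):

```python
import numpy as np

x = np.array([2.1, 3.4, 3.9, 4.2, 5.6, 6.0, 7.3])   # made-up sample
y = np.array([1.0, 2.2, 2.1, 3.0, 3.9, 4.4, 5.1])

# univariate descriptives
mean, median = x.mean(), np.median(x)
sd = x.std(ddof=1)                     # sample standard deviation
data_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])    # quartiles

# bivariate descriptives
r = np.corrcoef(x, y)[0, 1]            # Pearson correlation
cov = np.cov(x, y, ddof=1)[0, 1]       # covariance (scale-dependent)
slope = cov / x.var(ddof=1)            # unstandardised slope of y on x
```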
References
[1] Mann, Prem S. (1995). Introductory Statistics (2nd ed.). Wiley. ISBN 0-471-31009-3
[2] Dodge, Y. (2003). The Oxford Dictionary of Statistical Terms. OUP. ISBN 0-19-850994-4
[3] Investopedia, Descriptive Statistics Terms (http://www.investopedia.com/terms/d/descriptive_statistics.asp)
[4] Babbie, Earl R. (2009). The Practice of Social Research (12th ed.). Wadsworth Publishing. ISBN 0-495-59841-0, pp. 436-440.
[5] Nick, Todd G. "Descriptive Statistics", p. 47.
External links
Descriptive Statistics Lecture: University of Pittsburgh Supercourse: http://www.pitt.edu/~super1/lecture/lec0421/index.htm
Elementary cognitive task
Examples
Memory span
Reaction time
References
[1] Carroll, John Bissell (1993). Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press. ISBN 0-521-38712-4, p. 11.
[2] Jensen, Arthur R. (1987). Process differences and individual differences in some cognitive tasks. Intelligence, 11(2), 107-136.
[3] Grudnik, J., & Kranzler, J. (2001). Meta-analysis of the relationship between intelligence and inspection time. Intelligence, 29, 523-535.
Equating
Test equating traditionally refers to the statistical process of determining comparable scores on different forms of an exam.[1] It can be accomplished using either classical test theory or item response theory. In item response theory, equating is the process of equating the units and origins of two scales on which the abilities of students have been estimated from results on different tests. The process is analogous to equating degrees Fahrenheit with degrees Celsius by converting measurements from one scale to the other. The determination of comparable scores is a by-product of equating that results from equating the scales obtained from test results.
In item response theory, two different kinds of equating are horizontal and vertical equating.[2] Vertical equating refers to the process of equating tests administered to groups of students with different abilities, such as students in different grades (years of schooling).[3] Horizontal equating refers to the equating of tests administered to groups with similar abilities; for example, two tests administered to students in the same grade in two consecutive calendar years. Different tests are used to avoid practice effects.

In terms of item response theory, equating is just a special case of the more general process of scaling, applicable when more than one test is used. In practice, though, scaling is often implemented separately for different tests and the scales subsequently equated.

A distinction is often made between two methods of equating: common person and common item equating. Common person equating involves the administration of two tests to a common group of persons. The mean and standard deviation of the scale locations of the group on the two tests are equated using a linear transformation. Common item equating involves the use of a set of common items, referred to as the anchor test, embedded in two different tests. The mean item location of the common items is equated.
Figure 1: Test characteristic curves showing the relationship between total score and person location for two different tests in relation to a common scale. In this example a total of 37 on Assessment 1 equates to a total of 34.9 on Assessment 2, as shown by the vertical line.
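A minimal sketch of the linear (mean-and-standard-deviation) transformation described above, here applied to common-person equating; the score vectors are made-up values.

```python
import numpy as np

def linear_equate(scores_x, scores_y):
    """Find A, B so that form X scores map onto form Y's scale via
    y = A * x + B, by matching the common group's mean and SD on each form."""
    sx = np.asarray(scores_x, dtype=float)
    sy = np.asarray(scores_y, dtype=float)
    A = sy.std(ddof=1) / sx.std(ddof=1)
    B = sy.mean() - A * sx.mean()
    return A, B

# the same group's scores on the two forms (illustrative numbers)
A, B = linear_equate([12, 15, 14, 18, 11], [13, 17, 15, 20, 12])
equated = A * 37 + B     # a raw 37 on form X expressed on form Y's scale
```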
References
[1] Kolen, M.J., & Brennan, R.L. (1995). Test Equating. New York: Springer.
[2] Baker, F. (1983). Comparison of ability metrics obtained under two latent trait theory procedures. Applied Psychological Measurement, 7, 97-110.
[3] Baker, F. (1984). Ability metric transformations involved in vertical equating under item response theory. Applied Psychological Measurement, 8(3), 261-271.
External links
Equating and the SAT (http://www.collegeboard.com/student/testing/sat/scores/understanding/equating.html)
Equating and AP Tests (http://collegeboard.com/student/testing/ap/exgrd_set.html)
IRTEQ: Windows Application that Implements IRT Scaling and Equating (http://www.umass.edu/remp/software/irteq/)
Factor analysis
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. In other words, it is possible, for example, that variations in three or four observed variables mainly reflect the variations in fewer unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modeled as linear combinations of the potential factors, plus "error" terms. The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. Computationally this technique is equivalent to low-rank approximation of the matrix of observed variables. Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other applied sciences that deal with large quantities of data.

Factor analysis is related to principal component analysis (PCA), but the two are not identical. Latent variable models, including factor analysis, use regression modelling techniques to test hypotheses producing error terms, while PCA is a descriptive statistical technique. There has been significant controversy in the field over the equivalence or otherwise of the two techniques (see exploratory factor analysis versus principal components analysis).
Statistical model
Definition
Suppose we have a set of $p$ observable random variables $x_1, \dots, x_p$ with means $\mu_1, \dots, \mu_p$. Suppose that, for some unknown constants $l_{ij}$ and $k$ unobserved random variables $F_j$ (where $i \in \{1, \dots, p\}$ and $j \in \{1, \dots, k\}$, with $k < p$), we have

$$x_i - \mu_i = l_{i1} F_1 + \cdots + l_{ik} F_k + \varepsilon_i .$$

Here, the $\varepsilon_i$ are independently distributed error terms with zero mean and finite variance, which may not be the same for all $i$. Let $\mathrm{Var}(\varepsilon_i) = \psi_i$, so that we have

$$\mathrm{Cov}(\varepsilon) = \mathrm{Diag}(\psi_1, \dots, \psi_p) = \Psi, \qquad \mathrm{E}(\varepsilon) = 0 .$$

In matrix terms, we have

$$x - \mu = L F + \varepsilon .$$

If we have $n$ observations, then $x$ is a $p \times n$ matrix, $L$ a $p \times k$ matrix, and $F$ a $k \times n$ matrix. Each column of $x$ and $F$ denotes values for one particular observation, and the matrix $L$ does not vary across observations.

Also we will impose the following assumptions on $F$:

1. $F$ and $\varepsilon$ are independent.
2. $\mathrm{E}(F) = 0$.
3. $\mathrm{Cov}(F) = I$ (to make sure that the factors are uncorrelated).

Any solution of the above set of equations following these constraints is defined as the factors $F$, and $L$ as the loading matrix.

Suppose $\mathrm{Cov}(x - \mu) = \Sigma$. Then, from the conditions just imposed on $F$, we have

$$\mathrm{Cov}(x - \mu) = \mathrm{Cov}(L F + \varepsilon),$$

or

$$\Sigma = L \, \mathrm{Cov}(F) \, L^T + \mathrm{Cov}(\varepsilon),$$

or

$$\Sigma = L L^T + \Psi .$$

Note that for any orthogonal matrix $Q$, if we set $L' = L Q$ and $F' = Q^T F$, the criteria for being factors and factor loadings still hold. Hence a set of factors and factor loadings is unique only up to an orthogonal transformation.
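The covariance identity $\Sigma = L L^T + \Psi$ and the rotation indeterminacy can be checked numerically. The following is a small illustrative sketch in Python with NumPy; the dimensions and values are arbitrary, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 2                                   # observed variables, factors
L = rng.normal(size=(p, k))                   # loading matrix
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))  # diagonal error variances

Sigma = L @ L.T + Psi                         # implied covariance of x

# Rotation indeterminacy: for any orthogonal Q, the loadings LQ imply
# the same covariance matrix, so L is identified only up to rotation.
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))  # a random orthogonal matrix
L_rot = L @ Q
print(np.allclose(Sigma, L_rot @ L_rot.T + Psi))  # True
```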
Example
The following example is for expository purposes, and should not be taken as being realistic. Suppose a psychologist proposes a theory that there are two kinds of intelligence, "verbal intelligence" and "mathematical intelligence", neither of which is directly observed. Evidence for the theory is sought in the examination scores, in each of 10 different academic fields, of 1000 students. If each student is chosen randomly from a large population, then each student's 10 scores are random variables. The psychologist's theory may say that for each of the 10 academic fields, the score averaged over the group of all students who share some common pair of values for verbal and mathematical "intelligences" is some constant times their level of verbal intelligence plus another constant times their level of mathematical intelligence, i.e., it is a combination of those two "factors".

The numbers for a particular subject, by which the two kinds of intelligence are multiplied to obtain the expected score, are posited by the theory to be the same for all intelligence level pairs, and are called "factor loadings" for this subject. For example, the theory may hold that the average student's aptitude in the field of taxonomy is {10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}. The numbers 10 and 6 are the factor loadings associated with taxonomy. Other academic subjects may have different factor loadings.

Two students having identical degrees of verbal intelligence and identical degrees of mathematical intelligence may have different aptitudes in taxonomy because individual aptitudes differ from average aptitudes. That difference is called the "error", a statistical term that means the amount by which an individual differs from what is average for his or her levels of intelligence (see errors and residuals in statistics).

The observable data that go into factor analysis would be 10 scores of each of the 1000 students, a total of 10,000 numbers. The factor loadings and levels of the two kinds of intelligence of each student must be inferred from the data.
In this example, each score obeys

$$x_{k,i} = \mu_k + \ell_{k,1} v_i + \ell_{k,2} m_i + \varepsilon_{k,i}$$

where
$x_{k,i}$ is the $i$th student's score for the $k$th subject,
$\mu_k$ is the mean of the students' scores for the $k$th subject (assumed to be zero, for simplicity, in the example as described above, which would amount to a simple shift of the scale used),
$v_i$ is the $i$th student's "verbal intelligence",
$m_i$ is the $i$th student's "mathematical intelligence",
$\ell_{k,j}$ are the factor loadings for the $k$th subject, for $j = 1, 2$,
$\varepsilon_{k,i}$ is the difference between the $i$th student's score in the $k$th subject and the average score in the $k$th subject of all students whose levels of verbal and mathematical intelligence are the same as those of the $i$th student.

In matrix notation, we have

$$X = M + L F + E$$

where $\mu$ is a $10 \times 1$ column vector of unobservable constants (in this case "constants" are quantities not differing from one individual student to the next, and "random variables" are those assigned to individual students; the randomness arises from the random way in which the students are chosen); $M$ is the outer product of $\mu$ with a $1 \times 1000$ row vector of ones, yielding a $10 \times 1000$ matrix of the elements of $\mu$; $L$ is a $10 \times 2$ matrix of factor loadings (unobservable constants: ten academic topics, each with two intelligence parameters that determine success in that topic); $F$ is a $2 \times 1000$ matrix of unobservable random variables (two intelligence parameters for each of the 1000 students); and $E$ is a $10 \times 1000$ matrix of unobservable random variables (the error terms).

Observe that doubling the scale on which "verbal intelligence" (the first component in each column of $F$) is measured, while simultaneously halving the factor loadings for verbal intelligence, makes no difference to the model. Thus, no generality is lost by assuming that the standard deviation of verbal intelligence is 1; likewise for mathematical intelligence. Moreover, for similar reasons, no generality is lost by assuming the two factors are uncorrelated with each other. The "errors" are taken to be independent of each other, and the variances of the "errors" associated with the 10 different subjects are not assumed to be equal.

Note that, since any rotation of a solution is also a solution, this makes interpreting the factors difficult (see disadvantages below). In this particular example, if we do not know beforehand that the two types of intelligence are uncorrelated, then we cannot interpret the two factors as the two different types of intelligence. Even if they are uncorrelated, we cannot tell which factor corresponds to verbal intelligence and which corresponds to mathematical intelligence without an outside argument.

The values of the loadings $L$, the averages $\mu$, and the variances of the "errors" must be estimated given the observed data $X$ and $F$ (the assumption about the levels of the factors is fixed for a given $F$).
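A simulation along the lines of this example can make the setup concrete. The sketch below generates hypothetical scores from a model of the form $X = M + LF + E$ and recovers loadings with scikit-learn's FactorAnalysis; all names and values are invented for illustration, and the recovered loadings agree with the generating ones only up to rotation:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n_students, n_subjects = 1000, 10

F = rng.normal(size=(n_students, 2))             # verbal and mathematical levels
L = rng.uniform(0.2, 1.0, size=(n_subjects, 2))  # invented factor loadings
mu = rng.uniform(40, 60, size=n_subjects)        # per-subject mean scores
E = rng.normal(scale=0.5, size=(n_students, n_subjects))

X = mu + F @ L.T + E                             # 1000 x 10 score matrix

fa = FactorAnalysis(n_components=2).fit(X)
print(fa.components_.T.round(2))  # estimated loadings, identified only up to rotation
```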
Practical implementation
Type of factor analysis
Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and group items that are part of unified concepts.[] The researcher makes no "a priori" assumptions about relationships among factors.[]

Confirmatory factor analysis (CFA) is a more complex approach that tests the hypothesis that the items are associated with specific factors.[] CFA uses structural equation modeling to test a measurement model whereby loading on the factors allows for evaluation of relationships between observed variables and unobserved variables.[] Structural equation modeling approaches can accommodate measurement error, and are less restrictive than least-squares estimation.[] Hypothesized models are tested against actual data, and the analysis would demonstrate loadings of observed variables on the latent variables (factors), as well as the correlation between the latent variables.[]
Types of factoring
Principal component analysis (PCA): PCA is a widely used method for factor extraction, which is the first phase of EFA.[] Factor weights are computed in order to extract the maximum possible variance, with successive factoring continuing until there is no further meaningful variance left.[] The factor model must then be rotated for analysis.[]

Canonical factor analysis, also called Rao's canonical factoring, is a different method of computing the same model as PCA, which uses the principal axis method. Canonical factor analysis seeks factors which have the highest canonical correlation with the observed variables. Canonical factor analysis is unaffected by arbitrary rescaling of the data.

Common factor analysis, also called principal factor analysis (PFA) or principal axis factoring (PAF), seeks the least number of factors which can account for the common variance (correlation) of a set of variables.

Image factoring: based on the correlation matrix of predicted variables rather than actual variables, where each variable is predicted from the others using multiple regression.

Alpha factoring: based on maximizing the reliability of factors, assuming variables are randomly sampled from a universe of variables. All other methods assume cases to be sampled and variables fixed.

Factor regression model: a combinatorial model of factor model and regression model; or alternatively, it can be viewed as the hybrid factor model,[] whose factors are partially known.
Terminology
Factor loadings: The factor loadings, also called component loadings in PCA, are the correlation coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the squared factor loading is the percent of variance in that indicator variable explained by the factor. To get the percent of variance in all the variables accounted for by each factor, add the sum of the squared factor loadings for that factor (column) and divide by the number of variables. (Note that the number of variables equals the sum of their variances, as the variance of a standardized variable is 1.) This is the same as dividing the factor's eigenvalue by the number of variables.

Interpreting factor loadings: By one rule of thumb in confirmatory factor analysis, loadings should be .7 or higher to confirm that independent variables identified a priori are represented by a particular factor, on the rationale that the .7 level corresponds to about half of the variance in the indicator being explained by the factor. However, the .7 standard is a high one and real-life data may well not meet this criterion, which is why some researchers, particularly for exploratory purposes, will use a lower level such as .4 for the central factor and .25 for other factors; others call loadings above .6 "high" and those below .4 "low". In any event, factor loadings must be interpreted in the light of theory, not by arbitrary cutoff levels.

In oblique rotation, one gets both a pattern matrix and a structure matrix. The structure matrix is simply the factor loading matrix as in orthogonal rotation, representing the variance in a measured variable explained by a factor on both a unique and common contributions basis. The pattern matrix, in contrast, contains coefficients which represent only unique contributions. The more factors there are, the lower the pattern coefficients as a rule, since there will be more common contributions to variance explained. For oblique rotation, the researcher looks at both the structure and pattern coefficients when attributing a label to a factor.

Communality: The sum of the squared factor loadings for all factors for a given variable (row) is the variance in that variable accounted for by all the factors; this is called the communality. The communality measures the percent of variance in a given variable explained by all the factors jointly, and may be interpreted as the reliability of the indicator.

Spurious solutions: If the communality exceeds 1.0, the solution is spurious, which may reflect too small a sample or the extraction of too many or too few factors.

Uniqueness of a variable: The variability of a variable minus its communality.

Eigenvalues/characteristic roots: The eigenvalue for a given factor measures the variance in all the variables which is accounted for by that factor. The ratio of eigenvalues is the ratio of explanatory importance of the factors with respect to the variables. If a factor has a low eigenvalue, then it is contributing little to the explanation of variances in the variables and may be ignored as redundant with more important factors. Eigenvalues measure the amount of variation in the total sample accounted for by each factor.

Extraction sums of squared loadings: Initial eigenvalues and eigenvalues after extraction (listed by SPSS as "Extraction Sums of Squared Loadings") are the same for PCA extraction, but for other extraction methods, eigenvalues after extraction will be lower than their initial counterparts. SPSS also prints "Rotation Sums of Squared Loadings", and even for PCA these eigenvalues will differ from the initial and extraction eigenvalues, though their total will be the same.
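The quantities just defined (communality, uniqueness, and the variance accounted for by each factor) follow directly from a loading matrix. A minimal sketch, using a made-up loading matrix for standardized variables:

```python
import numpy as np

# Hypothetical loading matrix: four standardized variables, two factors.
loadings = np.array([[0.8, 0.1],
                     [0.7, 0.2],
                     [0.2, 0.9],
                     [0.3, 0.6]])

communality = (loadings ** 2).sum(axis=1)  # variance explained per variable
uniqueness = 1.0 - communality             # residual variance per variable
ssl = (loadings ** 2).sum(axis=0)          # sum of squared loadings per factor
pct_variance = ssl / loadings.shape[0]     # share of total variance per factor

print(communality.round(2), uniqueness.round(2), pct_variance.round(2))
```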
Factor scores (also called component scores in PCA) are the scores of each case (row) on each factor (column). To compute the factor score for a given case for a given factor, one takes the case's standardized score on each variable, multiplies by the corresponding factor loading of the variable for the given factor, and sums these products. Computing factor scores allows one to look for factor outliers. Also, factor scores may be used as variables in subsequent modeling.
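The weighted-sum computation described above takes only a few lines. This sketch reuses the hypothetical loading matrix from the previous sketch; note that statistical packages typically produce refined (e.g. regression-based) factor score estimates rather than this simple sum:

```python
import numpy as np

# The same hypothetical loading matrix as in the previous sketch.
loadings = np.array([[0.8, 0.1],
                     [0.7, 0.2],
                     [0.2, 0.9],
                     [0.3, 0.6]])

# One case's standardized scores on the four variables.
z = np.array([1.2, -0.3, 0.5, 0.0])

factor_scores = z @ loadings   # weighted sum per factor (column)
print(factor_scores)
```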
Rotation methods
The unrotated output maximises the variance accounted for by the first and subsequent factors, and forces the factors to be orthogonal. This data compression comes at the cost of having most items load on the early factors, and usually of having many items load substantially on more than one factor. Rotation serves to make the output more understandable by seeking so-called "simple structure": a pattern of loadings where each item loads strongly on only one factor, and much more weakly on the other factors. Rotations can be orthogonal or oblique (allowing the factors to correlate).

Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the effect of differentiating the original variables by extracted factor. Each factor will tend to have either large or small loadings of any particular variable. A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option. However, the orthogonality (i.e., independence) of factors is often an unrealistic assumption. Oblique rotations are inclusive of orthogonal rotation, and for that reason oblique rotations are a preferred method.[3]

Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed to explain each variable. This type of rotation often generates a general factor on which most variables are loaded to a high or medium degree. Such a factor structure is usually not helpful to the research purpose.

Equimax rotation is a compromise between varimax and quartimax criteria.

Direct oblimin rotation is the standard method when one wishes a non-orthogonal (oblique) solution, that is, one in which the factors are allowed to be correlated. This will result in higher eigenvalues but diminished interpretability of the factors.

Promax rotation is an alternative non-orthogonal (oblique) rotation method which is computationally faster than the direct oblimin method and is therefore sometimes used for very large datasets.
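Varimax rotation is simple enough to sketch directly. The following is an illustrative NumPy implementation of Kaiser's varimax criterion via iterated singular value decompositions; it is a sketch rather than a production routine (packages such as the GPArotation R package mentioned below, or scikit-learn's FactorAnalysis with rotation="varimax", provide tested implementations):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a loading matrix by Kaiser's varimax criterion (gamma=1),
    iterating SVD-based updates of an orthogonal rotation matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0))))
        R = u @ vt                     # best orthogonal rotation for this step
        var_new = s.sum()
        if var_new < var * (1 + tol):  # stop when improvement is negligible
            break
        var = var_new
    return loadings @ R

# Hypothetical unrotated loadings for six variables on two factors.
A = np.array([[0.6, 0.6], [0.5, 0.55], [0.55, 0.5],
              [0.6, -0.6], [0.5, -0.55], [0.55, -0.5]])
print(varimax(A).round(2))  # after rotation, each variable loads mainly on one factor
```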
Applications in psychology
Factor analysis is used to identify "factors" that explain a variety of results on different tests. For example, intelligence research found that people who get a high score on a test of verbal ability are also good on other tests that require verbal abilities. Researchers explained this by using factor analysis to isolate one factor, often called crystallized intelligence or verbal intelligence, which represents the degree to which someone is able to solve problems involving verbal skills. Factor analysis in psychology is most often associated with intelligence research. However, it also has been used to find factors in a broad range of domains such as personality, attitudes, beliefs, etc. It is linked to psychometrics, as it can assess the validity of an instrument by finding if the instrument indeed measures the postulated factors.
Advantages
Reduction of number of variables, by combining two or more variables into a single factor. For example, performance at running, ball throwing, batting, jumping and weight lifting could be combined into a single factor such as general athletic ability. Usually, in an item-by-people matrix, factors are selected by grouping related items. In the Q factor analysis technique, the matrix is transposed and factors are created by grouping related people: for example, liberals, libertarians, conservatives and socialists could form separate groups.

Identification of groups of inter-related variables, to see how they are related to each other. For example, Carroll used factor analysis to build his Three Stratum Theory. He found that a factor called "broad visual perception" relates to how good an individual is at visual tasks. He also found a "broad auditory perception" factor, relating to auditory task capability. Furthermore, he found a global factor, called "g" or general intelligence, that relates to both "broad visual perception" and "broad auditory perception". This means someone with a high "g" is likely to have both a high "visual perception" capability and a high "auditory perception" capability, and that "g" therefore explains a good part of why someone is good or bad in both of those domains.
Disadvantages
"...each orientation is equally acceptable mathematically. But different factorial theories proved to differ as much in terms of the orientations of factorial axes for a given solution as in terms of anything else, so that model fitting did not prove to be useful in distinguishing among theories." (Sternberg, 1977[]). This means all rotations represent different underlying processes, but all rotations are equally valid outcomes of standard factor analysis optimization. Therefore, it is impossible to pick the proper rotation using factor analysis alone. Factor analysis can be only as good as the data allows. In psychology, where researchers often have to rely on less valid and reliable measures such as self-reports, this can be problematic. Interpreting factor analysis is based on using a "heuristic", which is a solution that is "convenient even if not absolutely true".[4] More than one interpretation can be made of the same data factored the same way, and factor analysis cannot identify causality.
of the common factor model. The lack of Heywood cases in the PCA approach may mean that such issues pass unnoticed.[] 4. Researchers gain extra information from a PCA approach, such as an individual's score on a certain component; such information is not yielded from factor analysis. However, as Fabrigar et al. contend, the typical aim of factor analysis (i.e. to determine the factors accounting for the structure of the correlations between measured variables) does not require knowledge of factor scores, and thus this advantage is negated.[] It is also possible to compute factor scores from a factor analysis.
Information collection
The data collection stage is usually done by marketing research professionals. Survey questions ask the respondent to rate a product sample or descriptions of product concepts on a range of attributes. Anywhere from five to twenty attributes are chosen. They could include things like ease of use, weight, accuracy, durability, colourfulness, price, or size. The attributes chosen will vary depending on the product being studied. The same question is asked about all the products in the study. The data for multiple products are coded and input into a statistical program such as R, SPSS, SAS, Stata, STATISTICA, JMP, or SYSTAT.
Analysis
The analysis will isolate the underlying factors that explain the data, using a matrix of associations.[5] Factor analysis is an interdependence technique: the complete set of interdependent relationships is examined, with no specification of dependent variables, independent variables, or causality. Factor analysis assumes that all the rating data on different attributes can be reduced to a few important dimensions. This reduction is possible because some attributes may be related to each other. The rating given to any one attribute is partially the result of the influence of other attributes. The statistical algorithm deconstructs the rating (called a raw score) into its various components and reconstructs the partial scores into underlying factor scores. The degree of correlation between the initial raw score and the final factor score is called a factor loading.
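The idea that a loading behaves like the correlation between a raw attribute score and a factor score can be checked on simulated ratings. A toy sketch with invented data and dimensions (the two printed values should roughly agree, up to estimation error):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Hypothetical ratings: 300 respondents rate 8 product attributes that
# are driven by two underlying dimensions plus noise.
latent = rng.normal(size=(300, 2))
weights = rng.uniform(0.3, 1.0, size=(8, 2))
ratings = latent @ weights.T + rng.normal(scale=0.5, size=(300, 8))

z = StandardScaler().fit_transform(ratings)   # standardize the raw scores
fa = FactorAnalysis(n_components=2).fit(z)
scores = fa.transform(z)                      # factor scores per respondent

# Loading of attribute 0 on factor 0, seen as the correlation between
# the standardized attribute and the estimated factor score, compared
# with the loading the model reports.
print(np.corrcoef(z[:, 0], scores[:, 0])[0, 1])
print(fa.components_[0, 0])
```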
Advantages
Both objective and subjective attributes can be used provided the subjective attributes can be converted into scores. Factor analysis can identify latent dimensions or constructs that direct analysis may not. It is easy and inexpensive.
Disadvantages
Usefulness depends on the researchers' ability to collect a sufficient set of product attributes. If important attributes are excluded or neglected, the value of the procedure is reduced. If sets of observed variables are highly similar to each other and distinct from other items, factor analysis will assign a single factor to them. This may obscure factors that represent more interesting relationships. Naming factors may require knowledge of theory because seemingly dissimilar attributes can correlate strongly for unknown reasons.
Implementation
Factor analysis has been implemented in several statistical analysis programs since the 1980s: SAS, BMDP and SPSS.[10] It is also implemented in the R programming language (with the factanal function) and in OpenOpt. Rotations are implemented in the GPArotation R package.
References
[4] Richard B. Darlington (2004).
[5] Ritter, N. (2012). A comparison of distribution-free and non-distribution free methods in factor analysis. Paper presented at Southwestern Educational Research Association (SERA) Conference 2012, New Orleans, LA (ED529153).
Further reading
Child, Dennis (2006). The Essentials of Factor Analysis (http://books.google.com/books?id=rQ2vdJgohH0C) (3rd ed.). Continuum International. ISBN 978-0-8264-8000-2.
Fabrigar, L.R.; Wegener, D.T.; MacCallum, R.C.; Strahan, E.J. (September 1999). "Evaluating the use of exploratory factor analysis in psychological research" (http://psycnet.apa.org/journals/met/4/3/272/). Psychological Methods 4 (3): 272-299. doi: 10.1037/1082-989X.4.3.272 (http://dx.doi.org/10.1037/1082-989X.4.3.272).
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association. ISBN 1591470935.
External links
Factor Analysis. Retrieved July 23, 2004, from http://www2.chass.ncsu.edu/garson/pa765/factor.htm
Raymond Cattell. Retrieved July 22, 2004, from http://www.indiana.edu/~intell/rcattell.shtml
Exploratory Factor Analysis - A Book Manuscript by Tucker, L. & MacCallum R. (1993). Retrieved June 8, 2006, from http://www.unc.edu/~rcm/book/factornew.htm
Garson, G. David, "Factor Analysis," from Statnotes: Topics in Multivariate Analysis. Retrieved on April 13, 2009, from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm
Factor Analysis at 100 (http://www.fa100.info/index.html), conference material
FARMS - Factor Analysis for Robust Microarray Summarization, an R package (http://www.bioinf.jku.at/software/farms/farms.html)
Figure rating scale

Trends in research
Studies of body dissatisfaction have shown that women tend to pick a smaller ideal body size (IBS) than their current body size.[3] Discrepancies between the two selections indicate body dissatisfaction, which can lead to eating disorders or depression.
References
[1] Grogan, S. (2009). Routledge: New York.
[2] International Journal of Eating Disorders (http://www3.interscience.wiley.com/journal/112417746/abstract?CRETRY=1&SRETRY=0)
[3] Cororve Fingeret, M., Gleaves, D., & Pearson, C. (2004). On the methodology of body image assessment: the use of figural rating scales to evaluate body dissatisfaction and the ideal body standards of women. Body Image, 2, 207-212.
Fuzzy concept
A fuzzy concept is a concept of which the meaningful content, value, or boundaries of application can vary considerably according to context or conditions, instead of being fixed once and for all.[1] This generally means the concept is vague, lacking a fixed, precise meaning, without however being meaningless altogether.[2] It has a meaning, or multiple meanings (it has different semantic associations), but these can become clearer only through further elaboration and specification, including a closer definition of the context in which they are used. Fuzzy concepts "lack clarity and are difficult to test or operationalize".[3]

In logic, fuzzy concepts are often regarded as concepts which, in their application or formally speaking, are neither completely true nor completely false, or which are partly true and partly false; they are ideas which require further elaboration, specification or qualification to understand their applicability (the conditions under which they truly make sense). In mathematics and statistics, a fuzzy variable (such as "the temperature", "hot" or "cold") is a value which could lie in a probable range defined by quantitative limits or parameters, and which can be usefully described with imprecise categories (such as "high", "medium" or "low").

In mathematics and computer science, the gradations of applicable meaning of a fuzzy concept are described in terms of quantitative relationships defined by logical operators. Such an approach is sometimes called "degree-theoretic semantics" by logicians and philosophers,[4] but the more usual term is fuzzy logic or many-valued logic. The basic idea is that a real number is assigned to each statement written in a language, within a range from 0 to 1, where 1 means that the statement is completely true, and 0 means that the statement is completely false, while values less than 1 but greater than 0 represent that the statements are "partly true", to a given, quantifiable extent. This makes it possible to analyze a distribution of statements for their truth-content, identify data patterns, make inferences and predictions, and model how processes operate.

Fuzzy reasoning (i.e. reasoning with graded concepts) has many practical uses.[5] It is nowadays widely used in the programming of vehicle and transport electronics, household appliances, video games, language filters, robotics, and various kinds of electronic equipment used for pattern recognition, surveying and monitoring (such as radars). Fuzzy reasoning is also used in artificial intelligence and virtual intelligence research.[6] "Fuzzy risk scores" are used by project managers and portfolio managers to express risk assessments.[7]
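The degree-theoretic idea is easy to make concrete in code. The sketch below defines an assumed membership function for the fuzzy concept "hot", together with the commonly used minimum/maximum/complement operators for fuzzy conjunction, disjunction and negation; the particular temperature thresholds are invented for illustration:

```python
def hot(temp_c):
    """Graded membership of a temperature in the fuzzy set 'hot':
    0 below 20 C, 1 above 35 C, linear in between (assumed shape)."""
    return min(1.0, max(0.0, (temp_c - 20.0) / 15.0))

def f_and(a, b):   # a common fuzzy conjunction: minimum
    return min(a, b)

def f_or(a, b):    # a common fuzzy disjunction: maximum
    return max(a, b)

def f_not(a):      # fuzzy negation: complement
    return 1.0 - a

t = 28.0
print(hot(t))                        # ~0.53: "28 C is fairly hot"
print(f_and(hot(t), f_not(hot(t))))  # a statement partly true and partly false
```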
Uncertainty
Fuzzy concepts can generate uncertainty because they are imprecise (especially if they refer to a process in motion, or a process of transformation where something is "in the process of turning into something else"). In that case, they do not provide a clear orientation for action or decision-making ("what does X really mean or imply?"); reducing fuzziness, perhaps by applying fuzzy logic, would generate more certainty. However, this is not necessarily always so.[18] A concept, even though it is not fuzzy at all, and even though it is very exact, could equally well fail to capture the meaning of something adequately. That is, a concept can be very precise and exact, but not, or insufficiently, applicable or relevant in the situation to which it refers. In this sense, a definition can be "very precise" but "miss the point" altogether. A fuzzy concept may indeed provide more security, because it provides a meaning for something when an exact concept is unavailable, which is better than not being able to denote it at all. A concept such as God, for instance, although not easily definable, can provide security to the believer.
Language
Ordinary language, which uses symbolic conventions and associations which are often not logical, inherently contains many fuzzy concepts; "knowing what you mean" in this case depends on knowing the context, or being familiar with the way in which a term is normally used, or what it is associated with. This can be easily verified, for instance, by consulting a dictionary, a thesaurus or an encyclopedia, which show the multiple meanings of words, or by observing the behaviours involved in ordinary relationships which rely on mutually understood meanings.

To communicate, receive or convey a message, an individual somehow has to bridge his own intended meaning and the meanings which are understood by others, i.e. the message has to be conveyed in a way that it will be socially understood, preferably in the intended manner. Thus, people might state: "you have to say it in a way that I understand". This may be done instinctively, habitually or unconsciously, but it usually involves a choice of terms, assumptions or symbols whose meanings may often not be completely fixed, but which depend among other things on how the receiver of the message responds to it, or the context. In this sense, meaning is often "negotiated" or "interactive" (or, more cynically, manipulated). This gives rise to many fuzzy concepts.

But even using ordinary set theory and binary logic to reason something out, logicians have discovered that it is possible to generate statements which are, logically speaking, not completely true or which imply a paradox,[20] even though in other respects they conform to logical rules.
Psychology
The origin of fuzzy concepts is partly due to the fact that the human brain does not operate like a computer (see also Chinese room).[21] While computers use strict binary logic gates, the brain does not; i.e., it is capable of making all kinds of neural associations according to all kinds of ordering principles (or fairly chaotically) in associative patterns which are not logical but nevertheless meaningful. For example, a work of art can be meaningful without being logical. Something can be meaningful although we cannot name it, or we might only be able to name it and nothing else. The human brain can also interpret the same phenomenon in several different but interacting frames of reference, at the same time, or in quick succession, without there necessarily being an explicit logical connection between the frames. In part, fuzzy concepts arise also because learning or the growth of understanding involves a transition from a vague awareness, which cannot orient behaviour greatly, to clearer insight, which can orient behaviour. For example, the Dutch theologian Kees de Groot explores the imprecise notion that psychotherapy is like an "implicit religion",
defined as a "fuzzy concept" (it all depends on what one means by "psychotherapy" and "religion").[22]

Some logicians argue that fuzzy concepts are a necessary consequence of the reality that any kind of distinction we might like to draw has limits of application. At a certain level of generality, a distinction works fine. But if we pursued its application in a very exact and rigorous manner, or overextended its application, it would appear that the distinction simply does not apply in some areas or contexts, or that we cannot fully specify how it should be drawn. An analogy might be that zooming a telescope, camera, or microscope in and out reveals that a pattern which is sharply focused at a certain distance disappears (or becomes blurry) at another distance.

Faced with any large, complex and continually changing phenomenon, any short statement made about that phenomenon is likely to be "fuzzy", i.e. it is meaningful, but, strictly speaking, incorrect and imprecise. It will not really do justice to the reality of what is happening with the phenomenon. A correct, precise statement would require a lot of elaborations and qualifiers. Nevertheless, the "fuzzy" description turns out to be a useful shorthand that saves a lot of time in communicating what is going on ("you know what I mean").

In psychophysics it has been discovered that the perceptual distinctions we draw in the mind are often more sharply defined than they are in the real world. Thus, the brain actually tends to "sharpen up" our perceptions of differences in the external world. Between black and white, we are able to detect only a limited number of shades of gray, or colour gradations. If there are more gradations and transitions in reality than our conceptual distinctions can capture, then it could be argued that how those distinctions will actually apply must necessarily become vaguer at some point. If, for example, one wants to count and quantify distinct objects using numbers, one needs to be able to distinguish between those separate objects, but if this is difficult or impossible, then, although this may not invalidate a quantitative procedure as such, quantification is not really possible in practice; at best, we may be able to assume or infer indirectly a certain distribution of quantities.

Finally, in interacting with the external world, the human mind may often encounter new, or partly new, phenomena or relationships which cannot (yet) be sharply defined given the background knowledge available, and by known distinctions, associations or generalizations.

"Crisis management plans cannot be put 'on the fly' after the crisis occurs. At the outset, information is often vague, even contradictory. Events move so quickly that decision makers experience a sense of loss of control. Often denial sets in, and managers unintentionally cut off information flow about the situation" (L. Paul Bremer, "Corporate governance and crisis management", in: Directors & Boards, Winter 2002).

It can also be argued that fuzzy concepts are generated by a certain sort of lifestyle or way of working which evades definite distinctions, makes them impossible or inoperable, or which is in some way chaotic. To obtain concepts which are not fuzzy, it must be possible to test out their application in some way. But in the absence of any relevant clear distinctions, or when everything is "in a state of flux" or in transition, it may not be possible to do so, so that the amount of fuzziness increases.
Applications
Fuzzy concepts often play a role in the creative process of forming new concepts to understand something. In the most primitive sense, this can be observed in infants who, through practical experience, learn to identify, distinguish and generalise the correct application of a concept, and relate it to other concepts.[23] However, fuzzy concepts may also occur in scientific, journalistic, programming and philosophical activity, when a thinker is in the process of clarifying and defining a newly emerging concept which is based on distinctions which, for one reason or another, cannot (yet) be more exactly specified or validated. Fuzzy concepts are often used to denote complex phenomena, or to describe something which is developing and changing, which might involve shedding some old meanings and acquiring new ones. In politics, it can be highly important and problematic how exactly a conceptual distinction is drawn, or indeed whether a distinction is drawn at all; distinctions used in administration may be deliberately sharpened, or kept
fuzzy, due to some political motive or power relationship. A politician may be deliberately vague about some things, and very clear and explicit about others. The "fuzzy area" can also refer simply to a residual number of cases which cannot be allocated to a known and identifiable group, class or set.

In translation work, fuzzy concepts are analyzed for the purpose of good translation. A concept in one language may not have quite the same meaning or significance in another language, or it may not be feasible to translate it literally, or at all. Some languages have concepts which do not exist in another language, raising the problem of how one would most easily render their meaning. In computer-assisted translation, a technique called fuzzy matching is used to find the most likely translation of a piece of text, using previously translated texts as a basis.

In information services, fuzzy concepts are frequently encountered because a customer or client asks a question about something which could be interpreted in many different ways, or a document is transmitted of a type or meaning which cannot be easily allocated to a known type or category, or to a known procedure. It might take considerable inquiry to "place" the information, or establish in what framework it should be understood.

In the legal system, it is essential that rules are interpreted and applied in a standard way, so that the same cases and the same circumstances are treated equally. Otherwise one would be accused of arbitrariness, which would not serve the interests of justice. Consequently, lawmakers aim to devise definitions and categories which are sufficiently precise that they are not open to different interpretations. For this purpose, it is critically important to remove fuzziness, and differences of interpretation are typically resolved through a court ruling based on evidence. Alternatively, some other procedure is devised which permits the correct distinction to be discovered and made.

In statistical research, it is an aim to measure the magnitudes of phenomena. For this purpose, phenomena have to be grouped and categorized so that distinct and discrete counting units can be defined. It must be possible to allocate all observations to mutually exclusive categories so that they are properly quantifiable. Survey observations do not spontaneously transform themselves into countable data; they have to be identified, categorized and classified in such a way that they are not counted twice or more. Again, for this purpose it is a requirement that the concepts used are exactly defined, and not fuzzy. There could be a margin of error, but the amount of error must be kept within tolerable limits, and preferably its magnitude should be known.

In hypnotherapy, fuzzy language is deliberately used for the purpose of trance induction. Hypnotic suggestions are often couched in a somewhat vague, general or ambiguous language requiring interpretation by the subject. The intention is to distract and shift the conscious awareness of the subject away from external reality to his own internal state. In response to the somewhat confusing signals he gets, the awareness of the subject spontaneously tends to withdraw inward, in search of understanding or escape.[24]

In biology, protein complexes with multiple structural forms are called fuzzy complexes. The different conformations can result in different, even opposite, functions. The conformational ensemble is modulated by the environmental conditions.
Post-translational modifications or alternative splicing can also impact the ensemble, and thereby the affinity or specificity of interactions.

In theology, an attempt is made to define more precisely the meaning of spiritual concepts, which refer to how human beings construct the meaning of human existence and, often, the relationship people have with a supernatural world. Many spiritual concepts and beliefs are fuzzy, to the extent that, although abstract, they often have a highly personalized meaning, or involve personal interpretation of a type that is not easy to define in a cut-and-dried way.

In meteorology, where changes and effects of complex interactions in the atmosphere are studied, the weather reports often use fuzzy expressions indicating a broad trend, likelihood or level. The main reason is that the forecast can rarely be totally exact for any given location.

In phenomenology, which studies the structure of subjective experience, an important insight is that how someone experiences something can be influenced both by the influence of the thing being experienced itself, but also by
how the person responds to it. Thus, the actual experience the person has is shaped by an "interactive object-subject relationship". To describe this experience, fuzzy categories are often necessary, since it is often impossible to predict or describe with great exactitude what the interaction will be, and how it is experienced.

It could be argued that many concepts used fairly universally in daily life (e.g. "love" or "God" or "health" or "social") are inherently or intrinsically fuzzy concepts, to the extent that their meaning can never be completely and exactly specified with logical operators or objective terms, and can have multiple interpretations, which are in part exclusively subjective. Yet despite this limitation, such concepts are not meaningless. People keep using the concepts, even if they are difficult to define precisely. It may also be possible to specify one personal meaning for the concept, without however placing restrictions on a different use of the concept in other contexts (as when, for example, one says "this is what I mean by X" in contrast to other possible meanings). In ordinary speech, concepts may sometimes also be uttered purely randomly; for example, a child may repeat the same idea in completely unrelated contexts, or an expletive term may be uttered arbitrarily. A feeling or sense is conveyed, without it being fully clear what it is about.

Fuzzy concepts can be used deliberately to create ambiguity and vagueness, as an evasive tactic, or to bridge what would otherwise be immediately recognized as a contradiction of terms. They might be used to indicate that there is definitely a connection between two things, without giving a complete specification of what the connection is, for some or other reason. This could be due to a failure or refusal to be more precise. But it could also be a prologue to a more exact formulation of a concept, or to a better understanding.

Fuzzy concepts can also simply be a practical method to describe something of which a complete description would be an unmanageably large undertaking, or very time-consuming; thus, a simplified indication of what is at issue is regarded as sufficient, although it is not exact. There is also such a thing as an "economy of distinctions", meaning that it is not helpful or efficient to use more detailed definitions than are really necessary for a given purpose. The provision of "too many details" could be disorienting and confusing, instead of being enlightening, while a fuzzy term might be sufficient to provide an orientation. The reason for using fuzzy concepts can therefore be purely pragmatic, if it is not feasible for practical purposes to provide "all the details" about the meaning of a shared symbol or sign. Thus people might say "I realize this is not exact, but you know what I mean"; they assume practically that stating all the details is not required for the purpose of the communication.
Analysis
In mathematical logic, computer programming, philosophy and linguistics, fuzzy concepts can be analyzed and defined more accurately or comprehensively by describing or modelling the concepts using the terms of fuzzy logic. More generally, techniques can be used such as:

- concretizing the concept by finding specific examples, illustrations or cases to which it applies;
- specifying a range of conditions to which the concept applies (for example, in computer programming of a procedure);
- classifying or categorizing all or most cases or uses to which the concept applies (taxonomy);
- probing the assumptions on which a concept is based, or which are associated with its use (critical thought);
- identifying operational rules for the use of the concept, which cover all or most cases;
- allocating different applications of the concept to different but related sets (e.g. using Boolean logic);
- examining how probable it is that the concept applies, statistically or intuitively;
- examining the distribution or distributional frequency of (possibly different) uses of the concept;
- applying some other kind of measure or scale of the degree to which the concept applies;
- specifying a series of logical operators (an inferential system or algorithm) which captures all or most cases to which the concept applies;
- mapping or graphing the applications of the concept using some basic parameters;
- applying a meta-language which includes fuzzy concepts in a more inclusive categorical system which is not fuzzy;
- reducing or restating fuzzy concepts in terms which are simpler or similar, and which are not fuzzy or less fuzzy;
- relating the fuzzy concept to other concepts which are not fuzzy or less fuzzy, or simply replacing the fuzzy concept altogether with another, alternative concept which is not fuzzy yet "works exactly the same way".

In this way, we can obtain a more exact understanding of the use of a fuzzy concept, and possibly decrease the amount of fuzziness. It may not be possible to specify all the possible meanings or applications of a concept completely and exhaustively, but if it is possible to capture the majority of them, statistically or otherwise, this may be useful enough for practical purposes.

A process of defuzzification is said to occur when fuzzy concepts can be logically described in terms of (the relationships between) fuzzy sets, which makes it possible to define variations in the meaning or applicability of concepts as quantities. Effectively, qualitative differences may then be described more precisely as quantitative variations or quantitative variability (assigning a numerical value then denotes the magnitude of variation). An operationalization diagram is one method of clarifying fuzzy concepts.

The difficulty that can occur in judging the fuzziness of a concept can be illustrated with the question "Is this one of those?". If it is not possible to clearly answer this question, that could be because "this" (the object) is itself fuzzy and evades definition, or because "one of those" (the concept of the object) is fuzzy and inadequately defined. Thus, the source of fuzziness may be in the nature of the reality being dealt with, the concepts used to interpret it, or the way in which the two are being related by a person. It may be that the personal meanings which people attach to something are quite clear to the persons themselves, but that it is not possible to communicate those meanings to others except as fuzzy concepts.
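Defuzzification, as described above, can be illustrated by collapsing a fuzzy set to a single representative number. A minimal sketch using centroid defuzzification over an assumed "comfortable temperature" membership function (the shape and range are invented for illustration):

```python
import numpy as np

# A fuzzy set over a numeric universe: graded membership of each
# temperature in the (assumed) concept "comfortable".
temps = np.linspace(10, 35, 101)
membership = np.clip(1.0 - np.abs(temps - 22.0) / 8.0, 0.0, 1.0)

# Centroid defuzzification: the membership-weighted average collapses
# the graded concept to one crisp value.
crisp = (temps * membership).sum() / membership.sum()
print(crisp)   # ~22, the single value best representing the fuzzy set
```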
References
[1] Susan Haack, Deviant Logic, Fuzzy Logic: Beyond the Formalism. Chicago: University of Chicago Press, 1996.
[2] Richard Dietz & Sebastiano Moruzzi (eds.), Cuts and Clouds: Vagueness, Its Nature, and Its Logic. Oxford University Press, 2009.
[3] Ann Markusen, "Fuzzy Concepts, Scanty Evidence, Policy Distance: The Case for Rigour and Policy Relevance in Critical Regional Studies." Regional Studies, Volume 37, Issue 6-7, 2003, pp. 701-717.
[4] Roy T. Cook, A Dictionary of Philosophical Logic. Edinburgh University Press, 2009, p. 84.
[5] Kazuo Tanaka, An Introduction to Fuzzy Logic for Practical Applications. Springer, 1996; Constantin Zopounidis, Panos M. Pardalos & George Baourakis, Fuzzy Sets in Management, Economics and Marketing. Singapore: World Scientific Publishing Co., 2001.
[7] Irem Dikmen, M. Talat Birgonul and Sedat Han, "Using fuzzy risk assessment to rate cost overrun risk in international construction projects." International Journal of Project Management, Vol. 25, No. 5, July 2007, pp. 494-505.
[8] Susan Haack notes that Stanisław Jaśkowski provided axiomatizations of many-valued logics in: Jaśkowski, "On the rules of supposition in formal logic." Studia Logica, No. 1, 1934. (http://www.logik.ch/daten/jaskowski.pdf) See Susan Haack, Philosophy of Logics. Cambridge University Press, 1978, p. 205.
[9] Priyanka Kaushal, Neeraj Mohan and Parvinder S. Sandhu, "Relevancy of Fuzzy Concept in Mathematics." International Journal of Innovation, Management and Technology, Vol. 1, No. 3, August 2010. (http://ijimt.org/papers/58-M450.pdf)
[10] Lotfi A. Zadeh, "Fuzzy sets." Information and Control, Vol. 8, June 1965, pp. 338-353. (http://www-bisc.cs.berkeley.edu/Zadeh-1965.pdf)
[11] Siegfried Gottwald, "Shaping the logic of fuzzy set theory." In: Petr Cintula et al. (eds.), Witnessed Years: Essays in Honour of Petr Hájek. London: College Publications, 2009, pp. 193-208. (http://www.uni-leipzig.de/~logik/gottwald/Hajek09.pdf)
[12] Radim Belohlavek, "What is a fuzzy concept lattice? II." In: Sergei O. Kuznetsov et al. (eds.), Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. Berlin: Springer Verlag, 2011, pp. 19-20. (http://belohlavek.inf.upol.cz/publications/BeVy_Wifcl.pdf)
[13] George Lakoff, "Hedges: A Study in Meaning Criteria and the Logic of Fuzzy Concepts." Journal of Philosophical Logic, Vol. 2, 1973, pp. 458-508. (http://georgelakoff.files.wordpress.com/2011/01/hedges-a-study-in-meaning-criteria-and-the-logic-of-fuzzy-concepts-journal-of-philosophical-logic-2-lakoff-19731.pdf)
[14] Charles Ragin, Redesigning Social Inquiry: Fuzzy Sets and Beyond. University of Chicago Press, 2008; Shaomin Li, "Measuring the fuzziness of human thoughts: An application of fuzzy sets to sociological research." The Journal of Mathematical Sociology, Volume 14, Issue 1, 1989, pp. 67-84.
[15] Jörg Rössel and Randall Collins, "Conflict theory and interaction rituals: The microfoundations of conflict theory." In: Jonathan H. Turner (ed.), Handbook of Sociological Theory. New York: Springer, 2001, p. 527.
[16] Loïc Wacquant, "The fuzzy logic of practical sense." In: Pierre Bourdieu and Loïc Wacquant, An Invitation to Reflexive Sociology. London: Polity Press, 1992, chapter I, section 4.
[17] Ph. Manning, "Fuzzy Description: Discovery and Invention in Sociology." History of the Human Sciences, Vol. 7, No. 1, 1994, pp. 117-123.
[18] Masao Mukaidono, Fuzzy Logic for Beginners. Singapore: World Scientific Publishing, 2001.
[20] Patrick Hughes & George Brecht, Vicious Circles and Infinity: An Anthology of Paradoxes. Penguin Books, 1978.
[21] See further Radim Belohlavek & George J. Klir (eds.), Concepts and Fuzzy Logic. MIT Press, 2011; John R. Searle, "Minds, brains and programs." The Behavioral and Brain Sciences, Vol. 3, No. 3, 1980, pp. 417-457.
[22] C.N. de Groot, "Sociology of religion looks at psychotherapy." Recherches sociologiques (Louvain-la-Neuve, Belgium), Vol. 29, No. 2, 1998, pp. 3-17, at p. 4. (http://arno.uvt.nl/show.cgi?fid=76988)
[23] Philip J. Kelman & Martha E. Arterberry, The Cradle of Knowledge: Development of Perception in Infancy. Cambridge, Mass.: The MIT Press, 2000.
[24] Ronald A. Havens (ed.), The Wisdom of Milton H. Erickson, Volume I: Hypnosis and Hypnotherapy. New York: Irvington Publishers, 1992, p. 106; Joseph O'Connor & John Seymour (eds.), Introducing Neuro-Linguistic Programming. London: Thorsons, 1995, p. 116f.
External links
James F. Brule, Fuzzy systems tutorial (http://www.austinlinks.com/Fuzzy/tutorial.html)
"Fuzzy Logic", Stanford Encyclopedia of Philosophy (http://plato.stanford.edu/entries/logic-fuzzy/)
G factor (psychometrics)
The g factor (short for "general factor") is a construct developed in psychometric investigations of cognitive abilities. It is a variable that summarizes positive correlations among different cognitive tasks, reflecting the fact that an individual's performance at one type of cognitive task tends to be comparable to his or her performance at other kinds of cognitive tasks. The g factor typically accounts for 40 to 50 percent of the variance in IQ test performance, and IQ scores are frequently regarded as estimates of individuals' standing on the g factor.[1] The terms IQ, general intelligence, general cognitive ability, general mental ability, or simply intelligence are often used interchangeably to refer to the common core shared by cognitive tests.[2] The existence of the g factor was originally proposed by the English psychologist Charles Spearman in the early years of the 20th century. He observed that children's performance ratings across seemingly unrelated school subjects were positively correlated, and reasoned that these correlations reflected the influence of an underlying general mental ability that entered into performance on all kinds of mental tests. Spearman suggested that all mental
performance could be conceptualized in terms of a single general ability factor, which he labeled g, and a large number of narrow task-specific ability factors. Today's factor models of intelligence typically represent cognitive abilities as a three-level hierarchy, where there are a large number of narrow factors at the bottom of the hierarchy, a handful of broad, more general factors at the intermediate level, and at the apex a single factor, referred to as the g factor, which represents the variance common to all cognitive tasks.

Traditionally, research on g has concentrated on psychometric investigations of test data, with a special emphasis on factor analytic approaches. However, empirical research on the nature of g has also drawn upon experimental cognitive psychology and mental chronometry, brain anatomy and physiology, quantitative and molecular genetics, and primate evolution.[3] While the existence of g as a statistical regularity is well-established and uncontroversial, there is no consensus as to what causes the positive correlations between tests.

Behavioral genetic research has established that the construct of g is highly heritable. It has a number of other biological correlates, including brain size. It is also a significant predictor of individual differences in many social outcomes, particularly in education and the world of work. The most widely accepted contemporary theories of intelligence incorporate the g factor.[4] However, critics of g have contended that an emphasis on g is misplaced and entails a devaluation of other important abilities.
Table: Subtest intercorrelations in a sample of Scottish subjects who completed the WAIS-R battery. The subtests are Vocabulary, Similarities, Information, Comprehension, Picture arrangement, Block design, Arithmetic, Picture completion, Digit span, Object assembly, and Digit symbol. The bottom row of the original table shows the g loadings of each subtest.[6]
Mental tests may be designed to measure different aspects of cognition. Specific domains assessed by tests include mathematical skill, verbal fluency, spatial visualization, and memory, among others. However, individuals who excel at one type of test tend to excel at other kinds of tests, too, while those who do poorly on one test tend to do so on all tests, regardless of the tests' contents.[7]

The English psychologist Charles Spearman was the first to describe this phenomenon.[8] In a famous research paper published in 1904,[9] he observed that children's performance measures across seemingly unrelated school subjects were positively correlated. This finding has since been replicated numerous times. The consistent finding of universally positive correlation matrices of mental test results (or the "positive manifold"), despite large differences in tests' contents, has been described as "arguably the most replicated result in all psychology."[10] Zero or negative correlations between tests suggest the presence of sampling error or restriction of the range of ability in the sample studied.[11]

Using factor analysis or related statistical methods, it is possible to compute a single common factor that can be regarded as a summary variable characterizing the correlations between all the different tests in a test battery. Spearman referred to this common factor as the general factor, or simply g. (By convention, g is always printed as a lower case italic.) Mathematically, the g factor is a source of variance among individuals, which entails that one cannot meaningfully speak of any one individual's mental abilities consisting of g or other factors to any specified degrees. One can only speak of an individual's standing on g (or other factors) compared to other individuals in a relevant population.[12][13][11]

Different tests in a test battery may correlate with (or "load onto") the g factor of the battery to different degrees. These correlations are known as g loadings. An individual test taker's g factor score, representing his or her relative standing on the g factor in the total group of individuals, can be estimated using the g loadings. Full-scale IQ scores from a test battery will usually be highly correlated with g factor scores, and they are often regarded as estimates of g. For example, the correlations between g factor scores and full-scale IQ scores from Wechsler's tests have been found to be greater than .95.[14][11][1] The terms IQ, general intelligence, general cognitive ability, general mental ability, or simply intelligence are frequently used interchangeably to refer to the common core shared by cognitive tests.[2]

The g loadings of mental tests are always positive and usually range between .10 and .90, with a mean of about .60 and a standard deviation of about .15. Raven's Progressive Matrices is among the tests with the highest g loadings, around .80. Tests of vocabulary and general information are also typically found to have high g loadings.[15][16] However, the g loading of the same test may vary somewhat depending on the composition of the test battery.[17]

The complexity of tests and the demands they place on mental manipulation are related to the tests' g loadings. For example, in the forward digit span test the subject is asked to repeat a sequence of digits in the order of their presentation after hearing them once at a rate of one digit per second.
The backward digit span test is otherwise the same except that the subject is asked to repeat the digits in the reverse order to that in which they were presented. The backward digit span test is more complex than the forward digit span test, and it has a significantly higher g loading. Similarly, the g loadings of arithmetic computation, spelling, and word reading tests are lower than those of arithmetic problem solving, text composition, and reading comprehension tests, respectively.[18][19]

Test difficulty and g loadings are distinct concepts that may or may not be empirically related in any specific situation. Tests that have the same difficulty level, as indexed by the proportion of test items that are failed by test takers, may exhibit a wide range of g loadings. For example, tests of rote memory have been shown to have the same level of difficulty but considerably lower g loadings than many tests that involve reasoning.[20][21]
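As a concrete illustration of the factor-analytic procedure described above, the following sketch extracts g loadings from a small correlation matrix, using the first principal component as a simple stand-in for formal factor extraction. The matrix and all its values are invented for illustration, not taken from any real test battery.

    import numpy as np

    # Hypothetical correlation matrix for four mental tests (not real data)
    R = np.array([
        [1.00, 0.55, 0.45, 0.40],
        [0.55, 1.00, 0.50, 0.42],
        [0.45, 0.50, 1.00, 0.38],
        [0.40, 0.42, 0.38, 1.00],
    ])

    # Loadings on the first component: the leading eigenvector scaled by
    # the square root of its eigenvalue (eigh returns ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(R)
    g_loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    g_loadings *= np.sign(g_loadings.sum())   # orient the loadings positively
    print(g_loadings.round(2))                # one "g loading" per test

Because the invented matrix is uniformly positive, every test receives a sizeable positive loading, mirroring the positive manifold described in the text.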
Theories of g
While the existence of g as a statistical regularity is well-established and uncontroversial among experts, there is no consensus as to what causes the positive intercorrelations. Several explanations have been proposed.[22]
Sampling theory
The so-called sampling theory of g, originally developed by E.L. Thorndike and Godfrey Thomson, proposes that the existence of the positive manifold can be explained without reference to a unitary underlying capacity. According to this theory, there are a number of uncorrelated mental processes, and all tests draw upon different samples of these processes. The intercorrelations between tests are caused by an overlap between the processes tapped by the tests.[27][28] Thus, the positive manifold arises due to a measurement problem, an inability to measure more fine-grained, presumably uncorrelated mental processes.[13]

It has been shown that it is not possible to distinguish statistically between Spearman's model of g and the sampling model; both are equally able to account for intercorrelations among tests.[29] The sampling theory is also consistent with the observation that more complex mental tasks have higher g loadings, because more complex tasks are expected to involve a larger sampling of neural elements and therefore have more of them in common with other tasks.[30]

Some researchers have argued that the sampling model invalidates g as a psychological concept, because the model suggests that g factors derived from different test batteries simply reflect the shared elements of the particular tests contained in each battery rather than a g that is common to all tests. Similarly, high correlations between different batteries could be due to them measuring the same set of abilities rather than the same ability.[31]

Critics have argued that the sampling theory is incongruent with certain empirical findings. Based on the sampling theory, one might expect related cognitive tests to share many elements and thus be highly correlated. However, some closely related tests, such as forward and backward digit span, are only modestly correlated, while some seemingly completely dissimilar tests, such as vocabulary tests and Raven's matrices, are consistently highly correlated. Another problematic finding is that brain damage frequently leads to specific cognitive impairments rather than the general impairment one might expect based on the sampling theory.[32][13]
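The overlap mechanism is easy to demonstrate in a small simulation. In the sketch below (a toy model, with all sizes chosen arbitrarily), a set of mutually uncorrelated elementary processes is generated, and each hypothetical test simply sums a random subset of them; the overlap between subsets alone produces uniformly positive correlations among the tests, with no unitary g built in.

    import numpy as np

    rng = np.random.default_rng(0)
    n_people, n_processes, n_tests, sample_size = 2000, 100, 6, 40

    # Uncorrelated elementary processes for each simulated person
    processes = rng.normal(size=(n_people, n_processes))

    # Each "test" sums a random sample of 40 of the 100 processes
    tests = np.column_stack([
        processes[:, rng.choice(n_processes, sample_size, replace=False)].sum(axis=1)
        for _ in range(n_tests)
    ])

    # All off-diagonal correlations come out positive (around .4 here)
    print(np.corrcoef(tests, rowvar=False).round(2))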
Mutualism
The "mutualism" model of g proposes that cognitive processes are initially uncorrelated, but that the positive manifold arises during individual development due to mutual beneficial relations between cognitive processes. Thus there is no single process or capacity underlying the positive correlations between tests. During the course of development, the theory holds, any one particularly efficient process will benefit other processes, with the result that the processes will end up being correlated with one another. Thus similarly high IQs in different persons may stem from quite different initial advantages that they had.[33][13] Critics have argued that the observed correlations between the g loadings and the heritability coefficients of subtests are problematic for the mutualism theory.[34]
Through factor rotation, it is, in principle, possible to produce an infinite number of different factor solutions that are mathematically equivalent in their ability to account for the intercorrelations among cognitive tests. These include solutions that do not contain a g factor. Thus factor analysis alone cannot establish what the underlying structure of intelligence is. In choosing between different factor solutions, researchers have to examine the results of factor analysis together with other information about the structure of cognitive abilities.[37]
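The rotational indeterminacy described above can be verified directly: an orthogonal rotation of a (hypothetical) factor loading matrix reproduces exactly the same correlation structure, so the data alone cannot distinguish between the two solutions.

    import numpy as np

    # Invented two-factor loading matrix for four tests
    L = np.array([[0.7,  0.3],
                  [0.6,  0.4],
                  [0.5, -0.5],
                  [0.4, -0.6]])

    # Any orthogonal rotation matrix Q (here, a 30-degree rotation)
    theta = np.pi / 6
    Q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    R1 = L @ L.T                # correlations reproduced by solution 1
    R2 = (L @ Q) @ (L @ Q).T    # correlations reproduced by the rotated solution
    print(np.allclose(R1, R2))  # True: the two solutions fit the data equally well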
An illustration of John B. Carroll's three stratum theory, an influential contemporary model of cognitive abilities. The broad abilities recognized by the model are fluid intelligence (Gf), crystallized intelligence (Gc), general memory and learning (Gy), broad visual perception (Gv), broad auditory perception (Gu), broad retrieval ability (Gr), broad cognitive speediness (Gs), and processing speed (Gt). Carroll regarded the broad abilities as different "flavors" of g.
There are many psychologically relevant reasons for preferring factor solutions that contain a g factor. These include the existence of the positive manifold, the fact that certain kinds of tests (generally the more complex ones) have consistently larger g loadings, the substantial invariance of g factors across different test batteries, the impossibility of constructing test batteries that do not yield a g factor, and the widespread practical validity of g as a predictor of individual outcomes. The g factor, together with group factors, best represents the empirically established fact that, on average, overall ability differences between individuals are greater than differences among abilities within individuals, while a factor solution with orthogonal factors without g obscures this fact. Moreover, g appears to be the most heritable component of intelligence.[38] Research utilizing the techniques of confirmatory factor analysis has also provided support for the existence of g.[37]

A g factor can be computed from a correlation matrix of test results using several different methods. These include exploratory factor analysis, principal components analysis (PCA), and confirmatory factor analysis. Different factor-extraction methods produce highly consistent results, although PCA has sometimes been found to produce inflated estimates of the influence of g on test scores.[39][17]

There is a broad contemporary consensus that cognitive variance between people can be conceptualized at three hierarchical levels, distinguished by their degree of generality. At the lowest, least general level there are a large number of narrow first-order factors; at a higher level, there are a relatively small number (somewhere between five and ten) of broad (i.e., more general) second-order factors, or group factors; and at the apex there is a single third-order factor, g, the general factor common to all tests.[40][41][42] The g factor usually accounts for the majority of the total common factor variance of IQ test batteries.[43] Contemporary hierarchical models of intelligence include the three stratum theory and the Cattell–Horn–Carroll theory.[44]
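A rough sketch of the extraction-method comparison is given below, using an invented correlation matrix: the first component from PCA is set against a simple iterated principal-axis factoring, a basic exploratory method. Consistent with the text, the PCA loadings come out slightly higher, because PCA keeps each test's unique variance on the diagonal while factoring replaces it with communalities.

    import numpy as np

    # Invented correlation matrix for four tests
    R = np.array([
        [1.00, 0.55, 0.45, 0.40],
        [0.55, 1.00, 0.50, 0.42],
        [0.45, 0.50, 1.00, 0.38],
        [0.40, 0.42, 0.38, 1.00],
    ])

    def first_loading(M):
        """Loadings on the leading component of a symmetric matrix."""
        vals, vecs = np.linalg.eigh(M)
        v = vecs[:, -1] * np.sqrt(vals[-1])
        return np.sign(v.sum()) * v

    pca = first_loading(R)

    # Iterated principal-axis factoring: put communality estimates on the
    # diagonal and refine them until they stabilize
    Rf, h2 = R.copy(), np.full(4, 0.5)
    for _ in range(100):
        np.fill_diagonal(Rf, h2)
        paf = first_loading(Rf)
        h2 = paf ** 2

    print(pca.round(2))   # PCA loadings: slightly inflated
    print(paf.round(2))   # factor loadings: slightly lower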
In a composite score computed from a large and diverse test battery, the g components of the individual tests accumulate in the composite score, while the uncorrelated non-g components will cancel each other out. Theoretically, the composite score of an infinitely large, diverse test battery would, then, be a perfect measure of g.[46] In contrast, L.L. Thurstone argued that a g factor extracted from a test battery reflects the average of all the abilities called for by the particular battery, and that g therefore varies from one battery to another and "has no fundamental psychological significance."[47] Along similar lines, John Horn argued that g factors are meaningless because they are not invariant across test batteries, maintaining that correlations between different ability measures arise because it is difficult to define a human action that depends on just one ability.[48][49]

To show that different batteries reflect the same g, one must administer several test batteries to the same individuals, extract g factors from each battery, and show that the factors are highly correlated.[50] Wendy Johnson and colleagues have published two such studies.[51][52] The first found that the correlations between g factors extracted from three different batteries were .99, .99, and 1.00, supporting the hypothesis that g factors from different batteries are the same and that the identification of g is not dependent on the specific abilities assessed. The second study found that g factors derived from four of five test batteries correlated at between .95 and 1.00, while the correlations ranged from .79 to .96 for the fifth battery, the Cattell Culture Fair Intelligence Test (the CFIT). They attributed the somewhat lower correlations with the CFIT battery to its lack of content diversity, for it contains only matrix-type items, and interpreted the findings as supporting the contention that g factors derived from different test batteries are the same provided that the batteries are diverse enough. The results suggest that the same g can be consistently identified from different test batteries.[53][40]
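The logic of these battery-comparison studies can be mimicked in simulation. In the sketch below, all loadings, battery sizes, and sample sizes are invented: a single latent ability generates scores on two different hypothetical batteries for the same simulated people, and g scores extracted separately from each battery are then correlated.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 5000
    g = rng.normal(size=n)                     # latent general ability

    def battery(loadings):
        """Test scores that each load on g plus independent unique noise."""
        lam = np.array(loadings)
        noise = rng.normal(size=(n, len(lam))) * np.sqrt(1 - lam ** 2)
        return g[:, None] * lam + noise

    A = battery([0.80, 0.70, 0.75, 0.60, 0.70, 0.65, 0.70, 0.75])
    B = battery([0.75, 0.70, 0.65, 0.70, 0.80, 0.60, 0.70, 0.65])

    def g_score(X):
        """First principal component score of the standardized tests."""
        Z = (X - X.mean(0)) / X.std(0)
        vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
        s = Z @ vecs[:, -1]
        return s * np.sign(np.corrcoef(s, Z[:, 0])[0, 1])

    # Around .85-.9 in this toy setup; larger, more diverse and more
    # reliable batteries push the correlation toward 1
    print(np.corrcoef(g_score(A), g_score(B))[0, 1].round(3))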
Population distribution
The form of the population distribution of g is unknown, because g cannot be measured on a ratio scale. (The distributions of scores on typical IQ tests are roughly normal, but this is achieved by construction, i.e., by appropriate item selection by test developers.) It has been argued that there are nevertheless good reasons for supposing that g is normally distributed in the general population, at least within a range of 2 standard deviations from the mean. In particular, g can be thought of as a composite variable that reflects the additive effects of a large number of independent genetic and environmental influences, and such a variable should, according to the central limit theorem, follow a normal distribution.[54]
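The central-limit argument can be illustrated in a few lines: a composite of many small, independent, individually non-normal influences is very close to normal. The counts below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(7)
    # 500 small, independent, non-normal (uniform) influences per person
    influences = rng.uniform(-1.0, 1.0, size=(100_000, 500))
    trait = influences.sum(axis=1)            # composite "g-like" variable

    mean, sd = trait.mean(), trait.std()
    # Fraction within two standard deviations of the mean
    print(np.mean(np.abs(trait - mean) < 2 * sd))   # ~0.954, as a normal predicts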
Spearman's law of diminishing returns

Spearman's law of diminishing returns (SLDR), also known as the ability differentiation hypothesis, holds that g accounts for a smaller share of individual differences in cognitive performance at higher levels of ability, so samples spanning a wide range of ability are ideal for examining SLDR. Tucker-Drob (2009)[59] extensively reviewed the literature on SLDR and the various methods by which it had been previously tested, and proposed that SLDR could be most appropriately captured by fitting a common factor model that allows the relations between the factor and its indicators to be nonlinear in nature. He applied such a factor model to nationally representative data on children and adults in the United States and found consistent evidence for SLDR. For example, he found that a general factor accounted for approximately 75% of the variation in seven different cognitive abilities among very low IQ adults, but only approximately 30% of the variation in the abilities among very high IQ adults.
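A toy simulation of this differentiation effect is sketched below. The functional form (a saturating link between the latent factor and the tests) and all numbers are assumptions made purely for illustration; the point is only that a first component then explains a visibly larger share of variance in the low-ability group than in the high-ability group.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 20_000, 7
    g = rng.normal(size=n)

    # Saturating link: g's marginal influence shrinks as g grows
    common = -np.log1p(np.exp(-g))
    tests = 0.9 * common[:, None] + rng.normal(size=(n, k)) * 0.6

    def first_eig_share(X):
        """Share of total variance carried by the first principal component."""
        vals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
        return vals[-1] / vals.sum()

    low = g < np.quantile(g, 0.25)
    high = g > np.quantile(g, 0.75)
    # The low-ability group shows the larger first-factor share
    print(first_eig_share(tests[low]), first_eig_share(tests[high]))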
Practical validity
The practical validity of g as a predictor of educational, economic, and social outcomes is more far-ranging and universal than that of any other known psychological variable, and the validity of g is greater the more complex the task.[60][61]

A test's practical validity is measured by its correlation with performance on some criterion external to the test, such as college grade-point average or a rating of job performance. The correlation between test scores and a criterion measure is called the validity coefficient. One way to interpret a validity coefficient is to square it to obtain the proportion of variance accounted for by the test; for example, a validity coefficient of .30 corresponds to 9 percent of variance explained. This approach has, however, been criticized as misleading and uninformative, and several alternatives have been proposed. One arguably more interpretable approach is to look at the percentage of test takers in each test score quintile who meet some agreed-upon standard of success. For example, if the correlation between test scores and performance is .30, the expectation is that 67 percent of those in the top quintile will be above-average performers, compared to 33 percent of those in the bottom quintile.[62][63]
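The quintile figures quoted above are easy to verify by simulation under the usual bivariate-normal assumption:

    import numpy as np

    rng = np.random.default_rng(11)
    n, r = 1_000_000, 0.30
    test = rng.normal(size=n)
    # Performance correlated with test scores at exactly r
    perf = r * test + np.sqrt(1 - r ** 2) * rng.normal(size=n)

    top = test > np.quantile(test, 0.8)       # top test-score quintile
    bottom = test < np.quantile(test, 0.2)    # bottom test-score quintile
    # Fractions of above-average performers: ~.67 and ~.33
    print((perf[top] > 0).mean(), (perf[bottom] > 0).mean())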
Academic achievement
The predictive validity of g is most conspicuous in the domain of scholastic performance. This is apparently because g is closely linked to the ability to learn novel material and understand concepts and meanings.[64] In elementary school, the correlation between IQ and grades and achievement scores is between .60 and .70. At more advanced educational levels, more students from the lower end of the IQ distribution drop out, which restricts the range of IQs and results in lower validity coefficients. In high school, college, and graduate school the validity coefficients are .50–.60, .40–.50, and .30–.40, respectively.

The g loadings of IQ scores are high, but it is possible that some of the validity of IQ in predicting scholastic achievement is attributable to factors measured by IQ independent of g. According to research by Robert L. Thorndike, 80 to 90 percent of the predictable variance in scholastic performance is due to g, with the rest attributed to non-g factors measured by IQ and other tests.[65] Achievement test scores are more highly correlated with IQ than school grades. This may be because grades are more influenced by the teacher's idiosyncratic perceptions of the student.[66]

In a longitudinal English study, g scores measured at age 11 correlated with all 25 subject tests of the national GCSE examination taken at age 16. The correlations ranged from .77 for the mathematics test to .42 for the art test. The correlation between g and a general educational factor computed from the GCSE tests was .81.[67]

Research suggests that the SAT, widely used in college admissions, is primarily a measure of g. A correlation of .82 has been found between g scores computed from an IQ test battery and SAT scores. In a study of 165,000 students at 41 U.S. colleges, SAT scores were found to be correlated at .47 with first-year college grade-point average after correcting for range restriction in SAT scores (when course difficulty is held constant, i.e., if all students attended the same set of classes, the correlation rises to .55).[62][68]
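The corrected correlations mentioned above rest on the standard correction for direct range restriction (Thorndike's case II). A minimal sketch, with invented round numbers rather than the actual SAT figures:

    import math

    def correct_range_restriction(r_restricted, sd_restricted, sd_unrestricted):
        """Estimate the full-population correlation from a sample whose
        predictor variance was curtailed by selection (Thorndike case II)."""
        u = sd_unrestricted / sd_restricted
        return r_restricted * u / math.sqrt(1 + r_restricted ** 2 * (u ** 2 - 1))

    # e.g., an observed r of .35 among admitted students whose test-score
    # spread is only 60% of the applicant pool's spread
    print(round(correct_range_restriction(0.35, 0.6, 1.0), 2))   # ~0.53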
Income
The correlation between income and g, as measured by IQ scores, averages about .40 across studies. The correlation is higher at higher levels of education and it increases with age, stabilizing when people reach their highest career potential in middle age. Even when education, occupation and socioeconomic background are held constant, the correlation does not vanish.[73]
Other correlates
The g factor is reflected in many social outcomes. Many social behavior problems, such as dropping out of school, chronic welfare dependency, accident proneness, and crime, are negatively correlated with g independent of social class of origin.[74] Health and mortality outcomes are also linked to g, with higher childhood test scores predicting better health and mortality outcomes in adulthood (see Cognitive epidemiology).[75]
Genetic correlations indicate the extent to which the same genetic effects influence two different traits, regardless of how high or low the heritability of each is. Genetic correlations between specific mental abilities (such as verbal ability and spatial ability) have been consistently found to be very high, close to 1.0. This indicates that genetic variation in cognitive abilities is almost entirely due to genetic variation in whatever g is. It also suggests that what is common among cognitive abilities is largely caused by genes, and that independence among abilities is largely due to environmental effects. Thus it has been argued that when genes for intelligence are identified, they will be "generalist genes", each affecting many different cognitive abilities.[77][79][80]

The g loadings of mental tests have been found to correlate with their heritabilities, with correlations ranging from moderate to perfect in various studies. Thus the heritability of a mental test is usually higher the larger its g loading is.[34]

Much research points to g being a highly polygenic trait influenced by a large number of common genetic variants, each having only small effects. Another possibility is that heritable differences in g are due to individuals having different "loads" of rare, deleterious mutations, with genetic variation among individuals persisting due to mutation–selection balance.[80][81] A number of candidate genes have been reported to be associated with intelligence differences, but the effect sizes have been small and almost none of the findings have been replicated. No individual genetic variants have been conclusively linked to intelligence in the normal range so far. Many researchers believe that very large samples will be needed to reliably detect individual genetic polymorphisms associated with g.[40][81] However, while genes influencing variation in g in the normal range have proven difficult to find, a large number of single-gene disorders with mental retardation among their symptoms have been discovered.[82]

Several studies suggest that tests with larger g loadings are more affected by inbreeding depression lowering test scores. There is also evidence that tests with larger g loadings are associated with larger positive heterotic effects on test scores. Inbreeding depression and heterosis suggest the presence of genetic dominance effects for g.[83]
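This kind of analysis, often called the method of correlated vectors, amounts to correlating a vector of subtest g loadings with the corresponding vector of heritability estimates. A minimal illustration with invented numbers:

    import numpy as np

    # Hypothetical g loadings and heritability estimates for six subtests
    g_load = np.array([0.83, 0.80, 0.75, 0.70, 0.68, 0.56])
    herit  = np.array([0.70, 0.72, 0.60, 0.55, 0.58, 0.40])

    # A positive vector correlation is the pattern the text describes
    print(np.corrcoef(g_load, herit)[0, 1].round(2))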
Neuroscientific findings
g has a number of correlates in the brain. Studies using magnetic resonance imaging (MRI) have established that g and total brain volume are moderately correlated (r ≈ .3–.4). External head size has a correlation of about .2 with g. MRI research on brain regions indicates that the volumes of the frontal, parietal, and temporal cortices and the hippocampus are also correlated with g, generally at .25 or more, while the correlations with overall grey matter and overall white matter, averaged over many studies, have been found to be .31 and .27, respectively. Some but not all studies have also found positive correlations between g and cortical thickness. However, the underlying reasons for these associations between the quantity of brain tissue and differences in cognitive abilities remain largely unknown.[2]

Most researchers believe that intelligence cannot be localized to a single brain region, such as the frontal lobe. It has been suggested that intelligence could be characterized as a small-world network; for example, high intelligence could depend on the unobstructed transfer of information between the involved brain regions along white matter fibers. Brain lesion studies have found small but consistent associations indicating that people with more white matter lesions tend to have lower cognitive ability. Research utilizing NMR spectroscopy has found somewhat inconsistent but generally positive correlations between intelligence and white matter integrity, supporting the notion that white matter is important for intelligence.[2]

Some research suggests that, in addition to the integrity of white matter, its organizational efficiency is also related to intelligence. The hypothesis that brain efficiency plays a role in intelligence is supported by functional MRI research showing that more intelligent people generally process information more efficiently, i.e., they use fewer brain resources for the same task than less intelligent people.[2] Brain activity, as measured by EEG records or event-related potentials, and nerve conduction velocity also show small but relatively consistent associations with intelligence test scores.[84][85]
Working memory
One theory holds that g is identical or nearly identical to working memory capacity. Among other evidence for this view, some studies have found factors representing g and working memory to be perfectly correlated. However, in a meta-analysis the correlation was found to be considerably lower.[95] One criticism that has been made of studies that identify g with working memory is that "we do not advance understanding by showing that one mysterious concept is linked to another."[96]
Piagetian tasks
Psychometric theories of intelligence aim at quantifying intellectual growth and identifying ability differences between individuals and groups. In contrast, Jean Piaget's theory of cognitive development seeks to understand qualitative changes in children's intellectual development. Piaget designed a number of tasks to verify hypotheses arising from his theory. The tasks were not intended to measure individual differences, and they have no equivalent in psychometric intelligence tests.[97][98] For example, in one of the best-known Piagetian conservation tasks a child is asked if the amount of water in two identical glasses is the same. After the child agrees that the amount is the same, the investigator pours the water from one of the glasses into a glass of different shape so that the amount appears different although it remains the same. The child is then asked if the amount of water in the two glasses is the same or different. Notwithstanding the different research traditions in which psychometric tests and Piagetian tasks were developed, the correlations between the two types of measures have been found to be consistently positive and generally
moderate in magnitude. A common general factor underlies them. It has been shown that it is possible to construct a battery consisting of Piagetian tasks that is as good a measure of g as standard IQ tests.[99][100]
Personality
The traditional view in psychology is that there is no meaningful relationship between personality and intelligence, and that the two should be studied separately. Intelligence can be understood in terms of what an individual can do, or what his or her maximal performance is, while personality can be thought of in terms of what an individual will typically do, or what his or her general tendencies of behavior are. Research has indicated that correlations between measures of intelligence and personality are small, and it has thus been argued that g is a purely cognitive variable that is independent of personality traits. In a 2007 meta-analysis the correlations between g and the "Big Five" personality traits were found to be as follows: conscientiousness -.04, agreeableness .00, extraversion .02, openness .22, and emotional stability .09.
The same meta-analysis found a correlation of .20 between self-efficacy and g.[101][102][103] Some researchers have argued that the associations between intelligence and personality, albeit modest, are consistent. They have interpreted correlations between intelligence and personality measures in two main ways. The first perspective is that personality traits influence performance on intelligence tests. For example, a person may fail to perform at a maximal level on an IQ test due to his or her anxiety and stress-proneness. The second perspective considers intelligence and personality to be conceptually related, with personality traits determining how people apply and invest their cognitive abilities, leading to knowledge expansion and greater cognitive differentiation.[101][104]
Creativity
Some researchers believe that there is a threshold level of g below which socially significant creativity is rare, but that otherwise there is no relationship between the two. It has been suggested that this threshold is at least one standard deviation above the population mean. Above the threshold, personality differences are believed to be important determinants of individual variation in creativity.[105][106]

Others have challenged the threshold theory. While not disputing that opportunity and personal attributes other than intelligence, such as energy and commitment, are important for creativity, they argue that g is positively associated with creativity even at the high end of the ability distribution. The longitudinal Study of Mathematically Precocious Youth has provided evidence for this contention. It has shown that individuals identified by standardized tests as intellectually gifted in early adolescence accomplish creative achievements (for example, securing patents or publishing literary or scientific works) at several times the rate of the general population, and that even within the top 1 percent of cognitive ability, those with higher ability are more likely to make outstanding achievements. The study has also suggested that the level of g acts as a predictor of the level of achievement, while specific cognitive ability patterns predict the realm of achievement.[107][108]
Challenges to g
Gf-Gc theory
Raymond Cattell, a student of Charles Spearman's, rejected the unitary g factor model and divided g into two broad, relatively independent domains: fluid intelligence (Gf) and crystallized intelligence (Gc). Gf is conceptualized as a capacity to figure out novel problems, and it is best assessed with tests with little cultural or scholastic content, such as Raven's matrices. Gc can be thought of as consolidated knowledge, reflecting the skills and information that an individual acquires and retains throughout his or her life. Gc is dependent on education and other forms of acculturation, and it is best assessed with tests that emphasize scholastic and cultural knowledge.[109][44][2] Gf can be thought to primarily consist of current reasoning and problem solving capabilities, while Gc reflects the outcome of previously executed cognitive processes.[110]

The rationale for the separation of Gf and Gc was to explain individuals' cognitive development over time. While Gf and Gc have been found to be highly correlated, they differ in the way they change over a lifetime. Gf tends to peak at around age 20, slowly declining thereafter. In contrast, Gc is stable or increases across adulthood. A single general factor has been criticized as obscuring this bifurcated pattern of development. Cattell argued that Gf reflected individual differences in the efficiency of the central nervous system. Gc was, in Cattell's thinking, the result of a person "investing" his or her Gf in learning experiences throughout life.[44][2][111][31]

Cattell, together with John Horn, later expanded the Gf-Gc model to include a number of other broad abilities, such as Gq (quantitative reasoning) and Gv (visual-spatial reasoning). While all the broad ability factors in the extended Gf-Gc model are positively correlated and thus would enable the extraction of a higher order g factor, Cattell and Horn maintained that it would be erroneous to posit that a general factor underlies these broad abilities. They argued that g factors computed from different test batteries are not invariant and would give different values of g, and that the correlations among tests arise because it is difficult to test just one ability at a time.[112][113][2]

However, several researchers have suggested that the Gf-Gc model is compatible with a g-centered understanding of cognitive abilities. For example, John B. Carroll's three-stratum model of intelligence includes both Gf and Gc together with a higher-order g factor. Based on factor analyses of many data sets, some researchers have also argued that Gf and g are one and the same factor and that g factors from different test batteries are substantially invariant provided that the batteries are large and diverse.[44][114][115]
Critics of Howard Gardner's theory of multiple intelligences have argued that some of the independent domains of intelligence it posits stretch the ordinary meaning of the word. For example, Gardner contends that a successful career in professional sports or popular music reflects bodily-kinesthetic intelligence and musical intelligence, respectively, even though one might usually talk of athletic and musical skills, talents, or abilities instead. Another criticism of Gardner's theory is that many of his purportedly independent domains of intelligence are in fact correlated with each other. Responding to empirical analyses showing correlations between the domains, Gardner has argued that the correlations exist because of the common format of tests and because all tests require linguistic and logical skills. His critics have in turn pointed out that not all IQ tests are administered in the paper-and-pencil format, that aside from linguistic and logical abilities, IQ test batteries also contain measures of, for example, spatial abilities, and that elementary cognitive tasks (for example, inspection time and reaction time) that do not involve linguistic or logical reasoning correlate with conventional IQ batteries, too.[118][119][67][120]

Robert Sternberg, working with various colleagues, has also suggested that intelligence has dimensions independent of g. He argues that there are three classes of intelligence: analytic, practical, and creative. According to Sternberg, traditional psychometric tests measure only analytic intelligence, and should be augmented to test creative and practical intelligence as well. He has devised several tests to this effect. Sternberg equates analytic intelligence with academic intelligence, and contrasts it with practical intelligence, defined as an ability to deal with ill-defined real-life problems. Tacit knowledge is an important component of practical intelligence, consisting of knowledge that is not explicitly taught but is required in many real-life situations. Assessing creativity independent of intelligence tests has traditionally proved difficult, but Sternberg and colleagues have claimed to have created valid tests of creativity, too. The validation of Sternberg's theory requires that the three abilities tested are substantially uncorrelated and have independent predictive validity. Sternberg has conducted many experiments which he claims confirm the validity of his theory, but several researchers have disputed this conclusion. For example, in his reanalysis of a validation study of Sternberg's STAT test, Nathan Brody showed that the predictive validity of the STAT, a test of three allegedly independent abilities, was solely due to a single general factor underlying the tests, which Brody equated with the g factor.[121][122]
Other criticisms
Perhaps the most famous critique of the construct of g is that of the paleontologist and biologist Stephen Jay Gould, presented in his 1981 book The Mismeasure of Man. He argued that psychometricians have fallaciously reified the g factor as a physical thing in the brain, even though it is simply the product of statistical calculations (i.e., factor analysis). He further noted that it is possible to produce factor solutions of cognitive test data that do not contain a g factor yet explain the same amount of information as solutions that yield a g. According to Gould, there is no rationale for preferring one factor solution to another, and factor analysis therefore does not lend support to the existence of an entity like g. More generally, Gould criticized the g theory for abstracting intelligence as a single entity and for ranking people "in a single series of worthiness", arguing that such rankings are used to justify the oppression of disadvantaged groups.[123][37]

Many researchers have criticized Gould's arguments. For example, they have rejected the accusation of reification, maintaining that the use of extracted factors such as g as potential causal variables whose reality can be supported or rejected by further investigations constitutes a normal scientific practice that in no way distinguishes psychometrics from other sciences. Critics have also suggested that Gould did not understand the purpose of factor analysis, and that he was ignorant of relevant methodological advances in the field. While different factor solutions may be mathematically equivalent in their ability to account for intercorrelations among tests, solutions that yield a g factor are psychologically preferable for several reasons extrinsic to factor analysis, including the phenomenon of the positive manifold, the fact that the same g can emerge from quite different test batteries, the widespread practical validity of g, and the linkage of g to many biological variables.[38][124][37]

John Horn and John McArdle have argued that the modern g theory, as espoused by, for example, Arthur Jensen, is unfalsifiable, because the existence of a common factor follows tautologically from positive correlations among
tests. They contrasted the modern hierarchical theory of g with Spearman's original two-factor theory, which was readily falsifiable (and indeed was falsified).[31]
Notes
[1] Kamphaus et al. 2005
[2] Deary et al. 2010
[3] Jensen 1998, 545
[4] Neisser et al. 1996
[5] Adapted from Jensen 1998, 24. The correlation matrix was originally published in Spearman 1904, and it is based on the school performance of a sample of English children. While this analysis is historically important and has been highly influential, it does not meet modern technical standards. See Mackintosh 2011, 44ff. and Horn & McArdle 2007 for discussion of Spearman's methods.
[6] Adapted from Chabris 2007, Table 19.1.
[7] Gottfredson 1998
[8] Deary 2001, 12
[9] Spearman 1904
[10] Deary 2000, 6
[11] Jensen 1992
[12] Jensen 1998, 28
[13] van der Maas et al. 2006
[14] Jensen 1998, 26, 36–39
[15] Jensen 1998, 26, 36–39, 89–90
[16] Jensen 2002
[17] Floyd et al. 2009
[18] Jensen 1980, 213
[19] Jensen 1992
[20] Jensen 1980, 213
[21] Jensen 1998, 94
[22] Hunt 2011, 94
[23] Jensen 1998, 18–19, 35–36, 38. The idea of a general, unitary mental ability was introduced to psychology by Herbert Spencer and Francis Galton in the latter half of the 19th century, but their work was largely speculative, with little empirical basis.
[24] Jensen 2002
[25] Jensen 1998, 91–92, 95
[26] Jensen 2000
[27] Mackintosh 2011, 157
[28] Jensen 1998, 117
[29] Bartholomew et al. 2009
[30] Jensen 1998, 120
[31] Horn & McArdle 2007
[32] Jensen 1998, 120–121
[33] Mackintosh 2011, 157–158
[34] Rushton & Jensen 2010
[35] Mackintosh 2011, 44–45
[36] Jensen 1998, 18, 31–32
[37] Carroll 1995
[38] Jensen 1982
[39] Jensen 1998, 73
[40] Deary 2012
[41] Mackintosh 2011, 57
[42] Jensen 1998, 46
[43] Carroll 1997. The total common factor variance consists of the variance due to the g factor and the group factors considered together. The variance not accounted for by the common factors, referred to as uniqueness, comprises subtest-specific variance and measurement error.
[44] Davidson & Kemp 2011
[45] Mackintosh 2011, 151
[46] Jensen 1998, 31
[47] Mackintosh 2011, 151–153
[48] McGrew 2005
[49] Kvist & Gustafsson 2008
[50] Hunt 2011, 94
[51] Johnson et al. 2004
[52] Johnson et al. 2008
[53] Mackintosh 2011, 150–153. See also Keith et al. 2001, where the g factors from the CAS and WJ III test batteries were found to be statistically indistinguishable.
[54] Jensen 1998, 88, 101–103
[55] Spearman 1927
[56] Detterman & Daniel 1989
[57] Deary & Pagliari 1991
[58] Deary et al. 1996
[59] Tucker-Drob 2009
[60] Jensen 1998, 270
[61] Gottfredson 2002
[62] Sackett et al. 2008
[63] Jensen 1998, 272, 301
[64] Jensen 1998, 270
[65] Jensen 1998, 279–280
[66] Jensen 1998, 279
[67] Brody 2006
[68] Frey & Detterman 2003
[69] Schmidt & Hunter 2004
[70] Jensen 1998, 292–293
[71] Schmidt & Hunter 2004. These validity coefficients have been corrected for measurement error in the dependent variable (i.e., job or training performance) and for range restriction but not for measurement error in the independent variable (i.e., measures of g).
[72] Jensen 1998, 270
[73] Jensen 1998, 568
[74] Jensen 1998, 271
[75] Gottfredson 2007
[76] Deary et al. 2006
[77] Plomin & Spinath 2004
[78] Haworth et al. 2010
[79] Kovas & Plomin 2006
[80] Penke et al. 2007
[81] Chabris et al. 2012
[82] Plomin 2003
[83] Jensen 1998, 189–197
[84] Mackintosh 2011, 134–138
[85] Chabris 2007
[86] Jensen 1998, 146, 149–150
[87] Jensen 1998, 164–165
[88] Jensen 1998, 87–88
[89] Mackintosh 2011, 360–373
[90] Roth et al. 2001
[91] Hunt 2011, 421
[92] Jensen 1998, 369–399
[93] Lynn 2003
[94] Jensen 1998, 213
[95] Ackerman et al. 2005
[96] Mackintosh 2011, 158
[97] Weinberg 1989
[98] Lautrey 2002
[99] Humphreys et al. 1985
[100] Weinberg 1989
[101] von Stumm et al. 2011
[102] Jensen 1998, 573
[103] Judge et al. 2007
[104] von Stumm et al. 2009
[105] Jensen 1998, 577
[106] Eysenck 1995
[107] Lubinski 2009
[108] Robertson et al. 2010
[109] Jensen 1998, 122–123
[110] Sternberg et al. 1981
[111] Jensen 1998, 123
[112] Jensen 1998, 124
[113] McGrew 2005
[114] Jensen 1998, 125
[115] Mackintosh 2011, 152–153
[116] Jensen 1998, 77–78, 115–117
[117] Mackintosh 2011, 52, 239
[118] Jensen 1998, 128–132
[119] Deary 2001, 15–16
[120] Mackintosh 2011, 236–237
[121] Hunt 2011, 120–130
[122] Mackintosh 2011, 223–235
[123] Gould 1996, 56–57
[124] Korb 1994
References
Ackerman, P. L., Beier, M. E., & Boyle, M. O. (2005). Working memory and intelligence: The same or different constructs? Psychological Bulletin, 131, 30–60.
Bartholomew, D.J., Deary, I.J., & Lawn, M. (2009). A New Lease of Life for Thomson's Bonds Model of Intelligence. (http://www.psy.ed.ac.uk/people/iand/Bartholomew (2009) Psych Review thomson intelligence.pdf) Psychological Review, 116, 567–579.
Brody, N. (2006). Geocentric theory: A valid alternative to Gardner's theory of intelligence. In Schaler, J. A. (Ed.), Howard Gardner under fire: The rebel psychologist faces his critics. Chicago: Open Court.
Carroll, J.B. (1995). Reflections on Stephen Jay Gould's The Mismeasure of Man (1981): A Retrospective Review. (http://www.psych.utoronto.ca/users/reingold/courses/intelligence/cache/carroll-gould.html) Intelligence, 21, 121–134.
Carroll, J.B. (1997). Psychometrics, Intelligence, and Public Perception. (http://www.iapsych.com/wj3ewok/LinkedDocuments/carroll1997.pdf) Intelligence, 24, 25–52.
Chabris, C.F. (2007). Cognitive and Neurobiological Mechanisms of the Law of General Intelligence. (http://www.wjh.harvard.edu/~cfc/Chabris2007a.pdf) In Roberts, M. J. (Ed.), Integrating the mind: Domain general versus domain specific processes in higher cognition. Hove, UK: Psychology Press.
Chabris, C.F., Hebert, B.M., Benjamin, D.J., Beauchamp, J.P., Cesarini, D., van der Loos, M.J.H.M., Johannesson, M., Magnusson, P.K.E., Lichtenstein, P., Atwood, C.S., Freese, J., Hauser, T.S., Hauser, R.M., Christakis, N.A., & Laibson, D. (2012). "Most Reported Genetic Associations with General Intelligence Are Probably False Positives" (http://coglab.wjh.harvard.edu/~cfc/Chabris2012a-FalsePositivesGenesIQ.pdf). Psychological Science, 23(11), 1314–1323.
Davidson, J.E. & Kemp, I.A. (2011). Contemporary models of intelligence. In R.J. Sternberg & S.B. Kaufman (Eds.), The Cambridge Handbook of Intelligence. New York, NY: Cambridge University Press.
Deary, I.J. (2012). Intelligence. Annual Review of Psychology, 63, 453–482.
Deary, I.J. (2001). Intelligence. A Very Short Introduction. Oxford: Oxford University Press.
Deary, I.J. (2000). Looking Down on Human Intelligence: From Psychometrics to the Brain. Oxford, England: Oxford University Press.
Deary, I.J., & Pagliari, C. (1991). The strength of g at different levels of ability: Have Detterman and Daniel rediscovered Spearman's law of diminishing returns? Intelligence, 15, 247–250.
Deary, I.J., Egan, V., Gibson, G.J., Brand, C.R., Austin, E., & Kellaghan, T. (1996). Intelligence and the differentiation hypothesis. Intelligence, 23, 105–132.
Deary, I.J., Spinath, F.M. & Bates, T.C. (2006). Genetics of intelligence. European Journal of Human Genetics, 14, 690–700.
Deary, I.J., Penke, L., & Johnson, W. (2010). The neuroscience of human intelligence differences (http://www.larspenke.eu/pdfs/Deary_Penke_Johnson_2010_-_Neuroscience_of_intelligence_review.pdf). Nature Reviews Neuroscience, 11, 201–211.
Detterman, D.K., & Daniel, M.H. (1989). Correlations of mental tests with each other and with cognitive variables are highest for low-IQ groups. Intelligence, 13, 349–359.
Eysenck, H.J. (1995). Creativity as a product of intelligence and personality. In Saklofske, D.H. & Zeidner, M. (Eds.), International Handbook of Personality and Intelligence (pp. 231–247). New York, NY, US: Plenum Press.
Floyd, R. G., Shands, E. I., Rafael, F. A., Bergeron, R., & McGrew, K. S. (2009). The dependability of general-factor loadings: The effects of factor-extraction methods, test battery composition, test battery size, and their interactions. (http://www.iapsych.com/kmpubs/floyd2009b.pdf) Intelligence, 37, 453–465.
Frey, M. C. & Detterman, D. K. (2003). "Scholastic Assessment or g? The Relationship Between the Scholastic Assessment Test and General Cognitive Ability" (http://www.psychologicalscience.org/pdf/ps/frey.pdf). Psychological Science, 15(6), 373–378. doi:10.1111/j.0956-7976.2004.00687.x. PMID 15147489.
Gottfredson, L. S. (1998, Winter). The general intelligence factor. Scientific American Presents, 9(4), 24–29.
Gottfredson, L. S. (2002). g: Highly general and highly practical. Pp. 331–380 in R. J. Sternberg & E. L. Grigorenko (Eds.), The general factor of intelligence: How general is it? Mahwah, NJ: Erlbaum.
Gottfredson, L.S. (2007). Innovation, fatal accidents, and the evolution of general intelligence. (http://www.udel.edu/educ/gottfredson/reprints/2007evolutionofintelligence.pdf) In M. J. Roberts (Ed.), Integrating the mind: Domain general versus domain specific processes in higher cognition (pp. 387–425). Hove, UK: Psychology Press.
Gottfredson, L.S. (2011). Intelligence and social inequality: Why the biological link? (http://www.udel.edu/educ/gottfredson/reprints/2011SocialInequality.pdf) Pp. 538–575 in T. Chamorro-Premuzic, A. Furnham, & S. von Stumm (Eds.), Handbook of Individual Differences. Wiley-Blackwell.
Gould, S.J. (1996, Revised Edition). The Mismeasure of Man. New York: W. W. Norton & Company.
Haworth, C.M.A. et al. (2010). The heritability of general cognitive ability increases linearly from childhood to young adulthood. Molecular Psychiatry, 15, 1112–1120.
Horn, J. L. & McArdle, J.J. (2007). Understanding human intelligence since Spearman. In R. Cudeck & R. MacCallum (Eds.), Factor Analysis at 100 Years (pp. 205–247). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Humphreys, L.G., Rich, S.A. & Davey, T.C. (1985). A Piagetian Test of General Intelligence. Developmental Psychology, 21, 872–877.
Hunt, E.B. (2011). Human Intelligence. Cambridge, UK: Cambridge University Press.
Jensen, A.R. (1980). Bias in Mental Testing. New York: The Free Press.
Jensen, A.R. (1982). The Debunking of Scientific Fossils and Straw Persons. (http://www.debunker.com/texts/jensen.html) Contemporary Education Review, 1, 121–135.
Jensen, A.R. (1992). Understanding g in terms of information processing. Educational Psychology Review, 4, 271–308.
Jensen, A.R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger. ISBN 0-275-96103-6.
Jensen, A.R. (2000). A Nihilistic Philosophy of Science for a Scientific Psychology? (http://www.cogsci.ecs.soton.ac.uk/cgi/psyc/newpsy?11.088) Psycoloquy, 11, Issue 088, Article 49.
Jensen, A.R. (2002). Psychometric g: Definition and substantiation. In R.J. Sternberg & E.L. Grigorenko (Eds.), The general factor of intelligence: How general is it? (pp. 39–54). Mahwah, NJ: Erlbaum.
Johnson, W., Bouchard, T.J., Krueger, R.F., McGue, M. & Gottesman, I.I. (2004). Just one g: Consistent results from three test batteries. Intelligence, 32, 95–107.
Johnson, W., te Nijenhuis, J. & Bouchard, T.J., Jr. (2008). Still just 1 g: Consistent results from five test batteries. Intelligence, 36, 81–95.
Judge, T. A., Jackson, C. L., Shaw, J. C., Scott, B. A., & Rich, B. L. (2007). Self-efficacy and work-related performance: The integral role of individual differences. Journal of Applied Psychology, 92, 107–127.
Kamphaus, R.W., Winsor, A.P., Rowe, E.W., & Kim, S. (2005). A history of intelligence test interpretation. In D.P. Flanagan & P.L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (2nd ed.) (pp. 23–38). New York: Guilford.
Kane, M. J., Hambrick, D. Z., & Conway, A. R. A. (2005). Working memory capacity and fluid intelligence are strongly related constructs: Comment on Ackerman, Beier, and Boyle (2004). Psychological Bulletin, 131, 66–71.
Keith, T.Z., Kranzler, J.H., & Flanagan, D.P. (2001). What does the Cognitive Assessment System (CAS) measure? Joint confirmatory factor analysis of the CAS and the Woodcock-Johnson Tests of Cognitive Ability (3rd edition). School Psychology Review, 30, 89–119.
Korb, K. B. (1994). Stephen Jay Gould on intelligence. Cognition, 52, 111–123.
Kovas, Y. & Plomin, R. (2006). Generalist genes: implications for the cognitive sciences. Trends in Cognitive Sciences, 10, 198–203.
Kvist, A. & Gustafsson, J.-E. (2008). The relation between fluid intelligence and the general factor as a function of cultural background: A test of Cattell's Investment theory. Intelligence, 36, 422–436.
Lautrey, J. (2002). Is there a general factor of cognitive development? In Sternberg, R.J. & Grigorenko, E.L. (Eds.), The general factor of intelligence: How general is it? Mahwah, NJ: Erlbaum.
Lubinski, D. (2009). Exceptional Cognitive Ability: The Phenotype. Behavior Genetics, 39, 350–358. doi:10.1007/s10519-009-9273-0.
Lynn, R. (2003). The Geography of Intelligence. In Nyborg, H. (Ed.), The Scientific Study of General Intelligence: Tribute to Arthur R. Jensen (pp. 126–146). Oxford: Pergamon.
Mackintosh, N.J. (2011). IQ and Human Intelligence. Oxford, UK: Oxford University Press.
McGrew, K.S. (2005). The Cattell-Horn-Carroll Theory of Cognitive Abilities: Past, Present, and Future. In Flanagan, D.P. & Harrison, P.L. (Eds.), Contemporary Intellectual Assessment: Theories, Tests, and Issues (pp. 136–181). New York, NY, US: Guilford Press.
Neisser, U., Boodoo, G., Bouchard, T.J., Jr., Boykin, A.W., Brody, N., Ceci, S.J., Halpern, D.F., Loehlin, J.C. & Perloff, R. (1996). "Intelligence: Knowns and Unknowns". American Psychologist, 51, 77–101.
Oberauer, K., Schulze, R., Wilhelm, O., & Süß, H.-M. (2005). Working memory and intelligence – their correlation and their relation: A comment on Ackerman, Beier, and Boyle (2005). Psychological Bulletin, 131, 61–65.
Penke, L., Denissen, J.J.A., & Miller, G.F. (2007). The Evolutionary Genetics of Personality (http://matthewckeller.com/Penke_EvoGenPersonality_2007.pdf). European Journal of Personality, 21, 549–587.
Plomin, R. (2003). Genetics, genes, genomics and g. Molecular Psychiatry, 8, 1–5.
Plomin, R. & Spinath, F.M. (2004). Intelligence: genetics, genes, and genomics. Journal of Personality and Social Psychology, 86, 112–129.
Robertson, K.F., Smeets, S., Lubinski, D., & Benbow, C.P. (2010). Beyond the Threshold Hypothesis: Even Among the Gifted and Top Math/Science Graduate Students, Cognitive Abilities, Vocational Interests, and Lifestyle Preferences Matter for Career Choice, Performance, and Persistence. Current Directions in Psychological Science, 19, 346–351.
Roth, P.L., Bevier, C.A., Bobko, P., Switzer, F.S., III, & Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330.
Rushton, J.P. & Jensen, A.R. (2010). The rise and fall of the Flynn Effect as a reason to expect a narrowing of the Black–White IQ gap. Intelligence, 38, 213–219. doi:10.1016/j.intell.2009.12.002.
Sackett, P.R., Borneman, M.J., & Connelly, B.S. (2008). High-Stakes Testing in Higher Education and Employment: Appraising the Evidence for Validity and Fairness. American Psychologist, 63, 215–227.
Schmidt, F.L. & Hunter, J. (2004). General Mental Ability in the World of Work: Occupational Attainment and Job Performance (http://www.unc.edu/~nielsen/soci708/cdocs/Schmidt_Hunter_2004.pdf). Journal of Personality and Social Psychology, 86, 162–173.
Spearman, C.E. (1904). "'General intelligence', Objectively Determined And Measured" (http://www.psych.umn.edu/faculty/waller/classes/FA2010/Readings/Spearman1904.pdf). American Journal of Psychology, 15, 201–293.
Spearman, C.E. (1927). The Abilities of Man. London: Macmillan.
Sternberg, R. J., Conway, B. E., Ketron, J. L. & Bernstein, M. (1981). People's conceptions of intelligence. Journal of Personality and Social Psychology, 41, 37–55.
von Stumm, S., Chamorro-Premuzic, T., Quiroga, M.Á., & Colom, R. (2009). Separating narrow and general variances in intelligence-personality associations. Personality and Individual Differences, 47, 336–341.
von Stumm, S., Chamorro-Premuzic, T., & Ackerman, P. L. (2011). Re-visiting intelligence-personality associations: Vindicating intellectual investment. In T. Chamorro-Premuzic, S. von Stumm, & A. Furnham (Eds.), Handbook of Individual Differences. Chichester, UK: Wiley-Blackwell.
Tucker-Drob, E.M. (2009). Differentiation of cognitive abilities across the life span. Developmental Psychology, 45, 1097–1118.
van der Maas, H. L. J., Dolan, C. V., Grasman, R. P. P. P., Wicherts, J. M., Huizenga, H. M., & Raaijmakers, M. E. J. (2006). A dynamical model of general intelligence: The positive manifold of intelligence by mutualism. (http://wicherts.socsci.uva.nl/maas2006.pdf) Psychological Review, 113, 842–860.
Weinberg, R.A. (1989). Intelligence and IQ: Landmark Issues and Great Debates. American Psychologist, 44, 98–104.
External links
The General Intelligence Factor by Linda S. Gottfredson (http://www.udel.edu/educ/gottfredson/reprints/ 1998generalintelligencefactor.pdf)
Francis Galton
Sir Francis Galton
Born: 16 February 1822, Birmingham, England
Died: 17 January 1911 (aged 88), Haslemere, Surrey, England
Residence: England
Nationality: English
Fields: Anthropology and polymathy
Institutions: Meteorological Council; Royal Geographical Society
Alma mater: King's College London; Cambridge University
Academic advisors: William Hopkins
Notable students: Karl Pearson
Known for: Eugenics; the Galton board; regression toward the mean; standard deviation; the weather map
Notable awards: Linnean Society of London's Darwin–Wallace Medal (1908); Copley Medal (1910)
Sir Francis Galton, FRS (16 February 1822 – 17 January 1911), cousin of Douglas Strutt Galton and half-cousin of Charles Darwin, was an English Victorian polymath: anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, psychometrician, and statistician. He was knighted in 1909.

Galton produced over 340 papers and books. He also created the statistical concept of correlation and widely promoted regression toward the mean. He was the first to apply statistical methods to the study of human differences and the inheritance of intelligence, and introduced the use of questionnaires and surveys for collecting data on human
communities, which he needed for genealogical and biographical works and for his anthropometric studies. He was a pioneer in eugenics, coining the term itself and the phrase "nature versus nurture". His book Hereditary Genius (1869) was the first social scientific attempt to study genius and greatness.[1]

As an investigator of the human mind, he founded psychometrics (the science of measuring mental faculties) and differential psychology, and proposed the lexical hypothesis of personality. He devised a method for classifying fingerprints that proved useful in forensic science. He also conducted research on the power of prayer, concluding that it had no effect, based on the null effects of prayer on the longevity of those prayed for.[2] As the initiator of scientific meteorology, he devised the first weather map, proposed a theory of anticyclones, and was the first to establish a complete record of short-term climatic phenomena on a European scale.[3] He also invented the Galton whistle for testing differential hearing ability.[4]
Biography
Early life
Galton was born at "The Larches", a large house in the Sparkbrook area of Birmingham, England, built on the site of "Fair Hill", the former home of Joseph Priestley, which the botanist William Withering had renamed. He was Charles Darwin's half-cousin, sharing the common grandparent Erasmus Darwin. His father was Samuel Tertius Galton, son of Samuel "John" Galton. The Galtons were famous and highly successful Quaker gun-manufacturers and bankers, while the Darwins were distinguished in medicine and science.

Both families boasted Fellows of the Royal Society and members who loved to invent in their spare time. Both Erasmus Darwin and Samuel Galton were founding members of the famous Lunar Society of Birmingham, whose members included Boulton, Watt, Wedgwood, Priestley, Edgeworth, and other distinguished scientists and industrialists. Likewise, both families were known for their literary talent: Erasmus Darwin composed lengthy technical treatises in verse, while Galton's aunt Mary Anne Galton wrote on aesthetics and religion, and her notable autobiography detailed the unique environment of her childhood, populated as it was by Lunar Society members.

Galton was by many accounts a child prodigy: he was reading by the age of 2, at age 5 he knew some Greek, Latin, and long division, and by the age of six he had moved on to adult books, including Shakespeare for pleasure, and poetry, which he quoted at length (Bulmer 2003, p. 4). Later in life, Galton would propose a connection between genius and insanity based on his own experience. He stated, "Men who leave their mark on the world are very often those who, being gifted and full of nervous power, are at the same time haunted and driven by a dominant idea, and are therefore within a measurable distance of insanity."[5]

Galton attended King Edward's School, Birmingham, but chafed at the narrow classical curriculum and left at 16.[6] His parents pressed him to enter the medical profession, and he studied for two years at Birmingham General Hospital and King's College London Medical School. He followed this up with mathematical studies at Trinity College, University of Cambridge, from 1840 to early 1844.[7]
According to the records of the United Grand Lodge of England, it was in February 1844 that Galton became a freemason at the so-called Scientific lodge, held at the Red Lion Inn in Cambridge, progressing through the three masonic degrees as follows: Apprentice, 5 Feb 1844; Fellow Craft, 11 March 1844; Master Mason, 13 May 1844. A curious note in the record states: "Francis Galton Trinity College student, gained his certificate 13 March 1845".[8]
One of Galton's masonic certificates from Scientific lodge can be found among his papers at University College, London.[9]

A severe nervous breakdown altered Galton's original intention to try for honours. He elected instead to take a "poll" (pass) B.A. degree, like his half-cousin Charles Darwin (Bulmer 2003, p. 5). (Following the Cambridge custom, he was awarded an M.A. without further study, in 1847.) He then briefly resumed his medical studies. The death of his father in 1844 had left him financially independent but emotionally destitute,[10] and he terminated his medical studies entirely, turning to foreign travel, sport, and technical invention.

In his early years Galton was an enthusiastic traveller, and made a notable solo trip through Eastern Europe to Constantinople before going up to Cambridge. In 1845 and 1846 he went to Egypt and travelled down the Nile to Khartoum in the Sudan, and from there to Beirut, Damascus, and down the Jordan. In 1850 he joined the Royal Geographical Society, and over the next two years mounted a long and difficult expedition into then little-known South West Africa (now Namibia). He wrote a successful book on his experience, Narrative of an Explorer in Tropical South Africa. He was awarded the Royal Geographical Society's gold medal in 1853 and the Silver Medal of the French Geographical Society for his pioneering cartographic survey of the region (Bulmer 2003, p. 16). This established his reputation as a geographer and explorer. He proceeded to write the best-selling The Art of Travel, a handbook of practical advice for the Victorian on the move, which went through many editions and is still in print.

In January 1853 Galton met Louisa Jane Butler (1822–1897) at his neighbour's home and they were married on 1 August 1853. The union of 43 years proved childless.[11][12]
Middle years
Galton was a polymath who made important contributions in many fields of science, including meteorology (the anti-cyclone and the first popular weather maps), statistics (regression and correlation), psychology (synaesthesia), biology (the nature and mechanism of heredity), and criminology (fingerprints). Much of this was influenced by his penchant for counting or measuring. Galton prepared the first weather map published in The Times (1 April 1875, showing the weather from the previous day, 31 March), now a standard feature in newspapers worldwide.[13]

He became very active in the British Association for the Advancement of Science, presenting many papers on a wide variety of topics at its meetings from 1858 to 1899 (Bulmer 2010, p. 29). He was the general secretary from 1863 to 1867, president of the Geographical section in 1867 and 1872, and president of the Anthropological Section in 1877 and 1885. He was active on the council of the Royal Geographical Society for over forty years, in various committees of the Royal Society, and on the Meteorological Council.

James McKeen Cattell, a student of Wilhelm Wundt who had been reading Galton's articles, decided he wanted to study under him. He eventually built a professional relationship with Galton, measuring subjects and working together on research.[14] In 1888, Galton established a lab in the science galleries of the South Kensington Museum. In Galton's lab, participants could be measured in order to gain knowledge of their strengths and weaknesses. Galton also used these data for his own research. He would typically charge people a small fee for his services.[15]
During this time, Galton wrote a controversial letter to The Times titled 'Africa for the Chinese', where he argued that the Chinese, as a race capable of high civilization and (in his opinion) only temporarily stunted by the recent failures of Chinese dynasties, should be encouraged to immigrate to Africa and displace the supposedly inferior aboriginal blacks.[16]
Galton was interested at first in the question of whether human ability was hereditary, and proposed to count the number of the relatives, of various degrees, of eminent men. If the qualities were hereditary, he reasoned, there should be more eminent men among the relatives than among the general population. To test this, he invented the methods of historiometry. Galton obtained extensive data from a broad range of biographical sources, which he tabulated and compared in various ways. This pioneering work was described in detail in his book Hereditary Genius in 1869.[1] Here he showed, among other things, that the numbers of eminent relatives dropped off when going from first-degree to second-degree relatives, and from second-degree to third. He took this as evidence of the inheritance of abilities.

Galton recognized the limitations of his methods in these two works, and believed the question could be better studied by comparisons of twins. His method envisaged testing to see if twins who were similar at birth diverged in dissimilar environments, and whether twins dissimilar at birth converged when reared in similar environments. He again used the method of questionnaires to gather various sorts of data, which were tabulated and described in a paper, "The history of twins", in 1875. In so doing he anticipated the modern field of behavior genetics, which relies heavily on twin studies. He concluded that the evidence favored nature rather than nurture. He also proposed adoption studies, including trans-racial adoption studies, to separate the effects of heredity and environment.

Galton recognised that cultural circumstances influenced the capability of a civilization's citizens, and their reproductive success. In Hereditary Genius, he envisaged a situation conducive to resilient and enduring civilisation as follows:

"The best form of civilization in respect to the improvement of the race, would be one in which society was not costly; where incomes were chiefly derived from professional sources, and not much through inheritance; where every lad had a chance of showing his abilities, and, if highly gifted, was enabled to achieve a first-class education and entrance into professional life, by the liberal help of the exhibitions and scholarships which he had gained in his early youth; where marriage was held in as high honour as in ancient Jewish times; where the pride of race was encouraged (of course I do not refer to the nonsensical sentiment of the present day, that goes under that name); where the weak could find a welcome and a refuge in celibate monasteries or sisterhoods, and lastly, where the better sort of emigrants and refugees from other lands were invited and welcomed, and their descendants naturalized." (p. 362)[1]

Galton coined the term eugenics in 1883 and set down many of his observations and conclusions in a book, Inquiries into Human Faculty and Its Development.[18] He believed that a scheme of 'marks' for family merit should be defined, and early marriage between families of high rank be encouraged by provision of monetary incentives. He pointed out some of the tendencies in British society, such as the late marriages of eminent people and the paucity of their children, which he thought were dysgenic. He advocated encouraging eugenic marriages by supplying able couples with incentives to have children. On October 29, 1901, Galton chose to address eugenic issues when he delivered the second Huxley lecture at the Royal Anthropological Institute.[14]

The Eugenics Review, the journal of the Eugenics Education Society, commenced publication in 1909. Galton, the Honorary President of the society, wrote the foreword for the first volume.[14] The First International Congress of Eugenics was held in July 1912, after Galton's death in January 1911. Winston Churchill and Charles Eliot were among the attendees.[14]
Correlation originated in the study of correspondence as described in the study of morphology (see E.S. Russell, Form and Function). Galton was not the first to describe the mathematical relationship represented by the correlation coefficient, but he rediscovered this relationship and demonstrated its application in the study of heredity, anthropology, and psychology.[21] Galton's later statistical study of the probability of extinction of surnames led to the concept of Galton–Watson stochastic processes (Bulmer 2003, pp. 182–184). These ideas are now core parts of modern statistics and regression analysis. Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas. He is responsible for the choice of r (for reversion or regression) to represent the correlation coefficient.[21] In the 1870s and 1880s he was a pioneer in the use of the normal distribution to fit histograms of actual tabulated data.

Theories of perception

Galton went beyond measurement and summary to attempt to explain the phenomena he observed. Among such developments, he proposed an early theory of ranges of sound and hearing, and collected large quantities of anthropometric data from the public through his popular and long-running Anthropometric Laboratory, which he established in 1884 and where he studied over 9,000 people.[14] It was not until 1985 that these data were analyzed in their entirety.

Differential psychology

Galton's study of human abilities ultimately led to the foundation of differential psychology and the formulation of the first mental tests. He was interested in measuring humans in every way possible. This included measuring their ability to make sensory discriminations, which he assumed was linked to intellectual prowess. Galton suggested that "individual differences in general ability are reflected in performance on relatively simple sensory capacities and in speed of reaction to a stimulus, variables that could be objectively measured by tests of sensory discrimination and reaction time" (Jensen, Arthur R. (April 2002). "Galton's legacy to research on intelligence" [27]. Journal of Biosocial Science, 34(2), 145–172). He also measured how quickly people reacted, which he linked to internal "wiring" that he believed ultimately limited intellectual ability. Throughout his research Galton assumed that people who reacted faster were more intelligent than others.

Composite photography

Galton also devised a technique called "composite portraiture", produced by superimposing multiple photographic portraits of individuals' faces, registered on their eyes, onto the same photographic plate (see averageness). The blended whole, or composite, was intended to generalize the facial appearance of his subjects into an average or central type, and he hoped it might characterize supposed "natural kinds" of faces, such as Jewish men, criminals, or patients with tuberculosis.[4][28] Others, including Sigmund Freud in his work on dreams, picked up Galton's suggestion that these composites might represent a useful metaphor for an ideal type or a concept of a "natural kind" (see Eleanor Rosch). In the 1990s, a hundred years after his discovery, much psychological research examined the attractiveness of these composite faces, an aspect that Galton had remarked on in his original lecture. (See also the entry Modern physiognomy under Physiognomy.)
This work began in the 1880s while the Jewish scholar Joseph Jacobs studied anthropology and statistics with Francis Galton. Jacobs asked Galton to create a composite photograph of a Jewish type.[29] One of Jacobs' first publications that used Galton's composite imagery was "The Jewish Type, and Galton's Composite Photographs", Photographic News, 29 (April 24, 1885): 268–269. Galton hoped his technique would aid medical diagnosis, and even criminology through the identification of typical criminal faces. However, the technique did not prove useful and fell into disuse, despite much further work on it, including by the photographers Lewis Hine, John L. Lovell and Arthur Batut.
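Galton produced these composites optically, by exposing several registered portraits in succession onto one plate. A rough digital analogue of the idea is plain pixel averaging of pre-aligned grayscale images; the following sketch is illustrative only (the file names are placeholders, and real composite work would first register the faces on the eyes, as Galton did):

```python
import numpy as np
from PIL import Image

def composite_portrait(paths, size=(256, 256)):
    """Average several pre-aligned grayscale portraits into one composite.

    Assumes the faces have already been registered (e.g., on the eyes),
    which Galton's optical superimposition also required.
    """
    acc = np.zeros((size[1], size[0]), dtype=np.float64)
    for path in paths:
        img = Image.open(path).convert("L").resize(size)
        acc += np.asarray(img, dtype=np.float64)
    acc /= len(paths)  # equal exposure per sitter, as in Galton's method
    return Image.fromarray(acc.astype(np.uint8))

# Hypothetical usage with placeholder file names:
# composite_portrait(["face1.png", "face2.png", "face3.png"]).save("composite.png")
```

Averaging smooths away idiosyncratic features while retaining what the faces share, which is also why such composites tend to look more "typical" (and, as the later attractiveness research noted above found, often more attractive) than any individual portrait.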
Fingerprints
In a Royal Institution paper in 1888 and three books (Finger Prints, 1892; Decipherment of Blurred Finger Prints, 1893; and Fingerprint Directories, 1895),[30] Galton estimated the probability of two persons having the same fingerprint and studied the heritability and racial differences in fingerprints. He wrote about the technique (inadvertently sparking a controversy between Herschel and Faulds that was to last until 1917), identifying common patterns in fingerprints and devising a classification system that survives to this day. The method of identifying criminals by their fingerprints had been introduced in the 1860s by Sir William James Herschel in India, and their potential use in forensic work was first proposed by Dr Henry Faulds in 1880, but Galton was the first to place the study on a scientific footing, which assisted its acceptance by the courts (Bulmer 2003, p. 35). Galton pointed out that there were specific types of fingerprint patterns. He described and classified them into eight broad categories: 1: plain arch, 2: tented arch, 3: simple loop, 4: central pocket loop, 5: double loop, 6: lateral pocket loop, 7: plain whorl, and 8: accidental.
Final years
In an effort to reach a wider audience, Galton worked on a novel entitled Kantsaywhere from May until December 1910. The novel described a utopia organized by a eugenic religion, designed to breed fitter and smarter humans. His unpublished notebooks show that this was an expansion of material he had been composing since at least 1901. He offered it to Methuen for publication, but they showed little enthusiasm. Galton wrote to his niece that it should be either smothered or superseded. His niece appears to have burnt most of the novel, offended by the love scenes, but large fragments survived.[31]
Galton was knighted in 1909. His statistical heir Karl Pearson, first holder of the Galton Chair of Eugenics at University College London (now the Galton Chair of Genetics), wrote a three-volume biography of Galton, in four parts, after his death (Pearson 1914, 1924, 1930). The eminent psychometrician Lewis Terman estimated that Galton's childhood IQ was on the order of 200, based on the fact that he consistently performed mentally at roughly twice his chronological age (Forrest 1974). (This follows the original definition of IQ as mental age divided by chronological age, rather than the modern definition based on the standard distribution and standard deviation.) The flowering plant genus Galtonia was named in his honour.
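Terman's figure follows directly from the ratio definition just mentioned; as a worked illustration (the ages here are hypothetical, not data about Galton):

```latex
\mathrm{IQ}_{\text{ratio}} = 100 \times \frac{\text{mental age}}{\text{chronological age}},
\qquad \text{e.g.}\quad 100 \times \frac{10}{5} = 200 .
```

A child performing like a typical ten-year-old at age five would thus score 200 on the ratio definition, which is the kind of reasoning behind Terman's retrospective estimate.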
Major Works
Galton, F. (1869). Hereditary Genius [33]. London: Macmillan.
Galton, F. (1883). Inquiries into Human Faculty and Its Development [34]. London: J.M. Dent & Company.
References
[1] Galton, F. (1869). Hereditary Genius (http://galton.org/books/hereditary-genius/). London: Macmillan.
[2] http://www.abelard.org/galton/galton.htm
[3] Francis Galton (1822–1911) from Eric Weisstein's World of Scientific Biography (http://scienceworld.wolfram.com/biography/Galton.html)
[4] Galton, Francis (1883). Inquiries into Human Faculty and Its Development (http://www.galton.org/books/human-faculty/index.html). London: J.M. Dent & Co.
[5] Pearson, K. (1914). The life, letters and labours of Francis Galton (4 vols.). Cambridge: Cambridge University Press.
[6] Oxford Dictionary of National Biography, accessed 31 January 2010
[8] 'Scientific Lodge No. 105 Cambridge' in Membership Records: Foreign and Country Lodges, Nos. 17-145, 1837-1862. London: Library and Museum of Freemasonry (manuscript)
[9] M. Merrington and J. Golden (1976). A List of the Papers and Correspondence of Sir Francis Galton (1822-1911) held in The Manuscripts Room, The Library, University College London. The Galton Laboratory, University College London (typescript), at Section 88 on p. 10
[10] [citation needed]
[11] Life of Francis Galton by Karl Pearson, Vol. 2: image 0320 (http://galton.org/cgi-bin/searchImages/search/pearson/vol2/pages/vol2_0320.htm)
[12] http://www.stanford.edu/group/auden/cgi-bin/auden/individual.php?pid=I7570&ged=auden-bicknell.ged
[13] http://www.galton.org/meteorologist.html
[14] Gillham, Nicholas Wright (2001). A Life of Sir Francis Galton: From African Exploration to the Birth of Eugenics. Oxford University Press. ISBN 0-19-514365-5.
[15] Hergenhahn, B.R. (2008). An Introduction to the History of Psychology. Colorado: Wadsworth Pub.
[16] http://galton.org/letters/africa-for-chinese/AfricaForTheChinese.htm
[17] Forrest, D.W. (1974). Francis Galton: the life and work of a Victorian genius. Elek, London. p. 84
[18] Inquiries into Human Faculty and Its Development by Francis Galton (http://galton.org/books/human-faculty/)
[19] Science Show 25/11/00: Sir Francis Galton (http://www.abc.net.au/rn/science/ss/stories/s216074.htm)
[20] http://darwin-online.org.uk/content/frameset?itemID=F1751&viewtype=side&pageseq=1
[21] Clauser, Brian E. (2007). The Life and Labors of Francis Galton: A Review of Four Recent Books About the Father of Behavioral Statistics. 32(4), pp. 440-444.
[22] http://www.sciencetimeline.net/1866.htm
[23] Galton, F., "Vox Populi (http://galton.org/essays/1900-1911/galton-1907-vox-populi.pdf)", Nature, March 7, 1907, accessed 2012-07-25
[24] "The Ballot Box (http://galton.org/cgi-bin/searchImages/galton/search/essays/pages/galton-1907-ballot-box_1.htm)", Nature, March 28, 1907, accessed 2012-07-25
[25] adamsmithlives.blogs.com posting (http://adamsmithlives.blogs.com/thoughts/2007/10/experts-and-inf.html)
[27] http://journals2.scholarsportal.info.myaccess.library.utoronto.ca/tmp/2802204478791895184.pdf
[28] Galton, F. (1878). Composite portraits (http://www.galton.org/essays/1870-1879/galton-1879-jaigi-composite-portraits.pdf). Journal of the Anthropological Institute of Great Britain and Ireland, 8, 132–142.
[29] Daniel Akiva Novak. Realism, Photography, and Nineteenth-Century Fiction (http://books.google.com/books?id=UeiMt7Yzb1MC&pg=PA100&lpg=PA100&dq=Francis+Galton+jewish+boys&source=bl&ots=Hj6o5LrTjj&sig=R4e5tBliXpezKQhnX2hgG1YGwjg&hl=en&ei=S-QBSo7oBpbisgOluOz8BQ&sa=X&oi=book_result&ct=result&resnum=1). Cambridge University Press, 2008. ISBN 0-521-88525-6
[30] Conklin, Barbara Gardner, Robert Gardner, and Dennis Shortelle. Encyclopedia of Forensic Science: A Compendium of Detective Fact and Fiction. Westport, Conn.: Oryx, 2002. Print.
[31] Life of Francis Galton by Karl Pearson, Vol. 3a: image 470 (http://www.mugu.com/browse/galton/search/pearson/vol3a/pages/vol3a_0470.htm)
[33] http://galton.org/books/hereditary-genius/
[34] http://www.galton.org/books/human-faculty/index.html
Further reading
Brookes, Martin (2004). Extreme Measures: The Dark Visions and Bright Ideas of Francis Galton. Bloomsbury.
Bulmer, Michael (2003). Francis Galton: Pioneer of Heredity and Biometry. Johns Hopkins University Press. ISBN 0-8018-7403-3
Cowan, Ruth Schwartz (1985, 1969). Sir Francis Galton and the Study of Heredity in the Nineteenth Century. Garland (1985). Originally Cowan's Ph.D. dissertation, Johns Hopkins University (1969).
Ewen, Stuart and Elizabeth Ewen (2006, 2008). "Nordic Nightmares," pp. 257–325 in Typecasting: On the Arts and Sciences of Human Inequality. Seven Stories Press. ISBN 978-1-58322-735-0
Forrest, D.W. (1974). Francis Galton: The Life and Work of a Victorian Genius. Taplinger. ISBN 0-8008-2682-5
Galton, Francis (1909). Memories of My Life (http://books.google.com/?id=MvAIAAAAIAAJ&pg=PA3&dq=Samuel+"John"+Galton). New York: E. P. Dutton and Company.
Gillham, Nicholas Wright (2001). A Life of Sir Francis Galton: From African Exploration to the Birth of Eugenics. Oxford University Press. ISBN 0-19-514365-5
Pearson, Karl (1914, 1924, 1930). "The life, letters and labours of Francis Galton (3 vols.)" (http://galton.org)
Daniëlle Posthuma, Eco J. C. De Geus, Wim F. C. Baaré, Hilleke E. Hulshoff Pol, René S. Kahn & Dorret I. Boomsma (2002). "The association between brain volume and intelligence is of genetic origin". Nature Neuroscience 5 (2): 83–84. doi: 10.1038/nn0202-83 (http://dx.doi.org/10.1038/nn0202-83). PMID 11818967 (http://www.ncbi.nlm.nih.gov/pubmed/11818967)
Quinche, Nicolas. Crime, Science et Identité. Anthologie des textes fondateurs de la criminalistique européenne (1860–1930). Genève: Slatkine, 2006, 368 p., passim.
Stigler, S. M. (2010). "Darwin, Galton and the Statistical Enlightenment". Journal of the Royal Statistical Society: Series A (Statistics in Society) 173 (3): 469–482. doi: 10.1111/j.1467-985X.2010.00643.x (http://dx.doi.org/10.1111/j.1467-985X.2010.00643.x).
External links
Galton's Complete Works (http://galton.org) at Galton.org (including all his published books, all his published scientific papers, and popular periodical and newspaper writing, as well as other previously unpublished work and biographical material).
Works by Francis Galton (http://www.gutenberg.org/author/Francis+Galton) at Project Gutenberg
The Galton Machine or Board demonstrating the normal distribution (http://www.youtube.com/watch?v=9xUBhhM4vbM)
Portraits of Galton (http://www.npg.org.uk/live/search/person.asp?LinkID=mp01715) from the National Portrait Gallery (United Kingdom)
The Galton laboratory homepage (http://www.gene.ucl.ac.uk/) (originally The Francis Galton Laboratory of National Eugenics) at University College London
O'Connor, John J.; Robertson, Edmund F., "Francis Galton" (http://www-history.mcs.st-andrews.ac.uk/Biographies/Gillham.html), MacTutor History of Mathematics archive, University of St Andrews.
Biography and bibliography (http://vlp.mpiwg-berlin.mpg.de/people/data?id=per78) in the Virtual Laboratory of the Max Planck Institute for the History of Science
History and Mathematics (http://urss.ru/cgi-bin/db.pl?cp=&page=Book&id=53184&lang=en&blang=en&list=Found)
Human Memory, University of Amsterdam (http://memory.uva.nl/testpanel/gc/en/): website with a test based on the work of Galton
An 8-foot-tall (2.4 m) Probability Machine (named Sir Francis Galton) comparing stock market returns to the randomness of the beans dropping through the quincunx pattern (http://www.youtube.com/watch?v=AUSKTk9ENzg) from Index Funds Advisors IFA.com (http://www.ifa.com)
Catalogue of the Galton papers held at UCL Archives (http://archives.ucl.ac.uk/DServe/dserve.exe?dsqServer=localhost&dsqIni=Dserve.ini&dsqApp=Archive&dsqCmd=Show.tcl&dsqDb=Catalog&dsqPos=2&dsqSearch=((text)='galton'))
"Composite Portraits", by Francis Galton, 1878 (as published in the Journal of the Anthropological Institute of Great Britain and Ireland, volume 8) (http://www.galton.org/essays/1870-1879/galton-1879-jaigi-composite-portraits.pdf)
"Enquiries into Human Faculty and its Development", book by Francis Galton, 1883 (http://www.galton.org/books/human-faculty/text/galton-1883-human-faculty-v4.pdf)
Group size measures

Mean crowding, i.e. the arithmetic mean of crowding measures averaged across individuals (this was called "typical group size" in Jarman's 1974 terminology);
Confidence interval for mean crowding.
[Figure: Colony size measures for rooks breeding in Normandy. The distribution of colonies (vertical axis above) and the distribution of individuals (vertical axis below) across the size classes of colonies (horizontal axis). The number of individuals is given in pairs. Data from Normandy, 1999-2000 (smoothed), Debout, 2003.]

Animal group size data tend to exhibit aggregated (right-skewed) distributions, i.e. most groups are small, a few are large, and a very few are very large. Note that average individuals live in colonies larger than the average colony size.
Statistical methods
Due to the aggregated (right-skewed) distribution of group members among groups, the application of parametric statistics would be misleading. Another problem arises when analyzing crowding values. Crowding data consist of non-independent values, or ties, which show multiple and simultaneous changes due to a single biological event (for example, all group members' crowding values change simultaneously whenever an individual joins or leaves). The paper by Reiczigel et al. (2008) discusses the statistical problems associated with group size measures (calculating confidence intervals, 2-sample tests, etc.) and offers a free statistical toolset (Flocker 1.1) to handle them in a user-friendly manner.
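For concreteness, the two measures and a distribution-free interval can be sketched in a few lines. This is only an illustration: the colony sizes are invented, and the percentile bootstrap shown here is a generic approach, not the specific method implemented in Flocker.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_group_size(sizes):
    # Mean over groups: each colony counts once.
    return float(np.mean(sizes))

def mean_crowding(sizes):
    # Mean over individuals: an individual in a group of size g experiences
    # crowding g, so the average is sum(g^2) / sum(g) ("typical group size").
    g = np.asarray(sizes, dtype=float)
    return float((g ** 2).sum() / g.sum())

def bootstrap_ci(sizes, stat=mean_crowding, n_boot=10_000, alpha=0.05):
    # Percentile bootstrap, resampling whole groups (the independent units),
    # which sidesteps the non-independence of individual crowding values.
    g = np.asarray(sizes)
    reps = [stat(rng.choice(g, size=g.size, replace=True)) for _ in range(n_boot)]
    return np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])

sizes = [1, 1, 2, 2, 3, 5, 8, 40]      # invented, right-skewed colony sizes
print(mean_group_size(sizes))           # 7.75
print(mean_crowding(sizes))             # about 27.5: the typical individual's colony
print(bootstrap_ci(sizes))              # wide, asymmetric interval reflecting the skew
```

The gap between 7.75 and roughly 27.5 is exactly the point made in the rook example above: the average individual lives in a far larger colony than the average colony size suggests.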
Literature
Debout G 2003. Le corbeau freux (Corvus frugilegus) nicheur en Normandie: recensement 1999 & 2000. Cormoran, 13, 115–121.
Jarman PJ 1974. The social organisation of antelope in relation to their ecology. Behaviour, 48, 215–268.
Reiczigel J, Lang Z, Rózsa L, Tóthmérész B 2008. Measures of sociality: two different views of group size. [1] Animal Behaviour, 75, 715–721.
External links
Flocker 1.1, a statistical toolset to analyze group size measures (with all the above-mentioned calculations available) [2]
Gallery
An Aphid colony
Flamingos
Gannet colony
Common Coots
Elephant seals
Vicuñas
Bottlenose dolphins
Sheep flock
References
[1] http://www.zoologia.hu/list/AnimBehav.pdf
[2] http://www.zoologia.hu/flocker/flocker.html
Guttman scale
In statistical surveys conducted by means of structured interviews or questionnaires, a subset of the survey items having binary (e.g., YES or NO) answers forms a Guttman scale (named after Louis Guttman) if they can be ranked in some order so that, for a rational respondent, the response pattern can be captured by a single index on that ordered scale. In other words, on a Guttman scale, items are arranged in an order such that an individual who agrees with a particular item also agrees with items of lower rank-order. For example, a series of items could be (1) "I am willing to be near ice cream"; (2) "I am willing to smell ice cream"; (3) "I am willing to eat ice cream"; and (4) "I love to eat ice cream". Agreement with any one item implies agreement with the lower-order items. This contrasts with topics studied using a Likert scale or a Thurstone scale.

The concept of the Guttman scale likewise applies to series of items in other kinds of tests, such as achievement tests, that have binary outcomes. For example, a test of math achievement might order questions by difficulty and instruct the examinee to begin in the middle. The assumption is that if the examinee can successfully answer items of that difficulty (e.g., summing two 3-digit numbers), s/he would be able to answer the earlier questions (e.g., summing two 2-digit numbers). Some achievement tests are organized in a Guttman scale to reduce the duration of the test.

By designing surveys and tests such that they contain Guttman scales, researchers can simplify the analysis of the outcome of surveys and increase the robustness. Guttman scales also make it possible to detect and discard randomized answer patterns, as may be given by uncooperative respondents.

A hypothetical, perfect Guttman scale consists of a unidimensional set of items that are ranked in order of difficulty from the least extreme to the most extreme position. For example, a person scoring a "7" on a ten-item Guttman scale will agree with items 1-7 and disagree with items 8, 9, and 10. An important property of Guttman's model is that a person's entire set of responses to all items can be predicted from their cumulative score because the model is deterministic.

A well-known example of a Guttman scale is the Bogardus Social Distance Scale. Another example is the original Beaufort wind force scale, assigning a single number to observed conditions of the sea surface ("Flat", ..., "Small waves", ..., "Sea heaps up and foam begins to streak", ...), which was in fact a Guttman scale. The observation "Flat = YES" implies "Small waves = NO".
Deterministic model
An important objective in Guttman scaling is to maximize the reproducibility of response patterns from a single score. A good Guttman scale should have a coefficient of reproducibility (the percentage of original responses that could be reproduced by knowing the scale scores used to summarize them) above .85. Other commonly used metrics for assessing the quality of a Guttman scale are Menzel's coefficient of scalability and the coefficient of homogeneity (Loevinger, 1948; Cliff, 1977; Krus and Blackman, 1988). To maximize unidimensionality, misfitting items are re-written or discarded.
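One common way to compute the coefficient of reproducibility is to compare each observed response pattern with the ideal Guttman pattern implied by that respondent's total score, counting each mismatch as an error. Conventions for counting errors vary; this sketch uses invented data and assumes the items are already ordered from least to most extreme:

```python
import numpy as np

def coefficient_of_reproducibility(responses):
    """CR = 1 - errors / (respondents x items).

    responses: binary matrix, one row per respondent, with items ordered
    from least extreme to most extreme. A respondent with total score s
    is predicted to endorse exactly the first s items.
    """
    X = np.asarray(responses)
    n, k = X.shape
    errors = 0
    for row in X:
        s = int(row.sum())
        ideal = np.array([1] * s + [0] * (k - s))  # perfect Guttman pattern
        errors += int(np.sum(row != ideal))
    return 1 - errors / (n * k)

X = [[1, 0, 0, 0],   # invented data: four scalable patterns and one deviant
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 1, 0],   # non-scalable pattern: two cells deviate from [1,1,0,0]
     [1, 1, 1, 1]]
print(coefficient_of_reproducibility(X))  # 0.9, just above the .85 rule of thumb
```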
Stochastic models
Guttman's deterministic model is brought within a probabilistic framework in item response theory models, and especially Rasch measurement. The Rasch model requires a probabilistic Guttman structure when items have dichotomous responses (e.g. right/wrong). In the Rasch model, the Guttman response pattern is the most probable response pattern for a person when items are ordered from least difficult to most difficult (Andrich, 1985). In addition, the Polytomous Rasch model is premised on a deterministic latent Guttman response subspace, and this is the basis for integer scoring in the model (Andrich, 1978, 2005). Analysis of data using item response theory requires comparatively longer instruments and larger datasets in order to scale item and person locations and to evaluate the fit of the data to the model. In practice, actual data from respondents do not closely match Guttman's deterministic model. Several probabilistic models of Guttman implicatory scales were developed by Krus (1977) and Krus and Bart (1974).
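The sense in which the Rasch model makes the Guttman pattern "most probable" is easy to verify numerically: with dichotomous items sorted by difficulty, the success probabilities for a given person decrease monotonically, so taking the more likely response on each item yields a Guttman pattern. A small illustrative sketch (the ability and difficulty values are invented):

```python
import numpy as np

def rasch_p(theta, b):
    # Dichotomous Rasch model: P(X=1) = exp(theta - b) / (1 + exp(theta - b)).
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta = 0.5                                # person ability (invented)
b = np.array([-2.0, -0.5, 1.0, 2.5])       # item difficulties, easiest first
p = rasch_p(theta, b)
print(np.round(p, 3))                      # [0.924 0.731 0.378 0.119], decreasing

# The Guttman pattern takes the more probable outcome on every item, and the
# items are independent given theta, so no other response pattern can have a
# higher probability for this person:
guttman = (p > 0.5).astype(int)            # -> [1 1 0 0]
prob = np.prod(np.where(guttman == 1, p, 1 - p))
print(guttman, round(float(prob), 3))      # [1 1 0 0] 0.37
```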
Applications
The Guttman scale is used mostly when researchers want to design short questionnaires with good discriminating ability. The Guttman model works best for constructs that are hierarchical and highly structured, such as social distance, organizational hierarchies, and evolutionary stages.
Unfolding models
A class of unidimensional models that contrast with Guttman's model are unfolding models. These models also assume unidimensionality, but posit that the probability of endorsing an item decreases with the distance between the item's standing on the unidimensional trait and the standing of the respondent. For example, an item like "I think immigration should be reduced" on a scale measuring attitude towards immigration would be unlikely to be endorsed both by those favoring open policies and by those favoring no immigration at all. Such an item might be endorsed by someone in the middle of the continuum. Some researchers feel that many attitude items fit this unfolding model, while most psychometric techniques are based on correlation or factor analysis and thus implicitly assume a linear relationship between the trait and the response probability. The effect of using these techniques would be to include only the most extreme items, leaving attitude instruments with little precision to measure the trait standing of individuals in the middle of the continuum.
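The difference between cumulative and unfolding response processes can be made concrete with a toy single-peaked response function. The Gaussian-shaped curve below is purely illustrative (actual unfolding IRT models, such as the hyperbolic cosine model or the generalized graded unfolding model, are more elaborate), and the scale positions are invented:

```python
import numpy as np

def unfolding_p(theta, delta, tau=1.0):
    # Toy single-peaked response function: endorsement is most likely when
    # the respondent's position theta matches the item's position delta,
    # and falls off with squared distance in either direction.
    return np.exp(-((theta - delta) ** 2) / (2 * tau ** 2))

delta = 0.0  # a moderate item, e.g. "immigration should be reduced somewhat"
for theta in [-3.0, -1.5, 0.0, 1.5, 3.0]:  # respondents across the continuum
    print(f"{theta:+.1f}  {unfolding_p(theta, delta):.3f}")
# -3.0 0.011, -1.5 0.325, 0.0 1.000, +1.5 0.325, +3.0 0.011:
# both extremes reject the moderate item, unlike a monotone cumulative curve.
```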
Example
Here is an example of a Guttman scale, the Bogardus Social Distance Scale:

(Least extreme)
1. Are you willing to permit immigrants to live in your country?
2. Are you willing to permit immigrants to live in your community?
3. Are you willing to permit immigrants to live in your neighbourhood?
4. Are you willing to permit immigrants to live next door to you?
5. Would you permit your child to marry an immigrant?
(Most extreme) E.g., agreement with item 3 implies agreement with items 1 and 2.
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 357–374.
Andrich, D. (2005). The Rasch model explained. In Sivakumar Alagumalai, David D. Curtis, and Njora Hungi (Eds.), Applied Rasch Measurement: A Book of Exemplars. Springer-Kluwer. Chapter 3, 308–328.
Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. Brandon-Tuma (Ed.), Sociological Methodology. San Francisco: Jossey-Bass. (Chapter 2, pp. 33–80.)
Cliff, N. (1977). A theory of consistency of ordering generalizable to tailored testing. Psychometrika, 42, 375–399.
Gordon, R. (1977). Unidimensional Scaling of Social Variables: Concepts and Procedures. New York: The Free Press.
Guttman, L. (1950). The basis for scalogram analysis. In Stouffer et al., Measurement and Prediction. The American Soldier Vol. IV. New York: Wiley.
Kenny, D.A., Rubin, D.C. (1977). Estimating chance reproducibility in Guttman scaling. Social Science Research, 6, 188–196.
Krus, D.J. (1977). Order analysis: an inferential model of dimensional analysis and scaling. Educational and Psychological Measurement, 37, 587–601. (Request reprint.) [1]
Krus, D.J., & Bart, W.M. (1974). An ordering theoretic method of multidimensional scaling of items. Educational and Psychological Measurement, 34, 525–535.
Krus, D.J., & Blackman, H.S. (1988). Test reliability and homogeneity from perspective of the ordinal test theory. Applied Measurement in Education, 1, 79–88. (Request reprint.) [2]
Loevinger, J. (1948). The technic of homogeneous tests compared with some aspects of scale analysis and factor analysis. Psychological Bulletin, 45, 507–529.
Robinson, J.P. (1973). Toward a More Appropriate Use of Guttman Scaling. Public Opinion Quarterly, 37(2) (Summer 1973), pp. 260–267.
Schooler, C. (1968). A Note of Extreme Caution on the Use of Guttman Scales. American Journal of Sociology, 74(3) (Nov. 1968), 296–301.
External links
Guttman scaling description [3]
References
[1] http://www.visualstatistics.net/Scaling/Order%20Analysis/Order%20Analysis.htm
[2] http://www.visualstatistics.net/Scaling/Homogeneity/Homogeneity.htm
[3] http://www.socialresearchmethods.net/kb/scalgutt.htm
High-stakes testing
A high-stakes test is a test with important consequences for the test taker.[1] Passing has important benefits, such as a high school diploma, a scholarship, or a license to practice a profession. Failing has important disadvantages, such as being forced to take remedial classes until the test can be passed, not being allowed to drive a car, or not being able to find employment. The use and misuse of high-stakes tests is a controversial topic in public education, especially in the United States, where they have become especially popular in recent years, used not only to assess students but in attempts to increase teacher accountability.[2]
A driving test is a high-stakes test: Without passing the test, the test taker cannot obtain a driver's license.
Definitions
In common usage, a high-stakes test is any test that has major consequences or is the basis of a major decision. Under a more precise definition, a high-stakes test is any test that:

is a single, defined assessment,
has a clear line drawn between those who pass and those who fail, and
has direct consequences for passing or failing (something "at stake").

High-stakes testing is not synonymous with high-pressure testing. An American high school student might feel pressure to perform well on the SAT-I college aptitude exam. However, SAT scores do not directly determine admission to any college or university, and there is no clear line drawn between those who pass and those who fail, so it is not formally considered a high-stakes test.[3][4] On the other hand, because SAT-I scores are given significant weight in the admissions process at some schools, many people believe that it has consequences for doing well or poorly and is therefore a high-stakes test under the simpler, common definition.[5][6]
The stakes
High stakes are not a characteristic of the test itself, but rather of the consequences placed on the outcome. For example, no matter what test is used (written multiple choice, oral examination, performance test), a medical licensing test must be passed to practice medicine.

The perception of the stakes may vary. For example, college students who wish to skip an introductory-level course are often given exams to see whether they have already mastered the material and can be passed to the next level. Passing the exam can reduce tuition costs and time spent at university. A student who is anxious to have these benefits may consider the test to be a high-stakes exam. Another student, who places no importance on the outcome, so long as he is placed in a class that is appropriate to his skill level, may consider the same exam to be a low-stakes test.

The phrase "high stakes" is derived directly from a gambling term. In gambling, a stake is the quantity of money or other goods that is risked on the outcome of some specific event. A high-stakes game is one in which, in the player's personal opinion, a large quantity of money is being risked. The term is meant to imply that implementing such a system introduces uncertainty and potential losses for test takers,[citation needed] who must pass the exam to "win," instead of being able to obtain the goal through other means.[citation needed]

Examples of high-stakes tests and their "stakes" include:

Driver's license tests and the legal ability to drive
Theater auditions and the part in the performance
College entrance examinations in some countries, such as Japan's Common first-stage exam, and admission to a high-quality university
Many job interviews or drug tests and being hired
High school exit examinations and high-school diplomas
Progression from one grade to another in primary and secondary school
No Child Left Behind tests and school funding and ratings
Ph.D. oral exams and the dissertation
Professional licensing and certification examinations (such as the bar exams, FAA written tests, and medical exams) and the license or certification being sought
The Test of English as a Foreign Language (TOEFL) and recognition as a speaker of English (if a minimum score is required, but not if it is used merely for information, as is normal in work and school placement contexts)
Stakeholders
A high-stakes system may be intended to benefit people other than the test-taker. For professional certification and licensure examinations, the purpose of the test is to protect the general public from incompetent practitioners. The individual stakes of the medical student and the medical school are, hopefully, balanced against the social stakes of possibly allowing an incompetent doctor to practice medicine.[7] A test may be "high-stakes" based on consequences for others beyond the individual test-taker.[] For example, an individual medical student who fails a licensing exam will not be able to practice his or her profession. However, if enough students at the same school fail the exam, then the school's reputation and accreditation may be in jeopardy. Similarly, testing under the U.S.'s No Child Left Behind Act has no direct negative consequences for failing students,[8] but potentially serious consequences for their schools, including loss of accreditation, funding, teacher pay, teacher employment, or changes to the school's management.[9] The stakes are therefore high for the school, but low for the individual test-takers.
Criticism
High-stakes tests are often criticized for the following reasons:

The test does not correctly measure the individual's knowledge or skills. For example, a test might purport to be a general reading-skills test, but it might actually determine whether or not the examinee has read a specific book.
The test may not measure what the critic wants measured. For example, a test might accurately measure whether a law student has acquired fundamental knowledge of the legal system, but the critic might want students to be tested on legal ethics instead of legal knowledge.
Testing causes stress for some students. Critics suggest that since some people perform poorly under the pressure associated with tests, any test is likely to be less representative of their actual standard of achievement than a non-test alternative. This is called test anxiety or performance anxiety.
High-stakes tests are often given as a single long exam. Some critics prefer continuous assessment instead of one larger test. For example, the American Psychological Association (APA) opposes high school exit examinations, saying, "Any decision about a student's continued education, such as retention, tracking, or graduation, should not be based on the results of a single test, but should include other relevant and valid information." Since the stakes are related to consequences, not method, however, short tests can also be high-stakes.
High-stakes testing creates more incentive for cheating. Because cheating on a single critical exam may be easier than either learning the required material or earning credit through attendance, diligence, or many smaller tests, more examinees who do not actually have the necessary knowledge or skills, but who are effective cheaters, may pass. Also, some people who would otherwise pass the test but are not confident enough of themselves might decide to additionally secure the outcome by cheating, get caught, and often face even worse consequences than just failing. Additionally, if the test results are used to determine the teachers' pay or continued employment, or to evaluate the school, then school personnel may fraudulently alter student test papers to artificially inflate student performance.
Sometimes a high-stakes test is tied to a controversial reward. For example, some people may want a high-school diploma to represent the verified acquisition of specific skills or knowledge, and therefore use a high-stakes assessment to deny a diploma to anyone who cannot perform the necessary skills. Others may want a high school diploma to represent primarily a certificate of attendance, so that a student who faithfully attended school but cannot read or write will still get the social benefits of graduation.[citation needed] This use of tests to deny a high school diploma, and thereby access to most jobs and higher education for a lifetime, is controversial even when the test itself accurately identifies students who do not have the necessary skills. Criticism is usually framed as over-reliance on a single measurement[10] or in terms of social justice, if the absence of skill is not entirely the test taker's fault, as in the case of a student who cannot read because of unqualified teachers, or an elderly person with advanced dementia who can no longer pass a driving exam due to loss of cognitive function.
Tests can penalize test takers who do not have the necessary skills through no fault of their own. An absence of skill may not be the test taker's fault, but high-stakes tests measure only skill proficiency, regardless of whether the test takers had an equal opportunity to learn the material.[11] Additionally, wealthy students may use private tutoring or test preparation programs to improve their scores. Some affluent parents pay thousands of dollars to prepare their children for tests.[12] Critics see this as being unfair to students who cannot afford additional educational services.
High-stakes tests reveal that some examinees do not know the required material, or do not have the necessary skills. While failing these people may have many public benefits, the consequences of repeated failure can be very high for the individual. For example, a person who fails a practical driving exam will not be able to drive a car legally, which means he cannot drive to work and may lose his job if alternative transportation options are not available. The person may suffer social embarrassment when his acquaintances discover that his lack of skill resulted in the loss of his driver's license. In the context of high school exit exams, poorly performing school districts have formally opposed high-stakes testing after low test results, which accurately and publicly exposed the districts' failures, proved to be politically embarrassing,[13] and have criticized high-stakes tests for identifying students who lack the required knowledge.
References
[2] Rosemary Sutton & Kelvin Seifert (2009). Educational Psychology, 2nd Edition: Chapter 1: The Changing Teaching Profession and You, pp. 1–4 (http://www.saylor.org/site/wp-content/uploads/2012/06/Educational-Psychology.pdf)
[7] Mehrens, W.A. (1995). Legal and Professional Bases for Licensure Testing. In Impara, J.C. (Ed.), Licensure testing: Purposes, procedures, and practices, pp. 33-58. Lincoln, NE: Buros Institute.
Further reading
Featherston, Mark Davis, 2011. "High-Stakes Testing Policy in Texas: Describing the Attitudes of Young College Graduates." (http://ecommons.txstate.edu/arp/350) Applied Research Projects, Texas State University-San Marcos.
Historiometry
Historiometry is the historical study of human progress or individual personal characteristics, using statistics to analyze references to geniuses,[1] their statements, behavior and discoveries in relatively neutral texts. Historiometry combines techniques from cliometrics, which studies the history of economics and from psychometrics, the psychological study of an individual's personality and abilities.
Origins
Historiometry started in the early 19th century with studies on the relationship between age and achievement by the Belgian mathematician Adolphe Quetelet in the careers of prominent French and English playwrights,[2][3] but it was Sir Francis Galton, a pioneering English eugenicist, who popularized historiometry in his 1869 work, Hereditary Genius.[4] It was further developed by Frederick Adams Woods (who coined the term historiometry[5][6]) at the beginning of the 20th century.[7] The psychologist Paul E. Meehl also published several papers on historiometry later in his career, mainly in the area of medical history, although he usually referred to it as cliometric metatheory.[8][9] Historiometry was the first field to study genius by using scientific methods.[1]
Current research
Prominent current historiometry researchers include Dean Keith Simonton and Charles Murray. Historiometry is defined by Dean Keith Simonton as a quantitative method of statistical analysis for retrospective data. In Simonton's work the raw data come from psychometric assessment of famous personalities, often already deceased, in an attempt to assess creativity, genius and talent development.[10] Charles Murray's Human Accomplishment is one example of this approach to quantifying the impact of individuals on technology, science and the arts. It tracks the most important achievements across time, and for the different peoples of the world, and provides a thorough discussion of the methodology used, together with an assessment of its reliability and accuracy.
Examples of research
Since historiometry deals with subjective personal traits such as creativity, charisma or openness, most studies deal with the comparison of scientists, artists or politicians. The study Human Accomplishment by Charles Murray classifies, for example, Einstein and Newton as the most important physicists and Michelangelo as the top-ranking Western artist. As another example, several studies have compared the charisma and even the IQ of presidents and presidential candidates of the United States of America.[11] The latter study classifies John Quincy Adams as the most intelligent US president, with an estimated IQ between 165 and 175.
Critique
Since historiometry is based on indirect information like historical documents and relies heavily on statistics, the results of these studies are questioned by some researchers, mainly because of concerns about over-interpretation of the estimated results.[12][13] The previously mentioned study of the intellectual capacity of US presidents, a study by Dean Keith Simonton, attracted a lot of media attention and criticism, mainly because it classified the former US president George W. Bush as second to last of all US presidents since 1900.[14] The IQ of G.W. Bush was estimated as between 111.1 and 138.5, with an average of 125, exceeding only that of President Warren Harding, who is regarded as a failed president, with an average estimated IQ of 124. Although controversial and imprecise (due to gaps in available data), the approach used by Simonton to generate his results was regarded as "reasonable" by fellow researchers.[15] In the media, the study was sometimes compared with the U.S. Presidents IQ hoax, a hoax that circulated via email in mid-2001, which suggested that G.W. Bush had the lowest IQ of all US presidents.[16]
References
[1] A Reflective Conversation with Dean Keith Simonton. North American Journal of Psychology, 2008, Vol. 10, No. 3, 595–602.
External links
History and Mathematics (http://urss.ru/cgi-bin/db.pl?cp=&page=Book&id=53184&lang=en&blang=en& list=Found)
House-Tree-Person test
The House-Tree-Person test (HTP) is a projective test designed to measure aspects of a person's personality. The test can also be used to assess brain damage and general mental functioning. The test is a diagnostic tool for clinical psychologists, educators, and employers. The subject receives a short, unclear instruction (the stimulus) to draw a house, a tree, and the figure of a person. Once the subject is done, he is asked to describe the pictures he has drawn. The assumption is that when the subject is drawing, he is projecting his inner world onto the page. The administrator of the test uses tools and skills that have been established for the purpose of investigating the subject's inner world through the drawings.
Generally this test is administered as part of a series of personality and intelligence tests, like the Rorschach, TAT (or CAT for children), Bender, and Wechsler tests. The examiner integrates the results of these tests, creating a basis for evaluating the subject's personality from a cognitive, emotional, and intra- and interpersonal perspective. The test and its method of administration have been criticized for having substantial weaknesses in validity, but a number of researchers in the past few decades have found positive results as regards its validity for specific populations.[citation needed]
History
The HTP was designed by John Buck and was originally based on the Goodenough scale of intellectual functioning. It was developed in 1948 and updated in 1969. Buck included both qualitative and quantitative measurements of intellectual ability in the HTP. A 350-page manual was written by Buck to instruct the test-giver on proper grading of the HTP, which is more subjective than quantitative. More recently, Zoltán Vass published a more sophisticated approach, based on system analysis (SSCA, Seven-Step Configuration Analysis [1]).
After the person is drawn, the subject is asked questions such as: Who is the person? How old is the person? What do they like and dislike doing? Has anyone tried to hurt them? Who looks out for them?
Interpretation of results
By virtue of being a projective test, the results of the HTP are subjective and open to interpretation by the administrator of the exam. The subjective analysis of the test taker's responses and drawings aims to make inferences about personality traits and past experiences. The subjective nature of this aspect of the HTP, as with other qualitative tests, has little empirical evidence to support its reliability or validity. This test, however, is still considered an accurate measure of brain damage and is used in the assessment of schizophrenic patients also suffering from brain damage. In addition, the quantitative measure of intelligence for the House-Tree-Person test has been shown to correlate highly with the WAIS and other well-established intelligence tests.
References
[1] http://www.freado.com/read/11970/a-psychological-interpretation-of-drawings-and-paintings
Idiographic image
In the field of clinical sciences, an idiographic image (from Greek ídios + graphikós, meaning "to describe a peculiarity") is the representation of a result which has been obtained by a study or research method whose subject-matters are specific cases, i.e. a portrayal which avoids nomothetic generalizations. "Diagnostic formulation follows an idiographic criterion, while diagnostic classification follows a nomothetic criterion."[1]

In the fields of psychiatry, psychology and clinical psychopathology, the idiographic criterion is a method (also called the historical method) which involves evaluating past experiences and selecting and comparing information about a specific individual or event. An example of an idiographic image is a report, diagram or health history showing medical, psychological and pathological features which make the subject under examination unique. "Where there is no prior detailed presentation of clinical data, the summary should present sufficient relevant information to support the diagnostic and aetiological components of the formulation. The term diagnostic formulation is preferable to diagnosis, because it emphasises that matters of clinical concern about which the clinician proposes aetiological hypotheses and targets of intervention include much more than just diagnostic category assignment, though this is usually an important component."[2]

The expression idiographic image appeared for the first time in 1996 in the SESAMO research method manual.[3] The term was coined to mean that the report of the test provided an anamnestic account containing the family, relational and health history of the subject and providing semiological data regarding both the psychosexual and the social-affective profile. These profiles were useful to the clinician in order to formulate pathogenetic and pathognomonic hypotheses.[4]
Bibliography
[1] Battacchi M.W. (1990). Trattato enciclopedico di psicologia dell'età evolutiva. Piccin, Padova. ISBN 88-299-0206-3
[2] Shields R. Emergency psychiatry. Review of psychiatry. Australian and New Zealand Journal of Psychiatry, 37, 4, 498-499, 2003. (http://member.melbpc.org.au/~rshields/psychiatricformulation.html)
[3] Boccadoro L. (1996). SESAMO: Sexuality Evaluation Schedule Assessment Monitoring. Approccio differenziale al profilo idiografico psicosessuale e socioaffettivo. O.S., Firenze. IT\ICCU\CFI\0327719
[4] Boccadoro L., Carulli S. (2008). The place of the denied love. Sexuality and secret psychopathologies (Abstract in English, Spanish and Italian) (http://sexology.interfree.it/abstract_english.html). Edizioni Tecnoprint, Ancona. ISBN 978-88-95554-03-7
External links
Glossario di Sessuologia clinica (http://sexology.it/glossario_sessuologia.html) - Glossary of clinical sexology (in Italian)
Intelligence quotient
[Figure: An example of one kind of IQ test item, modeled after items in the Raven's Progressive Matrices test.]
An intelligence quotient, or IQ, is a score derived from one of several standardized tests designed to assess intelligence. The abbreviation "IQ" comes from the German term Intelligenz-Quotient, originally coined by the psychologist William Stern. When modern IQ tests are devised, the mean (average) score within an age group is set to 100 and the standard deviation (SD) almost always to 15, although this was not always so historically. Thus, the intention is that approximately 95% of the population scores within two SDs of the mean, i.e. has an IQ between 70 and 130.

IQ scores have been shown to be associated with such factors as morbidity and mortality,[3] parental social status,[4] and, to a substantial degree, biological parental IQ. While the heritability of IQ has been investigated for nearly a century, there is still debate about the significance of heritability estimates[5][6] and the mechanisms of inheritance. IQ scores are used as predictors of educational achievement, special needs, job performance and income. They are also used to study IQ distributions in populations and the correlations between IQ and other variables. The average IQ scores for many populations have been rising at an average rate of three points per decade since the early 20th century, a phenomenon called the Flynn effect. It is disputed whether these changes in scores reflect real changes in intellectual abilities.
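Under the mean-100, SD-15 convention, population shares and percentile ranks follow directly from the normal model; a minimal sketch using only the Python standard library (the cutoffs are just the conventional ones quoted above):

```python
from math import erf, sqrt

def iq_cdf(x, mean=100.0, sd=15.0):
    # Proportion of the population scoring below x under a normal model.
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

# Share of people between 70 and 130, i.e. within two SDs of the mean:
print(round(iq_cdf(130) - iq_cdf(70), 4))   # ~0.9545

# Percentile rank of an IQ score of 130:
print(round(100 * iq_cdf(130), 1))          # ~97.7
```

The "approximately 95%" quoted above is this 95.45% figure rounded.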
History
Early history
The first large-scale mental test may have been the imperial examination system in China. According to the psychologist Robert Sternberg, the ancient Chinese game known in the West as the tangram was designed to evaluate a person's intelligence, along with the game jiulianhuan, or nine linked rings. Sternberg states that it is considered "the earliest psychological test in the world," although one made for entertainment rather than analysis.

Modern mental testing began in France in the 19th century. It contributed to separating mental retardation from mental illness and reducing the neglect, torture, and ridicule heaped on both groups.[7] The Englishman Francis Galton coined the terms psychometrics and eugenics, and developed a method for measuring intelligence based on nonverbal sensory-motor tests. It was initially popular, but was abandoned after the discovery that it had no relationship to outcomes such as college grades.[7][8]

The French psychologist Alfred Binet, together with psychologists Victor Henri and Théodore Simon, after about 15 years of development, published the Binet-Simon test in 1905, which focused on verbal abilities. It was intended to identify mental retardation in school children.[7] The score on the Binet-Simon scale would reveal the child's mental age. For example, a six-year-old child who passed all the tasks usually passed by six-year-olds, but nothing beyond, would have a mental age that exactly matched his chronological age, 6.0 (Fancher, 1985). In Binet's view, there were limitations with the scale, and he stressed what he saw as the remarkable diversity of intelligence and the subsequent need to study it using qualitative, as opposed to quantitative, measures (White, 2000).

The American psychologist Henry H. Goddard published a translation of the test in 1910. The eugenics movement in the USA seized on it as a means to gain credibility in diagnosing mental retardation, and thousands of American women, most of them poor African Americans, were forcibly sterilized based on their scores on IQ tests, often without their consent or knowledge.[9] The American psychologist Lewis Terman at Stanford University revised the Binet-Simon scale, which resulted in the Stanford-Binet Intelligence Scales (1916). It became the most popular test in the United States for decades.[7][10][11][12]
Cattell–Horn–Carroll theory
Raymond Cattell (1941) proposed two types of cognitive abilities in a revision of Spearman's concept of general intelligence. Fluid intelligence (Gf) was hypothesized as the ability to solve novel problems by using reasoning, and crystallized intelligence (Gc) was hypothesized as a knowledge-based ability that was very dependent on education and experience. In addition, fluid intelligence was hypothesized to decline with age, while crystallized intelligence was largely resistant. The theory was almost forgotten, but was revived by his student John L. Horn (1966) who later argued Gf and Gc were only two among several factors, and he eventually identified 9 or 10 broad abilities. The theory continued to be called Gf-Gc theory.[7] John B. Carroll (1993), after a comprehensive reanalysis of earlier data, proposed the Three Stratum theory, which is a hierarchical model with three levels. The bottom stratum consists of narrow abilities that are highly specialized
(e.g., induction, spelling ability). The second stratum consists of broad abilities; Carroll identified eight second-stratum abilities. Carroll accepted Spearman's concept of general intelligence, for the most part, as a representation of the uppermost, third stratum. More recently (1999), a merging of the Gf-Gc theory of Cattell and Horn with Carroll's three-stratum theory has led to the Cattell-Horn-Carroll (CHC) theory. It has greatly influenced many of the current broad IQ tests.[7]
It is argued that this reflects much of what is known about intelligence from research. A hierarchy of factors is used; g is at the top. Under it are 10 broad abilities that in turn are subdivided into 70 narrow abilities. The broad abilities are:[7]
Fluid intelligence (Gf): the broad ability to reason, form concepts, and solve problems using unfamiliar information or novel procedures.
Crystallized intelligence (Gc): the breadth and depth of a person's acquired knowledge, the ability to communicate one's knowledge, and the ability to reason using previously learned experiences or procedures.
Quantitative reasoning (Gq): the ability to comprehend quantitative concepts and relationships and to manipulate numerical symbols.
Reading and writing ability (Grw): basic reading and writing skills.
Short-term memory (Gsm): the ability to apprehend and hold information in immediate awareness, and then use it within a few seconds.
Long-term storage and retrieval (Glr): the ability to store information and fluently retrieve it later in the process of thinking.
Visual processing (Gv): the ability to perceive, analyze, synthesize, and think with visual patterns, including the ability to store and recall visual representations.
Auditory processing (Ga): the ability to analyze, synthesize, and discriminate auditory stimuli, including the ability to process and discriminate speech sounds that may be presented under distorted conditions.
Processing speed (Gs): the ability to perform automatic cognitive tasks, particularly when measured under pressure to maintain focused attention.
Decision/reaction time/speed (Gt): the immediacy with which an individual can react to stimuli or a task (typically measured in seconds or fractions of seconds; not to be confused with Gs, which typically is measured in intervals of 2-3 minutes). See Mental chronometry.
Modern tests do not necessarily measure all of these broad abilities. For example, Gq and Grw may be seen as measures of school achievement and not IQ.[7] Gt may be difficult to measure without special equipment. g was earlier often subdivided into only Gf and Gc, which were thought to correspond to the nonverbal or performance subtests and the verbal subtests in earlier versions of the popular Wechsler IQ test. More recent research has shown the situation to be more complex.[7] Modern comprehensive IQ tests no longer give only a single score. Although they still give an overall score, they now also give scores for many of these more restricted abilities, identifying particular strengths and weaknesses of an individual.[7]
Other theories
J.P. Guilford's Structure of Intellect (1967) model used three dimensions which, when combined, yielded a total of 120 types of intelligence. It was popular in the 1970s and early 1980s, but faded due to both practical problems and theoretical criticisms.[7] Alexander Luria's earlier work on neuropsychological processes led to the PASS theory (1997). It argued that only looking at one general factor was inadequate for researchers and clinicians who worked with learning disabilities, attention disorders, mental retardation, and interventions for such disabilities. The PASS model covers four kinds of processes (planning, attention/arousal, simultaneous processing, and successive processing). The planning processes involve decision making, problem solving, and performing activities, and require goal setting
and self-monitoring. The attention/arousal process involves selectively attending to a particular stimulus, ignoring distractions, and maintaining vigilance. Simultaneous processing involves the integration of stimuli into a group and requires the observation of relationships. Successive processing involves the integration of stimuli into serial order. The planning and attention/arousal components come from structures located in the frontal lobe, and the simultaneous and successive processes come from structures located in the posterior region of the cortex. It has influenced some recent IQ tests, and has been seen as a complement to the Cattell-Horn-Carroll theory described above.[7]
Modern tests
Well-known modern IQ tests include Raven's Progressive Matrices, Wechsler Adult Intelligence Scale, Wechsler Intelligence Scale for Children, Stanford-Binet, Woodcock-Johnson Tests of Cognitive Abilities, and Kaufman Assessment Battery for Children. Approximately 95% of the population have scores within two standard deviations (SD) of the mean. If one SD is 15 points, as is common in almost all modern tests, then 95% of the population are within a range of 70 to 130, and 98% are below 131. Alternatively, two-thirds of the population have IQ scores within one SD of the mean, i.e. within the range 85-115. IQ scales are ordinally scaled.[15][16][17][18] While one standard deviation is 15 points, and two SDs are 30 points, and so on, this does not imply that mental ability is linearly related to IQ, such that IQ 50 means half the cognitive ability of IQ 100. In particular, IQ points are not percentage points. The correlation between IQ test results and achievement test results is about 0.7.[7][19]
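The percentages quoted above follow directly from the normal model used to scale modern tests. A minimal Python sketch (using scipy; the mean of 100 and SD of 15 are the conventions described in the text) reproduces them:

```python
from scipy.stats import norm

# Modern-test convention described above: mean 100, SD 15
iq = norm(loc=100, scale=15)

# Share of the population within two SDs of the mean (IQ 70-130)
print(iq.cdf(130) - iq.cdf(70))   # ~0.954, i.e. roughly 95%

# Share scoring below 131
print(iq.cdf(131))                # ~0.980, i.e. roughly 98%

# Share within one SD of the mean (IQ 85-115)
print(iq.cdf(115) - iq.cdf(85))   # ~0.683, i.e. about two-thirds
```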
The values of 100 and 15 were chosen to produce scores roughly comparable to those of the older type of test. Likely as part of the rivalry between the Binet and the Wechsler, the Stanford-Binet until 2003 used 16 for one SD, causing considerable confusion. Today, almost all tests use 15 for one SD. Modern scores are sometimes referred to as "deviation IQs," while older-method age-specific scores are referred to as "ratio IQs."[7][21]
Flynn effect
Since the early 20th century, raw scores on IQ tests have increased in most parts of the world. When a new version of an IQ test is normed, the standard scoring is set so that performance at the population median results in a score of IQ 100. The phenomenon of rising raw-score performance means that if test-takers are scored by a constant standard scoring rule, IQ test scores have been rising at an average rate of around three IQ points per decade. The phenomenon was named the Flynn effect in the book The Bell Curve after James R. Flynn, the author who did the most to bring it to the attention of psychologists.
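As a rough illustration of the renorming arithmetic, the sketch below restates a score against newer norms under the simplifying assumptions that the three-points-per-decade drift cited above is constant and linear; the function and its parameters are illustrative, not part of any published scoring rule:

```python
def flynn_adjust(score: float, norm_year: int, test_year: int,
                 rate_per_decade: float = 3.0) -> float:
    """Restate a score obtained against norms set in `norm_year` as if it
    had been scored against norms set in `test_year`, assuming the
    population median rises linearly at `rate_per_decade` IQ points."""
    drift = rate_per_decade * (test_year - norm_year) / 10
    return score - drift

# A score of 100 against 20-year-old norms corresponds to roughly 94
# against freshly set norms, since the median itself has risen ~6 points.
print(flynn_adjust(100, norm_year=1990, test_year=2010))  # 94.0
```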
Researchers have been exploring whether the Flynn effect is equally strong for all kinds of IQ test items, whether the effect may have ended in some developed nations, whether there are social subgroup differences in the effect, and what its possible causes might be. Flynn's observations have prompted much new research in psychology and "demolish some long-cherished beliefs, and raise a number of other interesting issues along the way."
IQ and age
IQ can change to some degree over the course of childhood.[24] However, in one longitudinal study, the mean IQ scores of tests at ages 17 and 18 were correlated at r=.86 with the mean scores of tests at ages five, six and seven, and at r=.96 with the mean scores of tests at ages 11, 12 and 13.[25] IQ scores for children are relative to children of a similar age. That is, a child of a certain age does not do as well on the tests as an older child or an adult with the same IQ. But, relative to persons of a similar age, or other adults in the case of adults, they do equally well if the IQ scores are the same.[25] To convert a child's IQ score into an adult score, the child's score is scaled by age: adult IQ = (child's IQ × child's age) / 16. The number 16 is used to indicate the age at which the IQ supposedly reaches its peak.[26] For decades, practitioners' handbooks and textbooks on IQ testing reported IQ declines with age after the beginning of adulthood. However, later researchers pointed out that this phenomenon is related to the Flynn effect and is in part a cohort effect rather than a true aging effect. A variety of studies of IQ and aging have been conducted since the norming of the first Wechsler Intelligence Scale drew attention to IQ differences in different age groups of adults. The current consensus is that fluid intelligence generally declines with age after early adulthood, while crystallized intelligence remains intact. Both cohort effects (the birth year of the test-takers) and practice effects (test-takers taking the same form of IQ test more than once) must be controlled to gain accurate data. It is unclear whether any lifestyle intervention can preserve fluid intelligence into older ages.[27] The exact peak age of fluid or crystallized intelligence remains elusive. Cross-sectional studies usually show that fluid intelligence in particular peaks at a relatively young age (often in early adulthood), while longitudinal data mostly show that intelligence is stable until mid-adulthood or later. Subsequently, intelligence seems to decline slowly.
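A short worked sketch of the ratio-IQ arithmetic above (Python; the conversion formula is the one reconstructed in the text, with 16 as the supposed peak age):

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Classic ratio IQ: mental age divided by chronological age, times 100."""
    return 100 * mental_age / chronological_age

def child_to_adult_iq(child_iq: float, child_age: float,
                      peak_age: float = 16) -> float:
    """Convert a child's ratio IQ to an adult-scale score, assuming IQ
    peaks at `peak_age` as the text supposes."""
    return child_iq * child_age / peak_age

# The six-year-old from Binet's example: mental age 6 at age 6 -> IQ 100
print(ratio_iq(mental_age=6, chronological_age=6))     # 100.0
# A child's score of 120 at age 14 scales to 105 on the adult scale
print(child_to_adult_iq(child_iq=120, child_age=14))   # 105.0
```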
Heritability
Heritability is defined as the proportion of variance in a trait which is attributable to genotype within a defined population in a specific environment. A number of points must be considered when interpreting heritability.[28] Heritability measures the proportion of variation in a trait that can be attributed to genes, not the proportion of a trait caused by genes. The value of heritability can change if the impact of environment (or of genes) in the population is substantially altered. A high heritability of a trait does not mean that environmental effects, such as learning, are not involved. Since heritability increases during childhood and adolescence, one should be cautious in drawing conclusions regarding the role of genetics and environment from studies where the participants are not followed until they are adults. Studies have found the heritability of IQ in adult twins to be 0.7 to 0.8, and in child twins about 0.45, in the Western world.[25][29][30] It may seem reasonable to expect genetic influences on traits like IQ to become less important as one gains experience with age. However, the opposite occurs. Heritability measures in infancy are as low as 0.2,
around 0.4 in middle childhood, and as high as 0.8 in adulthood. One proposed explanation is that people with different genes tend to reinforce the effects of those genes, for example by seeking out different environments.[25] Debate is ongoing about whether these heritability estimates are too high because they do not adequately consider various factors, such as the possibility that the environment is relatively more important in families with low socioeconomic status, or the effect of the maternal (fetal) environment. Recent research suggests that the molecular genetics of psychology and social science requires approaches that go beyond the examination of candidate genes.[31]
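The twin-based estimates quoted above are commonly obtained by comparing identical (MZ) and fraternal (DZ) twin correlations. A minimal sketch of Falconer's formula, one standard estimator (not named in the text; the input correlations below are illustrative):

```python
def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate of heritability: h^2 = 2 * (r_MZ - r_DZ),
    where r_MZ and r_DZ are the trait correlations of identical and
    fraternal twin pairs."""
    return 2 * (r_mz - r_dz)

# Illustrative correlations in the neighbourhood that would yield the
# adult heritability of ~0.8 cited above.
print(falconer_h2(r_mz=0.86, r_dz=0.46))  # 0.8
```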
Individual genes
A very large proportion of the over 17,000 human genes are thought to have an impact on the development and functionality of the brain.[34] A number of individual genes have been reported to be associated with IQ; examples include CHRM2, microcephalin, and ASPM. However, Deary and colleagues (2009) argued that no such finding has been reliably replicated,[35] a conclusion supported by Chabris et al. (2012).[36] Recently, FNBP1L polymorphisms, specifically the SNP rs236330, have been associated with normally varying intelligence differences in adults and in children.[37]
Gene-environment interaction
David Rowe reported an interaction of genetic effects with socioeconomic status (SES), such that heritability was high in high-SES families but much lower in low-SES families. This has been replicated in infants,[38] children,[39] and adolescents[40] in the US, though not outside the US; for instance, a reverse result was reported in the UK. Dickens and Flynn (2001) have argued that genes for high IQ initiate environment-shaping feedback: genetic effects cause bright children to seek out more stimulating environments, which further increase IQ. In their model, environmental effects decay over time (the model could be adapted to include possible factors, like nutrition in early childhood, that may cause permanent effects). The Flynn effect can be explained by a generally more stimulating environment for all people. The authors suggest that programs aiming to increase IQ would be most likely to produce long-term IQ gains if they caused children to persist in seeking out cognitively demanding experiences.[41]
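A toy simulation of the Dickens-Flynn feedback-with-decay idea described above; the decay and gain parameters and the update rule are illustrative assumptions, not the published model:

```python
def environment_boost(seeking: float, years: int,
                      decay: float = 0.5, gain: float = 1.0) -> float:
    """Each year the environmental boost to ability decays, but is topped
    up in proportion to how strongly the person seeks out stimulating
    environments. All parameters are illustrative."""
    boost = 0.0
    for _ in range(years):
        boost = decay * boost + gain * seeking
    return boost

# With persistent seeking the boost converges to gain * seeking / (1 - decay);
# if seeking stops (seeking = 0), the boost decays back toward zero.
print(round(environment_boost(seeking=1.0, years=30), 2))  # 2.0
```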
Interventions
In general, educational interventions such as those described below have shown short-term effects on IQ, but long-term follow-up is often missing. For example, in the US, very large intervention programs such as the Head Start Program have not produced lasting gains in IQ scores. More intensive, but much smaller, projects such as the Abecedarian Project have reported lasting effects, often on socioeconomic status variables rather than on IQ.[25] A placebo-controlled double-blind experiment found that vegetarians who took 5 grams of creatine per day for six weeks showed a significant improvement on two separate tests of fluid intelligence, Raven's Progressive Matrices and the backward digit span test from the WAIS. The treatment group was able to repeat longer sequences of numbers from memory and had higher overall IQ scores than the control group. The researchers concluded that "supplementation with creatine significantly increased intelligence compared with placebo."[42] A subsequent study found that creatine supplements improved cognitive ability in the elderly.[43] However, a study on young adults (0.03 g/kg/day for six weeks, e.g., 2 g/day for a 150-pound individual) failed to find any improvements.[44] Recent studies have shown that training in using one's working memory may increase IQ. A study on young adults published in April 2008 by a team from the Universities of Michigan and Bern supports the possibility of the transfer of fluid intelligence from specifically designed working memory training.[45] Further research will be needed to determine the nature, extent and duration of the proposed transfer. Among other questions, it remains to be seen whether the results extend to kinds of fluid intelligence tests other than the matrix test used in the study, and if so, whether, after training, fluid intelligence measures retain their correlation with educational and occupational achievement, or whether the value of fluid intelligence for predicting performance on other tasks changes. It is also unclear whether the training is durable over extended periods of time.[46]
Music and IQ
Musical training in childhood has been found to correlate with higher-than-average IQ. A 2004 study indicated that six-year-old children who received musical training (voice or piano lessons) had an average increase in IQ of 7.0 points, while children who received alternative training (i.e., drama) or no training had an average increase in IQ of only 4.3 points (which may be a consequence of the children entering grade school), as indicated by full-scale IQ. Children were tested using the Wechsler Intelligence Scale for Children (Third Edition), the Kaufman Test of Educational Achievement, and the Parent Rating Scale of the Behavioral Assessment System for Children. Listening to classical music has also been reported to increase IQ, specifically spatial ability. In 1994, Frances Rauscher and Gordon Shaw reported that college students who listened to 10 minutes of Mozart's Sonata for Two Pianos showed an increase of 8 to 9 points on the spatial subtest of the Stanford-Binet Intelligence Scale.[47] The phenomenon was dubbed the Mozart effect. Multiple attempted replications (e.g.,[48]) have shown that this is at best a short-term effect (lasting no longer than 10 to 15 minutes), and is not related to any IQ increase.[49]
Music lessons
In 2004, Schellenberg devised an experiment to test his hypothesis that music lessons can enhance the IQ of children. He assigned 144 six-year-old children to four groups (keyboard lessons, vocal lessons, drama lessons, or no lessons at all) for 36 weeks. The children's IQ was measured both before and after the lessons using the Wechsler Intelligence Scale for Children (Third Edition), the Kaufman Test of Educational Achievement, and the Parent Rating Scale of the Behavioral Assessment System for Children. All four groups had increases in IQ, most likely a result of entering grade school. The notable difference between the two music groups and the two control groups was a slightly higher increase in IQ: the children in the control groups on average had an increase in IQ of 4.3 points, while the increase for the music groups was 7.0 points. Though the increases in IQ were not dramatic, one can still conclude that music lessons have a positive effect
for children, if taken at a young age. It is hypothesized that improvements in IQ occur after music lessons because the lessons encourage multiple experiences that generate progress in a wide range of abilities. Testing this hypothesis, however, has proven difficult.[50] Another study, also performed by Schellenberg, tested the effects of musical training in adulthood. He compared two groups of adults, one musically trained and one not. He administered tests of intelligence quotient and emotional intelligence to both groups and found that the trained participants had an advantage in IQ over the untrained subjects even with gender, age, and environmental factors (e.g., income, parents' education) held constant. The two groups, however, scored similarly on the emotional intelligence test. The results (like the previous results) show that there is a positive correlation between musical training and IQ, but it is not evident that musical training has a positive effect on emotional intelligence.[51]
Health and IQ
Health is important in understanding differences in IQ test scores and other measures of cognitive ability. Several factors can lead to significant cognitive impairment, particularly if they occur during pregnancy and childhood, when the brain is growing and the blood-brain barrier is less effective. Such impairment may sometimes be permanent, or may sometimes be partially or wholly compensated for by later growth.[citation needed] Developed nations have implemented several health policies regarding nutrients and toxins known to influence cognitive function. These include laws requiring fortification of certain food products and laws establishing safe levels of pollutants (e.g. lead, mercury, and organochlorides). Improvements in nutrition, and in public policy in general, have been implicated in worldwide IQ increases.[citation needed] Cognitive epidemiology is a field of research that examines the associations between intelligence test scores and health. Researchers in the field argue that intelligence measured at an early age is an important predictor of later health and mortality differences.
Social outcomes
Intelligence is a better predictor of educational and work success than any other single score. Some measures of educational aptitude, such as the SAT, are essentially IQ tests; for instance, Frey and Detterman (2004) reported a correlation of 0.82 between g (the general intelligence factor) and SAT scores,[52] and another study found a correlation of 0.81 between g and GCSE scores. Correlations between IQ scores (general cognitive ability) and achievement test scores are reported to be 0.81 by Deary and colleagues, with the explained variance ranging "from 58.6% in Mathematics and 48% in English to 18.1% in Art and Design".
School performance
The American Psychological Association's report "Intelligence: Knowns and Unknowns" states that, wherever it has been studied, children with high scores on tests of intelligence tend to learn more of what is taught in school than their lower-scoring peers. The correlation between IQ scores and grades is about .50, which means that the explained variance is 25%. Achieving good grades depends on many factors other than IQ, such as "persistence, interest in school, and willingness to study" (p. 81).[25] It has also been found that the correlation of IQ with school performance depends on the IQ measurement used. For undergraduate students, Verbal IQ as measured by the WAIS-R has been found to correlate significantly (0.53) with the GPA of the last 60 hours, whereas the correlation of Performance IQ with the same GPA was only 0.22 in the same study.[53]
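The report's variance arithmetic is simply the square of the correlation coefficient, as a one-line sketch shows:

```python
def explained_variance(r: float) -> float:
    """Proportion of variance explained by a correlation r (r squared)."""
    return r ** 2

print(explained_variance(0.50))  # 0.25 -> the 25% quoted above
print(explained_variance(0.53))  # ~0.28 for the Verbal IQ-GPA correlation
```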
Job performance
According to Schmidt and Hunter, "for hiring employees without previous experience in the job the most valid predictor of future performance is general mental ability." The validity of IQ as a predictor of job performance is above zero for all work studied to date, but varies with the type of job and across different studies, ranging from 0.2 to 0.6. The correlations were higher when the unreliability of measurement methods was controlled for.[25] While IQ is more strongly correlated with reasoning and less so with motor function,[54] IQ-test scores predict performance ratings in all occupations. That said, for highly qualified activities (research, management) low IQ scores are more likely to be a barrier to adequate performance, whereas for minimally skilled activities, athletic attributes (manual strength, speed, stamina, and coordination) are more likely to influence performance. It is largely through the quicker acquisition of job-relevant knowledge that higher IQ mediates job performance. In establishing a causal direction for the link between IQ and work performance, longitudinal studies by Watkins and others suggest that IQ exerts a causal influence on future academic achievement, whereas academic achievement does not substantially influence future IQ scores.[55] Treena Eileen Rohde and Lee Anne Thompson write that general cognitive ability, but not specific ability scores, predicts academic achievement, with the exception that processing speed and spatial ability predict performance on the SAT math beyond the effect of general cognitive ability.[56] The US military has minimum enlistment standards at about the IQ 85 level. There have been two experiments with lowering this to 80, but in both cases these men could not master soldiering well enough to justify their costs.[57] Some US police departments have set a maximum IQ score for new officers (for example, 125 in New London, CT), under the argument that those with overly high IQs will become bored and exhibit high turnover in the job. This policy has been challenged as discriminatory, but upheld by at least one US district court.[58] The American Psychological Association's report "Intelligence: Knowns and Unknowns" states that since the explained variance is 29%, other individual characteristics, such as interpersonal skills and aspects of personality, are probably of equal or greater importance, but at this point there are no equally reliable instruments to measure them.[25]
Income
While it has been suggested that "in economic terms it appears that the IQ score measures something with decreasing marginal value. It is important to have enough of it, but having lots and lots does not buy you that much",[59][60] large-scale longitudinal studies indicate that an increase in IQ translates into an increase in performance at all levels of IQ: i.e., ability and job performance are monotonically linked at all IQ levels.[61] Charles Murray, coauthor of The Bell Curve, found that IQ has a substantial effect on income independently of family background.[62] The link from IQ to wealth is much less strong than that from IQ to job performance. Some studies indicate that IQ is unrelated to net worth.[63][64]
The American Psychological Association's 1995 report Intelligence: Knowns and Unknowns stated that IQ scores accounted for (explained the variance of) about a quarter of the social status variance and one-sixth of the income variance. Statistical controls for parental SES eliminate about a quarter of this predictive power. Psychometric intelligence appears as only one of a great many factors that influence social outcomes.[25] Some studies claim that IQ only accounts for a sixth of the variation in income because many studies are based on young adults, many of whom have not yet reached their peak earning capacity, or even their full education. On p. 568 of The g Factor, Arthur Jensen claims that although the correlation between IQ and income averages a moderate 0.4 (one-sixth, or 16%, of the variance), the relationship increases with age and peaks at middle age, when people have reached their maximum career potential. In the book A Question of Intelligence, Daniel Seligman cites an IQ-income correlation of 0.5 (25% of the variance). A 2002 study[65] further examined the impact of non-IQ factors on income and concluded that an individual's location, inherited wealth, race, and schooling are more important as factors in determining income than IQ.
IQ and crime
The American Psychological Association's 1995 report Intelligence: Knowns and Unknowns stated that the correlation between IQ and crime was -0.2. It was -0.19 between IQ scores and the number of juvenile offenses in a large Danish sample; with social class controlled, the correlation dropped to -0.17. A correlation of 0.20 means that the explained variance is less than 4%. The causal links between psychometric ability and social outcomes may be indirect: children with poor scholastic performance may feel alienated and consequently may be more likely to engage in delinquent behavior, compared to other children who do well.[25] In his book The g Factor (1998), Arthur Jensen cited data which showed that, regardless of race, people with IQs between 70 and 90 have higher crime rates than people with IQs below or above this range, with the peak range being between 80 and 90. The 2009 Handbook of Crime Correlates stated that reviews have found that around eight IQ points, or 0.5 SD, separate criminals from the general population, especially for persistent serious offenders. It has been suggested that this simply reflects that "only dumb ones get caught," but there is similarly a negative relation between IQ and self-reported offending. That children with conduct disorder have lower IQ than their peers "strongly argues" for the theory.[66] A study of the relationship between US county-level IQ and US county-level crime rates found that higher average IQs were associated with lower rates of property crime, burglary, larceny, motor vehicle theft, violent crime, robbery, and aggravated assault. These results were not "confounded by a measure of concentrated disadvantage that captures the effects of race, poverty, and other social disadvantages of the county."[67]
A recent US study connecting political views and intelligence found that the mean adolescent intelligence of young adults who identify themselves as "very liberal" is 106.4, while that of those who identify themselves as "very conservative" is 94.8.[71] Two other studies conducted in the UK reached similar conclusions.[72][73] There are also other correlations, such as those between religiosity and intelligence and between fertility and intelligence.
Real-life accomplishments
Average adult combined IQs associated with real-life accomplishments by various tests:[74][75]
MDs, JDs, or PhDs: 125+
College graduates: 112-115
1-3 years of college: 104-110
Clerical and sales workers: 97-105
1-3 years of high school (completed 9-11 years of school): 90-95
Semi-skilled workers (e.g. truck drivers, factory workers): 90-95
Elementary school graduates (completed eighth grade): 90
Elementary school dropouts (completed 0-7 years of school): 80-85
Have a 50/50 chance of reaching high school: 75
(Values combine the WAIS-R (1987), KAIT (2000) and K-BIT (1992) studies; where the tests disagree, a range is given.)
Average IQs by occupational group:[76]
Semi-skilled workers (operatives, service workers, including private household): 92
Unskilled workers: 87
Adults can harvest vegetables, repair furniture: 60
Adults can do domestic work: 50
There is considerable variation within and overlap between these categories. People with high IQs are found at all levels of education and occupational categories. The biggest difference occurs for low IQs with only an occasional college graduate or professional scoring below 90.[7]
Group differences
Among the most controversial issues related to the study of intelligence is the observation that intelligence measures such as IQ scores vary between ethnic and racial groups and sexes. While there is little scholarly debate about the existence of some of these differences, their causes remain highly controversial both within academia and in the public sphere.
Sex
Most IQ tests are constructed so that there are no overall score differences between females and males. Because environmental factors affect brain activity and behavior, where differences are found, it can be difficult for researchers to assess whether or not the differences are innate. Areas where differences have been found include verbal and mathematical ability.
Race
The 1996 Task Force investigation on Intelligence sponsored by the American Psychological Association concluded that there are significant variations in IQ across races.[25] The problem of determining the causes underlying this variation relates to the question of the contributions of "nature and nurture" to IQ. Psychologists such as Alan S. Kaufman[77] and Nathan Brody[78] and statisticians such as Bernie Devlin[79] argue that there are insufficient data to conclude that this is because of genetic influences. One of the most notable researchers arguing for a strong genetic influence on these average score differences is Arthur Jensen. In contrast, other researchers such as Richard Nisbett argue that environmental factors can explain all of the average group differences.[80]
Public policy
In the United States, certain public policies and laws regarding military service,[81][82] education, public benefits,[83] capital punishment,[84] and employment incorporate an individual's IQ into their decisions. However, in the case of Griggs v. Duke Power Co. in 1971, for the purpose of minimizing employment practices that disparately impacted racial minorities, the U.S. Supreme Court banned the use of IQ tests in employment, except when linked to job performance via a job analysis. Internationally, certain public policies, such as improving nutrition and prohibiting neurotoxins, have as one of their goals raising, or preventing a decline in, intelligence. A diagnosis of mental retardation is in part based on the results of IQ testing. Borderline intellectual functioning is a categorization where a person has below-average cognitive ability (an IQ of 71-85), but the deficit is not as severe as mental retardation (70 or below). In the United Kingdom, the eleven-plus exam, which incorporated an intelligence test, has been used from 1945 to decide, at eleven years old, which type of school a child should go to. The exam has been much less used since the widespread introduction of comprehensive schools.
Criticism of g
Some scientists dispute IQ entirely. In The Mismeasure of Man (1996), paleontologist Stephen Jay Gould criticized IQ tests and argued that they were used for scientific racism. He argued that g was a mathematical artifact and criticized: ...the abstraction of intelligence as a single entity, its location within the brain, its quantification as one number for each individual, and the use of these numbers to rank people in a single series of worthiness, invariably to find that oppressed and disadvantaged groups (races, classes, or sexes) are innately inferior and deserve their status (pp. 24-25). Psychologist Peter Schönemann was also a persistent critic of IQ, calling it "the IQ myth". He argued that g is a flawed theory and that the high heritability estimates of IQ are based on false assumptions.[86] Psychologist Arthur Jensen has rejected the criticism by Gould and also argued that even if g were replaced by a model with several intelligences, this would change the situation less than expected: all tests of cognitive ability would continue to be highly correlated with one another, and there would still be a black-white gap on cognitive tests.[2]
Test bias
The American Psychological Association's report Intelligence: Knowns and Unknowns stated that in the United States IQ tests as predictors of social achievement are not biased against African Americans, since they predict future performance, such as school achievement, similarly to the way they predict future performance for Caucasians.[25] However, IQ tests may well be biased when used in other situations. A 2005 study stated that "differential validity in prediction suggests that the WAIS-R test may contain cultural influences that reduce the validity of the WAIS-R as a measure of cognitive ability for Mexican American students,"[87] indicating a weaker positive correlation relative to sampled white students. Other recent studies have questioned the culture-fairness of IQ tests when used in South Africa.[88][89] Standard intelligence tests, such as the Stanford-Binet, are often inappropriate for children with autism; the alternative of using developmental or adaptive skills measures is relatively poor at measuring intelligence in autistic children, and may have resulted in incorrect claims that a majority of children with autism are mentally retarded.[90]
Outdated methodology
A 2006 article stated that contemporary psychological research often did not reflect substantial recent developments in psychometrics and "bears an uncanny resemblance to the psychometric state of the art as it existed in the 1950s."[91]
Dynamic assessment
A notable and increasingly influential[93][94] alternative to the wide range of standard IQ tests originated in the writings of psychologist Lev Vygotsky (1896-1934) during his most mature and highly productive period, 1932-1934. The notion of the zone of proximal development that he introduced in 1933, roughly a year before his death, served as the banner for his proposal to diagnose development as both the level of actual development, which can be measured by the child's independent problem solving, and the level of proximal, or potential, development, which is measured in situations of moderately assisted problem solving.[95] The maximum level of complexity and difficulty of problem that the child is capable of solving under some guidance indicates the level of potential development. The difference between the higher level of potential development and the lower level of actual development indicates the zone of proximal development. The combination of the two indexes (the level of actual development and the zone of proximal development), according to Vygotsky, provides a significantly more informative indicator of psychological development than the assessment of the level of actual development alone.[96][97] The ideas on the zone of development were later developed in a number of psychological and educational theories and practices, most notably under the banner of dynamic assessment, which focuses on the testing of learning and developmental potential[98][99][100] (for instance, in the work of Reuven Feuerstein and his associates,[101] who has criticized standard IQ testing for its putative assumption or acceptance of "fixed and
immutable" characteristics of intelligence or cognitive functioning). Grounded in the developmental theories of Vygotsky and Feuerstein, who recognized that human beings are not static entities but are always in states of transition and transactional relationships with the world, dynamic assessment has also received considerable support in the recent revisions of cognitive developmental theory by Joseph Campione, Ann Brown, and John D. Bransford and in the theories of multiple intelligences of Howard Gardner and Robert Sternberg.[102]
High IQ societies
There are social organizations, some international, which limit membership to people who have scores as high as or higher than the 98th percentile on some IQ test or equivalent. Mensa International is perhaps the best known of these; other groups set their cutoffs above the 98th percentile.
Reference charts
IQ reference charts are tables suggested by test publishers to divide intelligence ranges into various categories.
References
Notes
[1] http://icd9cm.chrisendres.com/index.php?srchtype=procs&srchtext=94.01&Submit=Search&action=search
[2] http://www.nlm.nih.gov/medlineplus/ency/article/001912.htm
[4] Intelligence: Knowns and Unknowns (http://www.gifted.uconn.edu/siegle/research/Correlation/Intelligence.pdf). Report of a Task Force established by the Board of Scientific Affairs of the American Psychological Association, released August 7, 1995; a slightly edited version was published in American Psychologist.
[7] Kaufman, Alan S. (2009). IQ Testing 101. Springer Publishing Company. ISBN 0-8261-0629-3, ISBN 978-0-8261-0629-2.
[9] Larson, Edward J. (1995). Sex, Race, and Science: Eugenics in the Deep South. Baltimore: Johns Hopkins University Press. p. 74.
[20] Embretson, S.E. & Reise, S.P. (2000). Item Response Theory for Psychologists. "...for many other psychological tests, normal distributions are achieved by normalizing procedures. For example, intelligence tests..." Found at: http://books.google.se/books?id=rYU7rsi53gQC&pg=PA29
[28] International Journal of Epidemiology, Volume 35, Issue 3, June 2006. See the reprint of Lewontin's 1974 article "The analysis of variance and the analysis of causes" and the 2006 commentaries: http://ije.oxfordjournals.org/content/35/3.toc
[31] http://www.wjh.harvard.edu/~cfc/Chabris2012a-FalsePositivesGenesIQ.pdf
[36] C. F. Chabris, B. M. Hebert, D. J. Benjamin, J. P. Beauchamp, D. Cesarini, M. J. H. M. van der Loos, M. Johannesson, P. K. E. Magnusson, P. Lichtenstein, C. S. Atwood, J. Freese, T. S. Hauser, R. M. Hauser, N. A. Christakis and D. I. Laibson (2012). Most reported genetic associations with general intelligence are probably false positives. Psychological Science.
[37] B. Benyamin, B. Pourcain, O. S. Davis, G. Davies, N. K. Hansell, M. J. Brion, R. M. Kirkpatrick, R. A. Cents, S. Franic, M. B. Miller, C. M. Haworth, E. Meaburn, T. S. Price, D. M. Evans, N. Timpson, J. Kemp, S. Ring, W. McArdle, S. E. Medland, J. Yang, S. E. Harris, D. C. Liewald, P. Scheet, X. Xiao, J. J. Hudziak, E. J. de Geus, the Wellcome Trust Case Control Consortium, V. W. Jaddoe, J. M. Starr, F. C. Verhulst, C. Pennell, H. Tiemeier, W. G. Iacono, L. J. Palmer, G. W. Montgomery, N. G. Martin, D. I. Boomsma, D. Posthuma, M. McGue, M. J. Wright, G. Davey Smith, I. J. Deary, R. Plomin and P. M. Visscher (2013). Childhood intelligence is heritable, highly polygenic and associated with FNBP1L. Molecular Psychiatry.
[38] E. M. Tucker-Drob, M. Rhemtulla, K. P. Harden, E. Turkheimer and D. Fask (2011). Emergence of a Gene x Socioeconomic Status Interaction on Infant Mental Ability Between 10 Months and 2 Years. Psychological Science, 22, 125-33. http://dx.doi.org/10.1177/0956797610392926
[40] K. P. Harden, E. Turkheimer and J. C. Loehlin (2005). Genotype by environment interaction in adolescents' cognitive ability. Behavior Genetics, 35, 804.
[48] C. Stough, B. Kerkin, T. C. Bates and G. Mangan (1994). Music and spatial IQ. Personality & Individual Differences, 17, 695.
[49] C. F. Chabris (1999). Prelude or requiem for the 'Mozart effect'? Nature, 400, 826-827; author reply 827-828.
[57] Gottfredson, L. S. (2006).
Social consequences of group differences in cognitive ability (Consequências sociais das diferenças de grupo em habilidade cognitiva). In C. E. Flores-Mendoza & R. Colom (Eds.), Introdução à psicologia das diferenças individuais (pp. 433-456). Porto Alegre, Brazil: ArtMed Publishers.
[58] ABC News, "Court OKs Barring High IQs for Cops", http://abcnews.go.com/US/story?id=95836
[59] Detterman and Daniel, 1989.
[64] http://www.sciencedaily.com/releases/2007/04/070424204519.htm
[66] Ellis, Lee; Beaver, Kevin M.; Wright, John (2009). Handbook of Crime Correlates. Academic Press.
[70] Rowe, D. C., W. J. Vesterdal, and J. L. Rodgers, "The Bell Curve Revisited: How Genes and Shared Environment Mediate IQ-SES Associations," University of Arizona, 1997.
[74] Kaufman 2009, p. 126.
[76] Kaufman 2009, p. 132.
[85] The Waning of I.Q. (http://select.nytimes.com/2007/09/14/opinion/14brooks.html) by David Brooks, The New York Times.
[86] Psychometrics of Intelligence. In K. Kemp-Leonard (ed.), Encyclopedia of Social Measurement, 3, 193-201. http://www2.psych.purdue.edu/~phs/pdf/89.pdf
[93] Mindes, G. (2003). Assessing Young Children (http://books.google.ca/books?id=x41LAAAAYAAJ&q=dynamic+assessment+popularity#search_anchor). Merrill/Prentice Hall, p. 158.
[94] Haywood, H. Carl & Lidz, Carol Schneider (2006). Dynamic Assessment in Practice: Clinical and Educational Applications (http://books.google.ca/books?id=xQekS_oqGzoC). Cambridge University Press, p. 1.
[95] Vygotsky, L.S. (1932-34/1997). The Problem of Age (http://www.marxists.org/archive/vygotsky/works/1934/problem-age.htm). In The Collected Works of L. S. Vygotsky, Volume 5, 1998, pp. 187-205.
[96] Chaiklin, S. (2003). "The Zone of Proximal Development in Vygotsky's analysis of learning and instruction." In Kozulin, A., Gindis, B., Ageyev, V. & Miller, S. (Eds.), Vygotsky's Educational Theory and Practice in Cultural Context, pp. 39-64. Cambridge: Cambridge University Press.
[97] Zaretskii, V.K. (2009). The Zone of Proximal Development: What Vygotsky Did Not Have Time to Write. Journal of Russian and East European Psychology, vol. 47, no. 6, November-December 2009, pp. 70-93.
[98] Sternberg, R.S. & Grigorenko, E.L. (2001). All testing is dynamic testing. Issues in Education, 7(2), 137-170.
[99] Sternberg, R.J. & Grigorenko, E.L. (2002). Dynamic Testing: The Nature and Measurement of Learning Potential. Cambridge (UK): University of Cambridge.
[100] Haywood, C.H. & Lidz, C.S. (2007). Dynamic Assessment in Practice: Clinical and Educational Applications. New York: Cambridge University Press.
[101] Feuerstein, R., Feuerstein, S., Falik, L. & Rand, Y. (1979; 2002). Dynamic Assessments of Cognitive Modifiability. Jerusalem, Israel: ICELP Press.
[102] Dodge, Kenneth A. Foreword, pp. xiii-xv. In Haywood, H. Carl & Lidz, Carol Schneider (2006), Dynamic Assessment in Practice: Clinical and Educational Applications. Cambridge University Press.
Further reading
Carroll, J.B. (1993). Human Cognitive Abilities: A Survey of Factor-Analytic Studies. New York: Cambridge University Press. ISBN 0-521-38275-0.
Lahn, Bruce T.; Ebenstein, Lanny (2009). "Let's celebrate human genetic diversity". Nature 461 (7265): 726-8. doi:10.1038/461726a. PMID 19812654.
Coward, W. Mark; Sackett, Paul R. (1990). "Linearity of ability-performance relationships: A reconfirmation". Journal of Applied Psychology 75 (3): 297-300. doi:10.1037/0021-9010.75.3.297.
Duncan, J.; Seitz, R.J.; Kolodny, J.; Bor, D.; Herzog, H.; Ahmed, A.; Newell, F.N.; Emslie, H. (2000). "A Neural Basis for General Intelligence". Science 289 (5478): 457-60. doi:10.1126/science.289.5478.457. PMID 10903207.
Duncan, John; Burgess, Paul; Emslie, Hazel (1995). "Fluid intelligence after frontal lobe lesions". Neuropsychologia 33 (3): 261-8. doi:10.1016/0028-3932(94)00124-8. PMID 7791994.
Flynn, James R. (1999). "Searching for justice: The discovery of IQ gains over time" (http://www.stat.columbia.edu/~gelman/stuff_for_blog/flynn.pdf). American Psychologist 54 (1): 5-20. doi:10.1037/0003-066X.54.1.5.
Frey, Meredith C.; Detterman, Douglas K. (2004). "Scholastic Assessment or g?". Psychological Science 15 (6): 373-8. doi:10.1111/j.0956-7976.2004.00687.x. PMID 15147489.
Gale, C. R.; Deary, I. J.; Schoon, I.; Batty, G. D. (2006). "IQ in childhood and vegetarianism in adulthood: 1970 British cohort study" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1790759). BMJ 334 (7587): 245. doi:10.1136/bmj.39030.675069.55. PMC 1790759. PMID 17175567.
Gottfredson, L. (1997). "Why g matters: The complexity of everyday life" (http://www.udel.edu/educ/gottfredson/reprints/1997whygmatters.pdf). Intelligence 24 (1): 79-132. doi:10.1016/S0160-2896(97)90014-3.
Gottfredson, Linda S. (1998). "The general intelligence factor" (http://www.udel.edu/educ/gottfredson/reprints/1998generalintelligencefactor.pdf) (PDF). Scientific American Presents 9 (4): 24-29.
Gottfredson, L.S. (2005). "Suppressing intelligence research: Hurting those we intend to help" (http://www.udel.edu/educ/gottfredson/reprints/2005suppressingintelligence.pdf) (PDF). In Wright, R.H. and Cummings, N.A. (Eds.), Destructive Trends in Mental Health: The Well-Intentioned Path to Harm. New York: Taylor and Francis. pp. 155-186. ISBN 0-415-95086-4.
Gottfredson, L.S. (2006). "Social consequences of group differences in cognitive ability (Consequências sociais das diferenças de grupo em habilidade cognitiva)" (http://www.udel.edu/educ/gottfredson/reprints/2004socialconsequences.pdf) (PDF). In Flores-Mendoza, C.E. and Colom, R. (Eds.), Introdução à psicologia das diferenças individuais. Porto Alegre, Brazil: ArtMed Publishers. pp. 155-186. ISBN 85-363-0621-1.
Gould, S.J. (1996). The Mismeasure of Man: Revised and Expanded Edition. New York: W. W. Norton & Co. ISBN 0-14-025824-8.
Gray, Jeremy R.; Chabris, Christopher F.; Braver, Todd S. (2003). "Neural mechanisms of general fluid intelligence". Nature Neuroscience 6 (3): 316-22. doi:10.1038/nn1014. PMID 12592404.
Gray, Jeremy R.; Thompson, Paul M. (2004). "Neurobiology of intelligence: science and ethics". Nature Reviews Neuroscience 5 (6): 471-82. doi:10.1038/nrn1405. PMID 15152197.
Haier, R.; Jung, R.; Yeo, R.; Head, K.; Alkire, M. (2005). "The neuroanatomy of general intelligence: sex matters". NeuroImage 25 (1): 320-7. doi:10.1016/j.neuroimage.2004.11.019. PMID 15734366.
Harris, J.R. (1998). The Nurture Assumption: Why Children Turn Out the Way They Do. New York (NY): Free Press. ISBN 0-684-84409-5.
Hunt, Earl (2001). "Multiple Views of Multiple Intelligence". PsycCRITIQUES 46 (1): 5-7. doi:10.1037/002513.
Jensen, A.R. (1979). Bias in Mental Testing. New York (NY): Free Press. ISBN 0-02-916430-3.
Jensen, A.R. (1998). The g Factor: The Science of Mental Ability. Westport (CT): Praeger Publishers. ISBN 0-275-96103-6.
Jensen, A.R. (2006). Clocking the Mind: Mental Chronometry and Individual Differences. Elsevier. ISBN 0-08-044939-5.
Kaufman, Alan S. (2009). IQ Testing 101. New York (NY): Springer Publishing. ISBN 978-0-8261-0629-2.
Klingberg, Torkel; Forssberg, Hans; Westerberg, Helena (2002). "Training of Working Memory in Children With ADHD". Journal of Clinical and Experimental Neuropsychology 24 (6): 781-91. doi:10.1076/jcen.24.6.781.8395. PMID 12424652.
McClearn, G. E.; Johansson, B.; Berg, S.; Pedersen, N.L.; Ahern, F.; Petrill, S.A.; Plomin, R. (1997). "Substantial Genetic Influence on Cognitive Abilities in Twins 80 or More Years Old". Science 276 (5318): 1560-3. doi:10.1126/science.276.5318.1560. PMID 9171059.
Mingroni, M. (2004). "The secular rise in IQ: Giving heterosis a closer look". Intelligence 32 (1): 65-83. doi:10.1016/S0160-2896(03)00058-8.
Murray, C. (1998). Income Inequality and IQ (http://www.aei.org/docLib/20040302_book443.pdf) (PDF). Washington (DC): AEI Press. ISBN 0-8447-7094-9.
Noguera, P.A. (2001). "Racial politics and the elusive quest for excellence and equity in education" (http://www.inmotionmagazine.com/er/pnrp1.html). Motion Magazine. Article # ER010930002.
Plomin, R.; DeFries, J.C.; Craig, I.W.; McGuffin, P. (2003). Behavioral Genetics in the Postgenomic Era. Washington (DC): American Psychological Association. ISBN 1-55798-926-5.
Plomin, R.; DeFries, J.C.; McClearn, G.E.; McGuffin, P. (2000). Behavioral Genetics (4th ed.). New York (NY): Worth Publishers. ISBN 0-7167-5159-3.
Rowe, D.C.; Vesterdal, W.J.; Rodgers, J.L. (1997). The Bell Curve Revisited: How Genes and Shared Environment Mediate IQ-SES Associations.
Schoenemann, P. Thomas; Sheehan, Michael J.; Glotzer, L. Daniel (2005). "Prefrontal white matter volume is disproportionately larger in humans than in other primates". Nature Neuroscience 8 (2): 242-52. doi:10.1038/nn1394. PMID 15665874.
Shaw, P.; Greenstein, D.; Lerch, J.; Clasen, L.; Lenroot, R.; Gogtay, N.; Evans, A.; Rapoport, J. et al. (2006). "Intellectual ability and cortical development in children and adolescents". Nature 440 (7084): 676-9. doi:10.1038/nature04513. PMID 16572172.
Tambs, Kristian; Sundet, Jon Martin; Magnus, Per; Berg, Kåre (1989). "Genetic and environmental contributions to the covariance between occupational status, educational attainment, and IQ: A study of twins". Behavior Genetics 19 (2): 209-22. doi:10.1007/BF01065905. PMID 2719624.
Thompson, Paul M.; Cannon, Tyrone D.; Narr, Katherine L.; Van Erp, Theo; Poutanen, Veli-Pekka; Huttunen, Matti; Lönnqvist, Jouko; Standertskjöld-Nordenstam, Carl-Gustaf et al. (2001). "Genetic influences on brain structure". Nature Neuroscience 4 (12): 1253-8. doi:10.1038/nn758. PMID 11694885.
Wechsler, D. (1997). Wechsler Adult Intelligence Scale (3rd ed.). San Antonio (TX): The Psychological Corporation.
Wechsler, D. (2003). Wechsler Intelligence Scale for Children (4th ed.). San Antonio (TX): The Psychological Corporation.
Weiss, Volkmar (2009). "National IQ means transformed from Programme for International Student Assessment (PISA) scores" (http://mpra.ub.uni-muenchen.de/14600/). The Journal of Social, Political and Economic Studies 31 (1): 71-94.
External links
Human Intelligence: biographical profiles, current controversies, resources for teachers (http://www.intelltheory.com/)
Classics in the History of Psychology (http://psychclassics.yorku.ca/)
Internal consistency
In statistics and research, internal consistency is typically a measure based on the correlations between different items on the same test (or the same subscale on a larger test). It measures whether several items that purport to measure the same general construct produce similar scores. For example, if a respondent expressed agreement with the statements "I like to ride bicycles" and "I've enjoyed riding bicycles in the past", and disagreement with the statement "I hate bicycles", this would be indicative of good internal consistency of the test.
Cronbach's alpha
Internal consistency is usually measured with Cronbach's alpha, a statistic calculated from the pairwise correlations between items. Internal consistency ranges between zero and one. A commonly accepted rule of thumb for describing internal consistency is as follows:[1]
Cronbach's alpha and internal consistency:
α ≥ 0.9: Excellent
0.9 > α ≥ 0.8: Good
0.8 > α ≥ 0.7: Acceptable
0.7 > α ≥ 0.6: Questionable
0.6 > α ≥ 0.5: Poor
0.5 > α: Unacceptable
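A small sketch encoding the rule of thumb above (Python; the cutoffs are exactly those in the table, which the cited source offers as a heuristic rather than a standard):

```python
def alpha_label(alpha: float) -> str:
    """Map Cronbach's alpha to the George & Mallery rule-of-thumb label."""
    for cutoff, label in [(0.9, "Excellent"), (0.8, "Good"),
                          (0.7, "Acceptable"), (0.6, "Questionable"),
                          (0.5, "Poor")]:
        if alpha >= cutoff:
            return label
    return "Unacceptable"

print(alpha_label(0.83))  # Good
```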
Very high reliabilities (0.95 or higher) are not necessarily desirable, as they indicate that the items may be entirely redundant.[2] The goal in designing a reliable instrument is for scores on similar items to be related (internally consistent), but for each to contribute some unique information as well. An alternative way of thinking about internal consistency is that it is the extent to which all of the items of a test measure the same latent variable. The advantage of this perspective over the notion of a high average correlation among the items of a test (the perspective underlying Cronbach's alpha) is that the average item correlation is affected by skewness in the distribution of item correlations, just as any other average is. Thus, whereas the modal item correlation is zero when the items of a test measure several unrelated latent variables, the average item correlation in such cases will be greater than zero. Thus, whereas the ideal of measurement is for all items of a test to measure the same latent variable, alpha has been demonstrated many times to attain quite high values even when the set of items measures several unrelated latent variables.[3][4][5][6][7][8] The hierarchical coefficient omega may be a more appropriate index of the extent to which all of the items in a test measure the same latent variable.[9][10] Several different measures of internal consistency are reviewed by Revelle & Zinbarg (2009).[11]
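A minimal numpy sketch of Cronbach's alpha as a function of a respondents-by-items score matrix (the data values are invented for illustration, loosely following the bicycle example above):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance of total score)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Five respondents rating three bicycle-attitude items on a 1-5 scale
# (the "I hate bicycles" item assumed already reverse-scored).
data = np.array([[5, 4, 5],
                 [4, 4, 4],
                 [2, 3, 2],
                 [5, 5, 4],
                 [3, 3, 3]])
print(round(cronbach_alpha(data), 2))  # 0.93 -> "Excellent" by the table above
```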
References
[1] George, D., & Mallery, P. (2003). SPSS for Windows Step by Step: A Simple Guide and Reference, 11.0 update (4th ed.). Boston: Allyn & Bacon.
[2] Streiner, D. L. (2003). Starting at the beginning: an introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80, 99-103.
[3] Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.
[4] Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
[5] Green, S. B., Lissitz, R.W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37, 827-838.
[6] Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14, 57-74.
[7] Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350-353.
[8] Zinbarg, R., Yovel, I., Revelle, W. & McDonald, R. (2006). Estimating generalizability to a universe of indicators that all have an attribute in common: A comparison of estimators for ωh. Applied Psychological Measurement, 30, 121-144.
[9] McDonald, R. P. (1999). Test Theory: A Unified Treatment. Psychology Press. ISBN 0-8058-3075-8.
[10] Zinbarg, R., Revelle, W., Yovel, I. & Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70, 123-133.
[11] Revelle, W., Zinbarg, R. (2009). "Coefficients Alpha, Beta, Omega, and the glb: Comments on Sijtsma". Psychometrika, 74(1), 145-154. (http://dx.doi.org/10.1007/s11336-008-9102-z)
External links
http://www.wilderdom.com/personality/L3-2EssentialsGoodPsychologicalTest.html
Intra-rater reliability
In statistics, intra-rater reliability is the degree of agreement among multiple repetitions of a diagnostic test performed by a single rater.[1][2]
References
[1] Stroke Engine glossary (McGill Faculty of Medicine) (http://www.medicine.mcgill.ca/strokengine-assess/definitions-en.html)
[2] Outcomes database glossary (http://www.outcomesdatabase.org/show/category/id/8)
IPPQ
The iOpener People and Performance Questionnaire (iPPQ) is a psychometric tool designed to assess workplace happiness and wellbeing. It is designed and administered by iOpener Ltd, a management consultancy firm based in Oxford, UK.
Happiness at work
Despite a large body of positive psychological research into the relationship between happiness and productivity,[1][2][3] and the development of corporate psychometric tools to assess factors such as personality profile and feedback (e.g., 360 feedback), the two fields of study have never previously been combined to produce a psychometric tool specifically designed to measure happiness in the workplace. The iPPQ is the first and only example of this type of tool to date.
Research
The tool was created following the development of a model of workplace happiness[4] and research into the relationships between employee happiness, overtime, sick leave and intention to stay or leave,[5] conducted by Dr Laurel Edmunds and Jessica Pryce-Jones. In addition to the academic articles cited above, iOpener's research into happiness at work has received widespread press coverage from publications including The Sunday Times,[6] Jobsite,[7] Legal Week[8] and Construction Today.[9]
References
[1] Carr, A. (2004). Positive Psychology: The Science of Happiness and Human Strengths. Hove: Brunner-Routledge.
[2] Isen, A. (2000). Positive affect and decision-making. In M. Lewis & J. Haviland-Jones (Eds.), Handbook of Emotions (2nd ed., pp. 417–436). New York: Guilford Press.
[3] Buss, D. (2000). The evolution of happiness. American Psychologist, 55, 15–23.
[4] Dutton, V. M., & Edmunds, L. D. (2007). A model of workplace happiness. Selection & Development Review, 23(1).
[5] Relationships between employee happiness, overtime, sick leave and intention to stay or leave. Selection & Development Review, 24(2), 2008. (http://www.iopener.co.uk/wsc_content/download/sdr2008paper.pdf)
[6] Make sure people are happy in their job. The Sunday Times, 25/06/08. (http://business.timesonline.co.uk/tol/business/career_and_jobs/recruiter_forum/article3998244.ece)
[7] How to be happy at work. Jobsite, 02/04/2009. (http://www.jobsite.co.uk/cgi-bin/bulletin_search.cgi?act=da&aid=1782)
[8] The pursuit of happiness. Legal Week, 13/11/2008. (http://www.legalweek.com/Articles/1180002/The+pursuit+of+happiness.html)
[9] Increasing employee morale. Construction Today, 15/10/2008. (http://www.ct-europe.com/article-page.php?contentid=6290&issueid=218)
External links
iOpener homepage (http://www.iopener.com/) Take the iPPQ online for free (http://www.smart-survey.co.uk/v.asp?i=5427fbrin)
Item bank
An item bank is a repository of the test items that belong to a testing program, together with all information pertaining to those items. In most applications of testing and assessment, the items are in multiple choice format, but any format can be used. Items are pulled from the bank and assigned to test forms for publication either as a paper-and-pencil test or some form of e-assessment.
Types of information
An item bank will not only include the text of each item, but also extensive information regarding the test development and psychometric characteristics of the items. Examples of such information include:[1]
- Item author
- Date written
- Item status (e.g., new, pilot, active, retired)
- Angoff ratings
- Correct answer
- Item format
- Classical test theory statistics
- Item response theory statistics
- User-defined fields

A sketch of how such a record might be structured follows the list.
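As a hypothetical illustration only (the field names below are assumptions, not drawn from the cited handbook), an item bank record might be modeled as a simple data structure:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BankedItem:
    item_id: str
    text: str
    correct_answer: str
    author: str
    date_written: str
    status: str = "new"                  # e.g., new, pilot, active, retired
    item_format: str = "multiple choice"
    angoff_rating: Optional[float] = None
    ctt_stats: dict = field(default_factory=dict)    # classical test theory statistics
    irt_stats: dict = field(default_factory=dict)    # item response theory statistics
    user_fields: dict = field(default_factory=dict)  # user-defined fields

item = BankedItem(item_id="ALG-0042", text="What is the square root of 121?",
                  correct_answer="11", author="J. Doe", date_written="2013-04-19")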
References
[1] Vale, C.D. (2004). Computerized item banking. In Downing, S.D., & Haladyna, T.M. (Eds.) The Handbook of Test Development. Routledge.
Item response theory
Overview
The concept of the item response function was around before 1950. The pioneering work of IRT as a theory occurred during the 1950s and 1960s. Three of the pioneers were the Educational Testing Service psychometrician Frederic M. Lord,[2] the Danish mathematician Georg Rasch, and Austrian sociologist Paul Lazarsfeld, who pursued parallel research independently. Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich. IRT did not become widely used until the late 1970s and 1980s, when personal computers gave many researchers access to the computing power necessary for IRT. Among other things, the purpose of IRT is to provide a framework for evaluating how well assessments work, and how well individual items on assessments work. The most common application of IRT is in education, where psychometricians use it for developing and refining exams, maintaining banks of items for exams, and equating for the difficulties of successive versions of exams (for example, to allow comparisons between results over time).[3] IRT models are often referred to as latent trait models. The term latent is used to emphasize that discrete item responses are taken to be observable manifestations of hypothesized traits, constructs, or attributes, not directly observed, but which must be inferred from the manifest responses. Latent trait models were developed in the field of sociology, but are virtually identical to IRT models. IRT is generally regarded as an improvement over classical test theory (CTT). For tasks that can be accomplished using CTT, IRT generally brings greater flexibility and provides more sophisticated information. Some applications, such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical
test theory. Another advantage of IRT over CTT is that the more sophisticated information IRT provides allows a researcher to improve the reliability of an assessment.

IRT entails three assumptions:
1. A unidimensional trait denoted by θ;
2. Local independence of items;
3. The response of a person to an item can be modeled by a mathematical item response function (IRF).

The trait is further assumed to be measurable on a scale (the mere existence of a test assumes this), typically set to a standard scale with a mean of 0.0 and a standard deviation of 1.0. 'Local independence' means that items are not related except for the fact that they measure the same trait; this is equivalent to the assumption of unidimensionality, but is presented separately because multidimensionality can be caused by other issues. The topic of dimensionality is often investigated with factor analysis, while the IRF is the basic building block of IRT and is the center of much of the research and literature.
The item response function gives the probability that a person with a given trait level will answer the item correctly. For the three parameter logistic (3PL) model, the IRF is

$$p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}$$

where θ is the person parameter and $a_i$, $b_i$, and $c_i$ are item parameters. The item parameters simply determine the shape of the IRF and in some cases have a direct interpretation. The figure to the right depicts an example of the 3PL model of the ICC with an overlaid conceptual explanation of the parameters. The item parameters can be interpreted as changing the shape of the standard logistic function

$$p(t) = \frac{1}{1 + e^{-t}}.$$
In brief, the parameters are interpreted as follows (dropping subscripts for legibility); b is most basic, hence listed first:
- b – difficulty, item location: $p(b) = (1 + c)/2$, the half-way point between c (min) and 1 (max), also where the slope is maximized.
- a – discrimination, scale, slope: the maximum slope $p'(b) = a(1 - c)/4$.
- c – pseudo-guessing, chance, asymptotic minimum: $p(-\infty) = c$.

If $c = 0$, then these simplify to $p(b) = 1/2$ and $p'(b) = a/4$, meaning that b equals the 50% success level (difficulty), and a (divided by four) is the maximum slope (discrimination), which occurs at the 50% success level.
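For illustration, here is a minimal Python sketch of the 3PL IRF as defined above (the example parameter values are invented):

import math

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    # Probability of a correct response under the 3PL model
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A medium-difficulty item (b = 0) with a = 1 and a guessing floor of c = 0.25;
# at theta = b the probability is 0.625, half-way between 0.25 and 1
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(irf_3pl(theta, a=1.0, b=0.0, c=0.25), 3))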
Further, the logit (log odds) of a correct response is $a(\theta - b)$ (assuming $c = 0$): at $\theta = b$ one has even odds (1:1, so logit 0) of a correct answer, the greater the ability is above (or below) the difficulty the more (or less) likely a correct response, with discrimination a determining how rapidly the odds increase or decrease with ability.

In words, the standard logistic function has an asymptotic minimum of 0 ($c = 0$), is centered around 0 ($b = 0$, $p(0) = 1/2$), and has maximum slope 1/4 ($a = 1$). The b parameter shifts the horizontal scale, the a parameter stretches the horizontal scale, and the c parameter compresses the vertical scale from $[0, 1]$ to $[c, 1]$. This is elaborated below.

The parameter b represents the item location which, in the case of attainment testing, is referred to as the item difficulty. It is the point on θ where the IRF has its maximum slope, and where the value is half-way between the minimum value of c and the maximum value of 1. The example item is of medium difficulty since b = 0.0, which is near the center of the distribution. Note that this model scales the item's difficulty and the person's trait onto the same continuum. Thus, it is valid to talk about an item being about as hard as Person A's trait level or of a person's trait level being about the same as Item Y's difficulty, in the sense that successful performance of the task involved with an item reflects a specific level of ability.

The item parameter a represents the discrimination of the item: that is, the degree to which the item discriminates between persons in different regions on the latent continuum. This parameter characterizes the slope of the IRF where the slope is at its maximum. The example item has a = 1.0, which discriminates fairly well; persons with low ability do indeed have a much smaller chance of correctly responding than persons of higher ability.

For items such as multiple choice items, the parameter c is used in an attempt to account for the effects of guessing on the probability of a correct response. It indicates the probability that very low ability individuals will get the item correct by chance, mathematically represented as a lower asymptote. A four-option multiple choice item might have an IRF like the example item: there is a 1/4 chance of an extremely low ability candidate guessing the correct answer, so c would be approximately 0.25. This approach assumes that all options are equally plausible, because if one option made no sense, even the lowest ability person would be able to discard it, so IRT parameter estimation methods take this into account and estimate c based on the observed data.[4]
IRT models
Broadly speaking, IRT models can be divided into two families: unidimensional and multidimensional. Unidimensional models require a single trait (ability) dimension θ. Multidimensional IRT models describe response data hypothesized to arise from multiple traits. However, because of the greatly increased complexity, the majority of IRT research and applications utilize a unidimensional model. IRT models can also be categorized based on the number of scored responses. The typical multiple choice item is dichotomous; even though there may be four or five options, it is still scored only as correct/incorrect (right/wrong). Another class of models applies to polytomous outcomes, where each response has a different score value.[5][6] A common example is Likert-type items, e.g., "Rate on a scale of 1 to 5."
Under the one-parameter model, parameter estimates are sample independent, a property that does not hold for two-parameter and three-parameter models. Additionally, there is theoretically a four-parameter model (4PL), with an upper asymptote denoted by d, where the $1 - c_i$ in the 3PL is replaced by $d_i - c_i$. However, this is rarely used. Note that the alphabetical order of the item parameters does not match their practical or psychometric importance; the location/difficulty (b) parameter is clearly most important because it is included in all three models. The 1PL uses only b, the 2PL uses b and a, the 3PL adds c, and the 4PL adds d.

The 2PL is equivalent to the 3PL model with $c = 0$, and is appropriate for testing items where guessing the correct answer is highly unlikely, such as fill-in-the-blank items ("What is the square root of 121?"), or where the concept of guessing does not apply, such as personality, attitude, or interest items (e.g., "I like Broadway musicals. Agree/Disagree"). The 1PL assumes not only that guessing is not present (or irrelevant), but that all items are equivalent in terms of discrimination, analogous to a common factor analysis with identical loadings for all items. Individual items or individuals might have secondary factors, but these are assumed to be mutually independent and collectively orthogonal.
An alternative formulation is the normal-ogive model, in which the IRF is

$$p_i(\theta) = \Phi\left(\frac{\theta - b_i}{\sigma_i}\right)$$

where Φ is the cumulative distribution function (cdf) of the standard normal distribution. The normal-ogive model derives from the assumption of normally distributed measurement error and is theoretically appealing on that basis. Here $b_i$ is, again, the difficulty parameter. The discrimination parameter is $\sigma_i$, the standard deviation of the measurement error for item i, comparable to $1/a_i$. One can estimate a normal-ogive latent trait model by factor-analyzing a matrix of tetrachoric correlations between items.[8] This means it is technically possible to estimate a simple IRT model using general-purpose statistical software. With rescaling of the ability parameter, it is possible to make the 2PL logistic model closely approximate the cumulative normal ogive. Typically, the 2PL logistic and normal-ogive IRFs differ in probability by no more than 0.01 across the range of the function. The difference is greatest in the distribution tails, however, which tend to have more influence on results.

The latent trait/IRT model was originally developed using normal ogives, but this was considered too computationally demanding for the computers of the time (1960s). The logistic model was proposed as a simpler alternative, and has enjoyed wide use since. More recently, however, it was demonstrated that, using standard polynomial approximations to the normal cdf,[9] the normal-ogive model is no more computationally demanding than logistic models.[10]
Under the Rasch approach, data are required to conform to the model before claims regarding the presence of a latent trait can be considered valid. Therefore, under Rasch models, misfitting responses require diagnosis of the reason for the misfit, and may be excluded from the data set if substantive explanations can be made that they do not address the latent trait.[14] Thus, the Rasch approach can be seen to be a confirmatory approach, as opposed to exploratory approaches that attempt to model the observed data. As in any confirmatory analysis, care must be taken to avoid confirmation bias.

The presence or absence of a guessing or pseudo-chance parameter is a major and sometimes controversial distinction. The IRT approach includes a left asymptote parameter to account for guessing in multiple choice examinations, while the Rasch model does not, because it is assumed that guessing adds randomly distributed noise to the data. As the noise is randomly distributed, it is assumed that, provided sufficient items are tested, the rank-ordering of persons along the latent trait by raw score will not change, but will simply undergo a linear rescaling. Three-parameter IRT, by contrast, achieves data-model fit by selecting a model that fits the data,[15] at the expense of sacrificing specific objectivity.

In practice, the Rasch model has at least two principal advantages in comparison to the IRT approach. The first advantage is the primacy of Rasch's specific requirements,[16] which (when met) provide fundamental person-free measurement (where persons and items can be mapped onto the same invariant scale).[17] Another advantage of the Rasch approach is that estimation of parameters is more straightforward in Rasch models due to the presence of sufficient statistics, which in this application means a one-to-one mapping of raw number-correct scores to Rasch θ estimates.[18]
Information
One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error), and it is measured using a single index defined in various ways, such as the ratio of true and observed score variance. This index is helpful in characterizing a test's average reliability, for example in order to compare two tests. But IRT makes it clear that precision is not uniform across the entire range of test scores. Scores at the edges of the test's range, for example, generally have more error associated with them than scores closer to the middle of the range.

Item response theory advances the concept of item and test information to replace reliability. Information is also a function of the model parameters. For example, according to Fisher information theory, the item information supplied in the case of the 1PL for dichotomous response data is simply the probability of a correct response multiplied by the probability of an incorrect response, or

$$I(\theta) = p(\theta)\, q(\theta).$$

The standard error of estimation (SE) at a given trait level is the reciprocal of the square root of the test information at that level:

$$\text{SE}(\theta) = \frac{1}{\sqrt{I(\theta)}}.$$

Thus more information implies less error of measurement. For other models, such as the two and three parameter models, the discrimination parameter plays an important role in the function. The item information function for the two parameter model is

$$I(\theta) = a^2\, p(\theta)\, q(\theta).$$ [19]
In general, item information functions tend to look bell-shaped. Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range. Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range. Because of local independence, item information functions are additive. Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely. Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference between IRT and CTT. IRT findings reveal that the CTT concept of reliability is a simplification. In the place of reliability, IRT offers the test information function, which shows the degree of precision at different values of theta, θ. These results allow psychometricians to (potentially) carefully shape the level of reliability for different ranges of ability by including carefully chosen items. For example, in a certification situation in which a test can only be passed or failed, where there is only a single "cutscore," and where the actual passing score is unimportant, a very efficient test can be developed by selecting only items that have high information near the cutscore. These items generally correspond to items whose difficulty is about the same as that of the cutscore.
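A short Python sketch of these relationships for the 2PL (item parameters are invented; real work would use dedicated IRT software):

import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info_2pl(theta, a, b):
    # Fisher information of a 2PL item: a^2 * p * q
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def test_info(theta, items):
    # By local independence, test information is the sum over items
    return sum(item_info_2pl(theta, a, b) for a, b in items)

items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7)]  # (a, b) pairs, invented
theta = 0.0
info = test_info(theta, items)
se = 1.0 / math.sqrt(info)  # standard error of estimation at this trait level
print(round(info, 3), round(se, 3))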
Scoring
The person parameter θ represents the magnitude of the latent trait of the individual, which is the human capacity or attribute measured by the test.[20] It might be a cognitive ability, physical ability, skill, knowledge, attitude, personality characteristic, etc. The estimate of the person parameter, the "score" on a test with IRT, is computed and interpreted in a very different manner than traditional scores like number or percent correct. The individual's total number-correct score is not the actual score, but is rather based on the IRFs, leading to a weighted score when the model contains item discrimination parameters. It is actually obtained by multiplying the item response function for each item to obtain a likelihood function, the highest point of which is the maximum likelihood estimate of θ. This highest point is typically estimated with IRT software using the Newton–Raphson method.[21] While scoring is much more sophisticated with IRT, for most tests the (linear) correlation between the theta estimate and a traditional score is very high; often it is .95 or more. A graph of IRT scores against traditional scores shows an ogive shape, implying that the IRT estimates separate individuals at the borders of the range more than in the middle.

An important difference between CTT and IRT is the treatment of measurement error, indexed by the standard error of measurement. All tests, questionnaires, and inventories are imprecise tools; we can never know a person's true score, but rather only have an estimate, the observed score. There is some amount of random error which may push the observed score higher or lower than the true score. CTT assumes that the amount of error is the same for each examinee, but IRT allows it to vary.[22] Also, nothing about IRT refutes human development or improvement or assumes that a trait level is fixed. A person may learn skills, knowledge or even so-called "test-taking skills" which may translate to a higher true score. In fact, a portion of IRT research focuses on the measurement of change in trait level.[23]
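The following sketch estimates θ for a response pattern by a crude grid search over the likelihood (production IRT software uses Newton–Raphson, as noted above; all parameter values here are invented):

import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    # Local independence: the likelihood is the product over items,
    # so the log-likelihood is a sum
    ll = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

def mle_theta(responses, items):
    # Grid search; note that all-correct/all-wrong patterns have no finite MLE
    grid = [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

items = [(1.0, -1.0), (1.2, 0.0), (0.9, 1.0)]  # (a, b) pairs, invented
print(mle_theta([1, 1, 0], items))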
A comparison of classical and modern test theory

Lord[24] showed that, under typical conditions, the 2PL discrimination parameter is approximately a monotonic function of the point-biserial correlation, via

$$a_i \cong \frac{\rho_{it}}{\sqrt{1 - \rho_{it}^2}}$$

where $\rho_{it}$ is the point biserial correlation of item i. Thus, if the assumption holds, where there is a higher discrimination there will generally be a higher point-biserial correlation.

Another similarity is that while IRT provides for a standard error of each estimate and an information function, it is also possible to obtain an index for a test as a whole which is directly analogous to Cronbach's alpha, called the separation index. To do so, it is necessary to begin with a decomposition of an IRT estimate into a true location and error, analogous to the decomposition of an observed score into a true score and error in CTT. Let

$$\hat{\theta} = \theta + \varepsilon$$

where θ is the true location and ε is the error associated with the estimate. Then SE(θ) is an estimate of the standard deviation of ε for a person with a given weighted score, and the separation index is obtained as follows:

$$R_\theta = \frac{\operatorname{var}[\hat{\theta}] - \overline{\operatorname{SE}^2}}{\operatorname{var}[\hat{\theta}]}$$

where the mean squared standard error of the person estimates gives an estimate of the variance of the errors, $\sigma_\varepsilon^2$, across persons. The standard errors are normally produced as a by-product of the estimation process. The separation index is typically very close in value to Cronbach's alpha.[25]

IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT.
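A sketch of the separation index, assuming person estimates and their standard errors are already available from an estimation run (all values invented):

import statistics

def separation_index(theta_hats, std_errors):
    # (observed variance of estimates - mean squared SE) / observed variance
    obs_var = statistics.variance(theta_hats)
    mse = sum(se * se for se in std_errors) / len(std_errors)
    return (obs_var - mse) / obs_var

theta_hats = [-1.2, -0.4, 0.1, 0.8, 1.5]     # person estimates (invented)
std_errors = [0.35, 0.30, 0.28, 0.31, 0.38]  # their standard errors
print(round(separation_index(theta_hats, std_errors), 3))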
References
[1] van Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196–201.
[2] ETS Research Overview (http://www.ets.org/portal/site/ets/menuitem.c988ba0e5dd572bada20bc47c3921509/?vgnextoid=26fdaf5e44df4010VgnVCM10000022f95190RCRD&vgnextchannel=ceb2be3a864f4010VgnVCM10000022f95190RCRD)
[3] Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Press.
[7] Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test Scoring (pp. 73–140). Mahwah, NJ: Lawrence Erlbaum Associates.
[8] Jöreskog, K. G., & Sörbom, D. (1988). PRELIS 1 user's manual, version 1. Chicago: Scientific Software, Inc.
[9] Abramowitz, M., & Stegun, I. A. (1972). Handbook of Mathematical Functions. Washington, DC: U.S. Government Printing Office.
[11] Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath, & S. Lovibond (Eds.), Mathematical and Theoretical Systems (pp. 7–16). Amsterdam: Elsevier Science Publishers.
[12] Steinberg, J. (2000). Frederic Lord, who devised testing yardstick, dies at 87. New York Times, February 10, 2000.
[16] Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B. D. Wright. Chicago: The University of Chicago Press.
[18] Fischer, G. H., & Molenaar, I. W. (1995). Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer.
[19] de Ayala, R. J. (2009). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press. (6.12), p. 144.
[20] Lazarsfeld, P. F., & Henry, N. W. (1968). Latent Structure Analysis. Boston: Houghton Mifflin.
[23] Hall, L. A., & McDonald, J. L. (2000). Measuring Change in Teachers' Perceptions of the Impact that Staff Development Has on Teaching. (http://eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED441789&ERICExtSearch_SearchType_0=no&accno=ED441789) Paper presented at the Annual Meeting of the American Educational Research Association (New Orleans, LA, April 24–28, 2000).
[24] Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates.
Additional reading
Many books have been written that address item response theory or contain IRT or IRT-like models. This is a partial list, focusing on texts that provide more depth.
- Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Erlbaum. This book summarizes much of Lord's IRT work, including chapters on the relationship between IRT and classical methods, fundamentals of IRT, estimation, and several advanced topics. Its estimation chapter is now dated in that it primarily discusses the joint maximum likelihood method rather than the marginal maximum likelihood method implemented by Darrell Bock and his colleagues.
- Embretson, Susan E.; Reise, Steven P. (2000). Item Response Theory for Psychologists (http://books.google.com/books?id=rYU7rsi53gQC). Psychology Press. ISBN 978-0-8058-2819-1. This book is an accessible introduction to IRT, aimed, as the title says, at psychologists.
- Baker, Frank (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD. This introductory book is by one of the pioneers in the field, and is available online at http://edres.org/irt/baker/
- Baker, Frank B.; Kim, Seock-Ho (2004). Item Response Theory: Parameter Estimation Techniques (http://books.google.com/books?id=y-Q_Q7pasJ0C) (2nd ed.). Marcel Dekker. ISBN 978-0-8247-5825-7. This book describes various item response theory models and furnishes detailed explanations of algorithms that can be used to estimate the item and ability parameters. Portions of the book are available online as limited preview at Google Books.
- van der Linden, Wim J.; Hambleton, Ronald K., eds. (1996). Handbook of Modern Item Response Theory (http://books.google.com/books?id=aytUuwl4ku0C). Springer. ISBN 978-0-387-94661-0. This book provides a comprehensive overview of various popular IRT models. It is well suited for persons who have already gained a basic understanding of IRT.
- de Boeck, Paul; Wilson, Mark (2004). Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach (http://books.google.com/books?id=pDeLy5L14mAC). Springer. ISBN 978-0-387-40275-8. This volume provides an integrated introduction to item response models, mainly aimed at practitioners, researchers and graduate students.
- Fox, Jean-Paul (2010). Bayesian Item Response Modeling: Theory and Applications (http://books.google.com/books?id=BZcPc4ffSTEC). Springer. ISBN 978-1-4419-0741-7. This book discusses the Bayesian approach towards item response modeling, and will be useful for persons who are familiar with IRT and have an interest in analyzing item response data from a Bayesian perspective.
External links
"HISTORY OF ITEM RESPONSE THEORY (up to 1982)" (http://www.uic.edu/classes/ot/ot540/history. html), University of Illinois at Chicago A Simple Guide to the Item Response Theory(PDF) (http://www.creative-wisdom.com/computer/sas/IRT. pdf) Psychometric Software Downloads (http://www.umass.edu/remp/main_software.html) flexMIRT IRT Software (http://flexMIRT.VPGCentral.com) IRT Tutorial (http://work.psych.uiuc.edu/irt/tutorial.asp) IRT Tutorial FAQ (http://sites.google.com/site/benroydo/irt-tutorial) An introduction to IRT (http://edres.org/irt/) The Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html) IRT Command Language (ICL) computer program (http://www.b-a-h.com/software/irt/icl/) IRT Programs from SSI, Inc. (http://www.ssicentral.com/irt/index.html) IRT Programs from Assessment Systems Corporation (http://assess.com/xcart/home.php?cat=37) IRT Programs from Winsteps (http://www.winsteps.com) Latent Trait Analysis and IRT Models (http://www.john-uebersax.com/stat/lta.htm) Rasch analysis (http://www.rasch-analysis.com/) Free IRT software (http://www.john-uebersax.com/stat/papers.htm) IRT Packages in R (http://cran.r-project.org/web/views/Psychometrics.html)
Jenkins activity survey

External links
Further information [1]
References
[1] http://www.cps.nova.edu/~cpphelp/JAS.html
Jensen box
The Jensen box was developed by University of California, Berkeley psychologist Arthur Jensen as a standard apparatus for measuring choice reaction time, especially in relationship to differences in intelligence.[1] Since Jensen created this approach, correlations between simple and choice reaction time and intelligence have been demonstrated in many hundreds of studies. Perhaps the best evidence was reported by Ian Deary and colleagues, in a population-based cohort study of 900 individuals, demonstrating correlations between IQ and simple and choice reaction time of 0.3 and 0.5 respectively, and of 0.26 with the degree of variation between trials shown by an individual.
The standard box is around 20 inches wide and 12 deep, with a sloping face on which 8 buttons are arrayed in a semicircle, with a 'home' key in the lower center. Above each response button lies a small LED which can be illuminated, and the box contains a loudspeaker to play alerting sounds. Following Hick's law,[2] reaction times (RTs) slow as the log2 of the number of choices presented increases. Thus responses are fastest when all but one button is covered, and slowest when all 8 responses are available. Several parameters can be extracted. The mean 1-choice RT gives simple reaction time. The slope of the function across 1, 2, 4, and 8 lights gives the rate of information processing, and the variance or standard deviation in RTs can be extracted to give a measure of response variability within subjects. Finally, the time to lift off the home button and the time to hit the response button can be measured separately, and these are typically thought of as assessing decision time and movement time, though in the standard paradigm subjects can shift decision time into the movement phase by lifting off the home button while the location computation is still incomplete. Masking the stimulus light can eliminate this artifact. Simple reaction time correlates around .4 with general ability, and there is some evidence that the slope of responding does also, so long as access to the stimulus is controlled.
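To illustrate how the Hick's-law parameters are extracted, here is a small sketch fitting RT against log2 of the number of alternatives (the reaction times are invented):

import math

n_choices = [1, 2, 4, 8]                # number of uncovered buttons
mean_rt = [300.0, 340.0, 385.0, 430.0]  # mean RTs in ms (invented)

# Hick's law: RT = intercept + slope * log2(n); fit by least squares
x = [math.log2(n) for n in n_choices]
mx = sum(x) / len(x)
my = sum(mean_rt) / len(mean_rt)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, mean_rt))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
print(f"simple RT ~ {intercept:.0f} ms; rate of gain ~ {slope:.1f} ms/bit")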
References
[1] A. R. Jensen. (1987). Individual differences in the Hick paradigm. In Speed of information-processing and intelligence. P. A. Vernon and et al., Norwood, NJ, USA, Ablex Publishing Corp, 101-175.
Kuder–Richardson Formula 20
In statistics, the Kuder–Richardson Formula 20 (KR-20), first published in 1937,[1] is a measure of internal consistency reliability for measures with dichotomous choices. It is analogous to Cronbach's α, except that Cronbach's α is also used for non-dichotomous (continuous) measures.[2] A high KR-20 coefficient (e.g., > 0.90) indicates a homogeneous test. Values can range from 0.00 to 1.00 (sometimes expressed as 0 to 100), with high values indicating that the examination is likely to correlate with alternate forms (a desirable characteristic). The KR-20 may be affected by the difficulty of the test, the spread in scores, and the length of the examination. When scores are not tau-equivalent (for example, when the examination contains not homogeneous items but rather items of increasing difficulty), the KR-20 is an indication of the lower bound of internal consistency (reliability). The KR-20 formula can't be used when multiple-choice questions involve partial credit, and it requires detailed item analysis.[3]
The formula is

$$\rho_{KR20} = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} p_i q_i}{\sigma_X^2}\right)$$

where K is the number of test items (i.e., the length of the test), $p_i$ is the proportion of correct responses to test item i, $q_i$ is the proportion of incorrect responses to test item i (so that $p_i + q_i = 1$), and the variance for the denominator is

$$\sigma_X^2 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n}$$

where n is the total sample size. If it is important to use unbiased estimators, then the sum of squares should be divided by the degrees of freedom (n − 1) and the probabilities multiplied by $n/(n-1)$.
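A minimal Python sketch of this computation (the scores are invented; note that it uses the population variance, matching the formula above):

import statistics

def kr20(responses):
    # responses: respondents x items matrix of 0/1 scores
    n = len(responses)
    k = len(responses[0])
    p = [sum(row[i] for row in responses) / n for i in range(k)]  # proportion correct per item
    pq = sum(pi * (1.0 - pi) for pi in p)
    totals = [sum(row) for row in responses]
    var_total = statistics.pvariance(totals)  # population variance, as in the formula
    return (k / (k - 1)) * (1.0 - pq / var_total)

data = [[1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]
print(round(kr20(data), 2))  # 0.8 for this (Guttman-like) response pattern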
Since Cronbach's α was published in 1951, there has been no known advantage to KR-20 over it. KR-20 is seen as a derivative of the Cronbach's α formula, with Cronbach's α having the additional advantage that it can handle both dichotomous and continuous variables.
References
[1] Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151–160.
[2] Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98–104.
[3] http://chemed.chem.purdue.edu/chemed/stats.html (as of 3/27/2013)
External links
Statistical analysis of multiple choice exams (http://chemed.chem.purdue.edu/chemed/stats.html)
Quality of assessment chapter in Illinois State Assessment handbook (1995) (http://www.gower.k12.il.us/Staff/ASSESS/4_ch2app.htm)
Latent variable
In statistics, latent variables (as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. Latent variable models are used in many disciplines, including psychology, economics, machine learning/artificial intelligence, bioinformatics, natural language processing, management and the social sciences. Sometimes latent variables correspond to aspects of physical reality, which could in principle be measured, but may not be for practical reasons. In this situation, the term hidden variables is commonly used (reflecting the fact that the variables are "really there", but hidden). Other times, latent variables correspond to abstract concepts, like categories, behavioral or mental states, or data structures. The terms hypothetical variables or hypothetical constructs may be used in these situations. One advantage of using latent variables is that it reduces the dimensionality of data. A large number of observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data. In this sense, they serve a function similar to that of scientific theories. At the same time, latent variables link observable ("sub-symbolic") data in the real world to symbolic data in the modeled world. Latent variables, as created by factor analytic methods, generally represent 'shared' variance, or the degree to which variables 'move' together. Variables that have no correlation cannot result in a latent construct based on the common factor model.[1]
Psychology
The "Big Five personality traits" have been inferred using factor analysis. extraversion[] spatial ability[] wisdom Two of the more predominant means of assessing wisdom include wisdom-related performance and latent variable measures.[]
Law of comparative judgment
Background
Thurstone published a paper on the law of comparative judgment in 1927. In this paper he introduced the underlying concept of a psychological continuum for a particular 'project in measurement' involving the comparison between a series of stimuli, such as weights and handwriting specimens, in pairs. He soon extended the domain of application of the law of comparative judgment to things that have no obvious physical counterpart, such as attitudes and values (Thurstone, 1929). For example, in one experiment, people compared statements about capital punishment to judge which of each pair expressed a stronger positive (or negative) attitude. The essential idea behind Thurstone's process and model is that it can be used to scale a collection of stimuli based on simple comparisons between stimuli two at a time: that is, based on a series of pairwise comparisons. For example, suppose that someone wishes to measure the perceived weights of a series of five objects of varying masses. By having people compare the weights of the objects in pairs, data can be obtained and the law of comparative judgment applied to estimate scale values of the perceived weights. This is the perceptual counterpart to the physical weight of the objects. That is, the scale represents how heavy people perceive the objects to be based on
the comparisons. Although Thurstone referred to it as a law, as stated above, in terms of modern psychometric theory the 'law' of comparative judgment is more aptly described as a measurement model. It represents a general theoretical model which, applied in a particular empirical context, constitutes a scientific hypothesis regarding the outcomes of comparisons between some collection of objects. If data agree with the model, it is possible to produce a scale from the data.
The most general form of the law of comparative judgment is

$$S_i - S_j = z_{ij}\sqrt{\sigma_i^2 + \sigma_j^2 - 2r_{ij}\sigma_i\sigma_j}$$

in which:
- $S_i$ is the psychological scale value of stimulus i
- $z_{ij}$ is the sigma corresponding with the proportion of occasions on which the magnitude of stimulus i is judged to exceed the magnitude of stimulus j
- $\sigma_i$ is the discriminal dispersion of a stimulus i
- $r_{ij}$ is the correlation between the discriminal deviations of stimuli i and j

The discriminal dispersion of a stimulus i is the dispersion of fluctuations of the discriminal process for a uniform repeated stimulus, where the scale value $S_i$ corresponds to the mode of such values. Thurstone (1959, p. 20) used the term discriminal process to refer to the "psychological values of psychophysics"; that is, the values on a psychological continuum associated with a given stimulus.
Case 5 of the model assumes uniform discriminal dispersions and zero correlations between the discriminal deviations, in which case the general form reduces to

$$S_i - S_j = z_{ij}\,\sigma\sqrt{2}$$

where $\sigma\sqrt{2}$ is constant, and is in effect an arbitrary choice of the unit of measurement. Letting $\sigma\sqrt{2} = 1$, for example, the difference $S_i - S_j$ can be inferred directly from the proportion of instances in which j is judged greater than i, if it is hypothesised that the discriminal difference is distributed according to some density function, such as the normal distribution or logistic function. In order to do so, it is necessary to let $P_{ij}$ be the proportion of occasions on which i is judged greater than j. If, for example, it is hypothesised that the discriminal difference is normally distributed, then it would be inferred that $z_{ij} = \Phi^{-1}(P_{ij})$.

When a simple logistic function is employed instead of the normal density function, the model has the structure of the Bradley-Terry-Luce model (BTL model) (Bradley & Terry, 1952; Luce, 1959). In turn, the Rasch model for dichotomous data (Rasch, 1960/1980) is identical to the BTL model after the person parameter of the Rasch model has been eliminated, as is achieved through statistical conditioning during the process of Conditional Maximum Likelihood estimation. With this in mind, the specification of uniform discriminal dispersions is equivalent to the requirement of parallel Item Characteristic Curves (ICCs) in the Rasch model. Accordingly, as shown by Andrich (1978), the Rasch model should, in principle, yield essentially the same results as those obtained from a Thurstone scale. Like the Rasch model, when applied in a given empirical context, Case 5 of the LCJ constitutes a mathematized hypothesis which embodies theoretical criteria for measurement.
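A small sketch of Case 5 scaling under the normality hypothesis: scale values are obtained by averaging the z-transforms of the observed proportions (the proportion matrix is invented):

from statistics import NormalDist

def thurstone_case5(P):
    # P[i][j]: proportion of occasions on which stimulus i is judged
    # greater than stimulus j; scale values have an arbitrary origin and unit
    nd = NormalDist()
    n = len(P)
    scale = []
    for i in range(n):
        zs = [nd.inv_cdf(P[i][j]) for j in range(n) if j != i]
        scale.append(sum(zs) / len(zs))
    return scale

P = [[0.5, 0.7, 0.9],
     [0.3, 0.5, 0.8],
     [0.1, 0.2, 0.5]]
print([round(s, 2) for s in thurstone_case5(P)])  # e.g., [0.9, 0.16, -1.06]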
Applications
One important application involving the law of comparative judgment is the widely used Analytic Hierarchy Process, a structured technique for helping people deal with complex decisions. It uses pairwise comparisons of tangible and intangible factors to construct ratio scales that are useful in making important decisions.[1]
References
Andrich, D. (1978). Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2, 449–460.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs, I: The method of paired comparisons. Biometrika, 39, 324–345.
Krus, D. J., & Kennedy, P. H. (1977). Normal scaling of dominance matrices: The domain-referenced model. Educational and Psychological Measurement, 37, 189–193. (http://www.visualstatistics.net/Scaling/Domain Referenced Scaling/Domain-Referenced Scaling.htm)
Luce, R. D. (1959). Individual Choice Behavior: A Theoretical Analysis. New York: J. Wiley.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355–383.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B. D. Wright. Chicago: The University of Chicago Press.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.
Thurstone, L. L. (1929). The measurement of psychological value. In T. V. Smith & W. K. Wright (Eds.), Essays in Philosophy by Seventeen Doctors of Philosophy of the University of Chicago. Chicago: Open Court.
Thurstone, L. L. (1959). The Measurement of Values. Chicago: The University of Chicago Press.
External links
"The Measurement of Pyschological Value" (http://www.brocku.ca/MeadProject/Thurstone/ Thurstone_1929a.html) How to Analyze Paired Comparisons (tutorial on using Thurstone's Law of Comparative Judgement) (http:// www.ee.washington.edu/research/guptalab/publications/ PairedComparisonTutorialTsukidaGuptaUWTechReport2011.pdf) L.L. Thurstone psychometric laboratory (http://www.unc.edu/depts/quantpsy/thurstone/history.htm)
Likert scale
A Likert scale (pron.: /ˈlɪkərt/[1]) is a psychometric scale commonly involved in research that employs questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, or more accurately the Likert-type scale, even though the two are not synonymous. The scale is named after its inventor, psychologist Rensis Likert.[2] Likert distinguished between a scale proper, which emerges from collective responses to a set of items (usually eight or more), and the format in which responses are scored along a range. Technically speaking, a Likert scale refers only to the former. The difference between these two concepts has to do with the distinction Likert made between the underlying phenomenon being investigated and the means of capturing variation that points to the underlying phenomenon.[3] When responding to a Likert questionnaire item, respondents specify their level of agreement or disagreement on a symmetric agree-disagree scale for a series of statements. Thus, the range captures the intensity of their feelings for a given item.[4] A scale can be created as the simple sum of questionnaire responses over the full range of the scale. In so doing, Likert scaling assumes that distances on each item are equal. Importantly, "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments"[5] (p. 197). By contrast, modern test theory treats the difficulty of each item (the ICCs) as information to be incorporated in scaling items.
A Likert scale pertaining to Wikipedia can be calculated using these five Likert items.
Likert scaling is a bipolar scaling method, measuring either positive or negative response to a statement. Sometimes an even-point scale is used, where the middle option of "Neither agree nor disagree" is not available. This is sometimes called a "forced choice" method, since the neutral option is removed.[8] The neutral option can be seen as an easy option to take when a respondent is unsure, and so whether it is a true neutral option is questionable. A 1987 study found negligible differences between the use of "undecided" and "neutral" as the middle option in a 5-point Likert scale.[9] Likert scales may be subject to distortion from several causes. Respondents may avoid using extreme response categories (central tendency bias); agree with statements as presented (acquiescence bias); or try to portray themselves or their organization in a more favorable light (social desirability bias). Designing a scale with balanced keying (an equal number of positive and negative statements) can obviate the problem of acquiescence bias, since acquiescence on positively keyed items will balance acquiescence on negatively keyed items, but central tendency and social desirability are somewhat more problematic.
Responses to individual Likert items are ordinal, and may be analyzed with non-parametric procedures such as the Mann–Whitney test or the Kruskal–Wallis test. While some commentators[13] consider that parametric analysis is justified for a Likert scale using the Central Limit Theorem, this should be reserved for when the Likert scale has suitable symmetry and equidistance, so that an interval-level measurement can be approximated and reasonably inferred. Responses to several Likert questions may be summed, providing that all questions use the same Likert scale and that the scale is a defensible approximation to an interval scale, in which case they may be treated as interval data measuring a latent variable. If the summed responses fulfill these assumptions, parametric statistical tests such as the analysis of variance can be applied. These can be applied only when 4 to 8 Likert questions (preferably closer to 8) are summed.[14]

Data from Likert scales are sometimes converted to binomial data by combining all agree and disagree responses into two categories of "accept" and "reject". The chi-squared, Cochran Q, or McNemar test are common statistical procedures used after this transformation.

Consensus based assessment (CBA) can be used to create an objective standard for Likert scales in domains where no generally accepted or objective standard exists. CBA can also be used to refine or even validate generally accepted standards.
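A brief illustration of summing and dichotomization (invented responses; here neutral midpoints are simply dropped in the dichotomization, which is one of several possible conventions):

# Responses of three people to four Likert items (1 = strongly disagree ... 5 = strongly agree)
respondents = [[4, 5, 3, 4],
               [2, 1, 2, 3],
               [5, 4, 4, 5]]

# Summed scale scores, treated as approximately interval per the caveats above
scale_scores = [sum(r) for r in respondents]

def dichotomize(score):
    # Collapse to reject (0) / accept (1); drop neutral midpoints
    if score <= 2:
        return 0
    if score >= 4:
        return 1
    return None

binary = [[dichotomize(s) for s in r] for r in respondents]
print(scale_scores, binary)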
Level of measurement
The five response categories are often believed to represent an interval level of measurement. But this can only be the case if the intervals between the scale points correspond to empirical observations in a metric sense. Reips and Funke (2008)[15] show that this criterion is much better met by a visual analogue scale. In fact, phenomena may appear that call even the ordinal scale level of Likert scales into question. For example, in a set of items A, B, C rated with a Likert scale, circular relations like A > B, B > C and C > A can appear. This violates the axiom of transitivity for the ordinal scale.
Rasch model
Likert scale data can, in principle, be used as a basis for obtaining interval level estimates on a continuum by applying the polytomous Rasch model, when data can be obtained that fit this model. In addition, the polytomous Rasch model permits testing of the hypothesis that the statements reflect increasing levels of an attitude or trait, as intended. For example, application of the model often indicates that the neutral category does not represent a level of attitude or trait between the disagree and agree categories. Again, not every set of Likert scaled items can be used for Rasch measurement. The data has to be thoroughly checked to fulfill the strict formal axioms of the model.
Pronunciation
Rensis Likert, the developer of the scale, pronounced his name 'lick-urt', with a short "i" sound.[16][17] It has been claimed that Likert's name "is among the most mispronounced in [the] field",[18] as many people pronounce it with a diphthong "i" sound ('lie-kurt').
References
[3] Carifio, James, & Perla, Rocco J. (2007). Ten common misunderstandings, misconceptions, persistent myths and urban legends about Likert scales and Likert response formats and their antidotes. Journal of Social Sciences, 3(3), 106–116.
[5] van Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196–201.
[8] Allen, Elaine, & Seaman, Christopher (2007). Likert scales and data analyses. Quality Progress, 2007, 64–65.
[9] Armstrong, Robert (1987). The midpoint on a five-point Likert-type scale. Perceptual and Motor Skills, 64, 359–362.
[10] Jamieson, Susan (2004). Likert scales: How to (ab)use them. Medical Education, 38(12), 1217–1218.
[11] Norman, Geoff (2010). Likert scales, levels of measurement and the laws of statistics. Advances in Health Science Education, 15(5), 625–632.
[12] Jamieson, Susan (2004).
[13] Norman, Geoff (2010).
[14] Carifio & Perla (2007). Ten common misunderstandings, misconceptions, persistent myths and urban legends about Likert scales and Likert response formats and their antidotes. Journal of Social Sciences, 3(3), 106–116.
External links
Carifio (2007). "Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes" (http://www.comp.dit.ie/dgordon/Courses/ResearchMethods/likertscales.pdf). Retrieved September 19, 2011.
Trochim, William M. (October 20, 2006). "Likert Scaling" (http://www.socialresearchmethods.net/kb/scallik.php). Research Methods Knowledge Base, 2nd Edition. Retrieved April 30, 2009.
Uebersax, John S. (2006). "Likert Scales: Dispelling the Confusion" (http://www.john-uebersax.com/stat/likert.htm). Retrieved August 17, 2009.
"A search for the optimum feedback scale" (http://www.getfeedback.net/kb/Choosing-the-optimium-feedback-scale). Getfeedback.
Correlation scatter-plot matrix for ordered-categorical data (http://www.r-statistics.com/2010/04/correlation-scatter-plot-matrix-for-ordered-categorical-data/) - on the visual presentation of correlation between Likert scale variables.
Net stacked distribution of Likert data (http://www.organizationview.com/net-stacked-distribution-a-better-way-to-visualize-likert-data/) - a method of visualizing Likert data to highlight differences from a central neutral value.
Linear-on-the-fly testing
Linear-on-the-fly testing, often referred to as LOFT, is a method of delivering educational or professional examinations. Competing methods include traditional linear fixed-form delivery and computerized adaptive testing. LOFT is a compromise between the two, in an effort to maintain the equivalence of the set of items that each examinee sees, which is found in fixed-form delivery, while reducing item exposure and enhancing test security.

Fixed-form delivery, which most people are familiar with, entails the testing organization determining one or several fixed sets of items to be delivered together. For example, suppose the test contains 100 items, and the organization wishes for two forms. Two forms are published with a fixed set of 100 items each, some of which should overlap to enable equating. All examinees that take the test are given one of the two forms. If this exam is high volume, meaning that there is a large number of examinees, the security of the examination could be in jeopardy: many of the test items would become well known in the population of examinees. To offset this, more forms would be needed; if there were eight forms, not as many examinees would see each item.

LOFT takes this to an extreme, and attempts to construct a unique exam for each candidate, within the given constraints of the testing program. Rather than publishing a fixed set of items, a large pool of items is delivered to the computer on which the examinee is taking the exam, together with a computer program to pseudo-randomly select items so that every examinee receives a test that is equivalent with respect to content and statistical characteristics,[1] although composed of a different set of items. This is usually done with item response theory.
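A toy sketch of on-the-fly assembly under a content blueprint (the pool, areas, and counts are invented; a real LOFT engine would also balance statistical characteristics using IRT, which is omitted here):

import random

def assemble_loft_form(pool, blueprint, seed):
    # Draw the required number of items per content area, pseudo-randomly,
    # so each examinee gets a different but content-balanced form
    rng = random.Random(seed)  # per-examinee seed makes the form reproducible
    form = []
    for area, n_needed in blueprint.items():
        candidates = [item for item in pool if item["area"] == area]
        form.extend(rng.sample(candidates, n_needed))
    rng.shuffle(form)
    return form

pool = [{"id": k, "area": "algebra" if k % 2 else "geometry"} for k in range(40)]
blueprint = {"algebra": 3, "geometry": 2}  # items required per content area
print([item["id"] for item in assemble_loft_form(pool, blueprint, seed=12345)])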
References
[1] Luecht, R. M. (2005). Some useful cost-benefit criteria for evaluating computer-based test delivery models and systems. Journal of Applied Testing Technology, 7(2). (http://www.testpublishers.org/Documents/JATT2005_rev_Criteria4CBT_RMLuecht_Apr2005.pdf)
Frederic M. Lord
Frederic M. Lord (November 12, 1912, Hanover, NH – February 5, 2000) was a psychometrician for Educational Testing Service. He was the source of much of the seminal research on item response theory,[1] including two important books: Statistical Theories of Mental Test Scores (1968, with Melvin Novick, and two chapters by Allan Birnbaum), and Applications of Item Response Theory to Practical Testing Problems (1980). Lord has been called the "Father of Modern Testing."[2]
References
[1] ETS Research Overview (http://www.ets.org/portal/site/ets/menuitem.c988ba0e5dd572bada20bc47c3921509/?vgnextoid=26fdaf5e44df4010VgnVCM10000022f95190RCRD&vgnextchannel=ceb2be3a864f4010VgnVCM10000022f95190RCRD)
[2] NCME News: Frederic Lord, Father of Modern Testing, Dies at 87 (http://www.ncme.org/news/newsdetail.cfm?ID=21&ArchView=y)
Measurement invariance
Measurement invariance or measurement equivalence is a statistical property of measurement that indicates that the same construct is being measured across some specified groups. For example, measurement invariance can be used to study whether a given measure is interpreted in a conceptually similar manner by respondents representing different genders or cultural backgrounds. Violations of measurement invariance may preclude meaningful interpretation of measurement data. Tests of measurement invariance are increasingly used in fields such as psychology to supplement evaluation of measurement quality rooted in classical test theory.[1] Measurement invariance is relevant in the context of latent variables. Measurement invariance is supported if relationships between manifest indicator variables and the latent construct are the same across groups. Measurement invariance is usually tested in the framework of multiple-group confirmatory factor analysis.[2]
References
[1] Vandenberg, Robert J., & Lance, Charles E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70.
[2] Chen, Fang Fang, Sousa, Karen H., & West, Stephen G. (2005). Testing measurement invariance of second-order factor models. Structural Equation Modeling, 12, 471–492.
Mediation (statistics)
In statistics, a mediation model is one that seeks to identify and explicate the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third explanatory variable, known as a mediator variable. Rather than hypothesizing a direct causal relationship between the independent variable and the dependent variable, a mediational model hypothesizes that the independent variable influences the mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables.[1] In other words, mediating relationships occur when a third variable plays an important role in governing the relationship between the other two variables.

Researchers are now focusing their studies on better understanding known findings. Mediation analyses are employed to understand a known relationship by exploring the underlying mechanism or process by which one variable (X) influences another variable (Y). For example, a cause X of some variable Y presumably precedes Y in time and has a generative mechanism that accounts for its impact on Y.[2] Thus, if gender is thought to be the cause of some characteristic, one assumes that other social or biological mechanisms are present in the concept of gender that can explain how gender-associated differences arise. The explicit inclusion of such a mechanism is called a mediator.
Baron and Kenny's (1986) steps for establishing mediation[3]
Step 1: Regress the dependent variable on the independent variable; that is, confirm that the independent variable is a significant predictor of the dependent variable (Independent variable → Dependent variable; path β11 is significant).
Step 2: Regress the mediator on the independent variable; that is, confirm that the independent variable is a significant predictor of the mediator. If the mediator is not associated with the independent variable, then it couldn't possibly mediate anything (Independent variable → Mediator; path β21 is significant).
Step 3: Regress the dependent variable on both the mediator and the independent variable; that is, confirm that the mediator is a significant predictor of the dependent variable while controlling for the independent variable. This step involves demonstrating that when the mediator and the independent variable are used simultaneously to predict the dependent variable, the previously significant path between the independent and dependent variables (Step 1) is greatly reduced, if not nonsignificant. In other words, if the mediator were removed from the relationship, the relationship between the independent and dependent variables would be noticeably reduced (path β32 is significant; β31 should be smaller in absolute value than the original effect, β11 above).
Example
The following example, drawn from Howell (2009),[4] illustrates each step of Baron and Kenny's requirements to show how a mediation effect is characterized. Steps 1 and 2 use simple regression analysis, whereas step 3 uses multiple regression analysis.
Step 1: How you were parented (i.e., the independent variable) predicts how confident you feel about parenting your own children (i.e., the dependent variable).
Step 2: How you were parented (i.e., the independent variable) predicts your feelings of competence and self-esteem (i.e., the mediator).
Step 3: Your feelings of competence and self-esteem (i.e., the mediator) predict how confident you feel about parenting your own children (i.e., the dependent variable), while controlling for how you were parented (i.e., the independent variable).
Such findings would lead to the conclusion that your feelings of competence and self-esteem mediate the relationship between how you were parented and how confident you feel about parenting your own children.
Note: If step 1 does not yield a significant result, one may still have grounds to move to step 2. There may be a genuine relationship between the independent and dependent variables that, because of a small sample size or other extraneous factors, the analysis lacks the power to detect (see Shrout & Bolger, 2002,[5] for more information).
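The three steps correspond directly to three regressions. A minimal sketch in Python on simulated data, assuming statsmodels is available; the variable names and effect sizes are illustrative, not from the source:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                       # independent variable
m = 0.5 * x + rng.normal(size=n)             # mediator, driven by x
y = 0.4 * m + 0.1 * x + rng.normal(size=n)   # outcome

# Step 1: total effect of x on y
step1 = sm.OLS(y, sm.add_constant(x)).fit()

# Step 2: effect of x on the mediator
step2 = sm.OLS(m, sm.add_constant(x)).fit()

# Step 3: y on both x and m; the coefficient on x should shrink
X3 = sm.add_constant(np.column_stack([x, m]))
step3 = sm.OLS(y, X3).fit()

print("total effect of x:  ", step1.params[1])
print("x -> m (path a):    ", step2.params[1])
print("direct effect of x: ", step3.params[1])  # smaller than total
print("m -> y (path b):    ", step3.params[2])
```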
Sobel's Test
As mentioned above, Sobel's test[] is calculated to determine whether the relationship between the independent variable and the dependent variable has been significantly reduced after inclusion of the mediator variable. In other words, this test assesses whether a mediation effect is significant: it compares the relationship between the independent variable and the dependent variable with the corresponding relationship once the mediator is included. The Sobel test is more accurate than the Baron and Kenny steps explained above; however, it has low statistical power. As such, large sample sizes are required in order to have sufficient power to detect significant effects. This is because the key assumption of Sobel's test is normality: because the test evaluates a given sample against the normal distribution, small sample sizes and skewness of the sampling distribution can be problematic (see normal distribution for more details). Thus, the general rule of thumb suggested by MacKinnon et al. (2002)[8] is that a sample size of 1000 is required to detect a small effect, a sample size of 100 is sufficient to detect a medium effect, and a sample size of 50 is required to detect a large effect.
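The test statistic divides the estimated indirect effect a·b by its approximate standard error. A minimal sketch, assuming the path coefficients and standard errors have been taken from the two regressions described above; the numbers passed in here are invented for illustration:

```python
import math
from scipy.stats import norm

def sobel_test(a, se_a, b, se_b):
    """Sobel z-test for the indirect effect a*b.

    a, se_a: coefficient and standard error for the IV -> mediator path
    b, se_b: coefficient and standard error for the mediator -> DV path
             (estimated while controlling for the IV)
    """
    z = (a * b) / math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    p = 2 * norm.sf(abs(z))  # two-tailed p-value under normality
    return z, p

# Invented illustrative estimates:
z, p = sobel_test(a=0.50, se_a=0.045, b=0.40, se_b=0.044)
print(f"Sobel z = {z:.2f}, p = {p:.4f}")
```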
Significance of mediation
As outlined above, there are a few different options one can choose from to evaluate a mediation model. Bootstrapping[9][10] is becoming the most popular method of testing mediation because it does not require the normality assumption to be met, and because it can be effectively utilized with smaller sample sizes (N < 25). However, mediation continues to be most frequently determined using the logic of Baron and Kenny[11] or the Sobel test. It is becoming increasingly difficult to publish tests of mediation based purely on the Baron and Kenny method or on tests that make distributional assumptions, such as the Sobel test. Thus, it is important to consider the options when choosing which test to conduct.[]
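A percentile bootstrap of the indirect effect avoids the normality assumption: resample cases with replacement, re-estimate the a and b paths each time, and inspect the resulting distribution of a·b. A minimal sketch with NumPy on simulated data; all names and values are illustrative:

```python
import numpy as np

def bootstrap_indirect_effect(x, m, y, n_boot=5000, seed=0):
    """Percentile bootstrap CI for the indirect effect a*b."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)
        xs, ms, ys = x[idx], m[idx], y[idx]
        # a path: m ~ 1 + x
        a = np.linalg.lstsq(np.column_stack([np.ones(n), xs]),
                            ms, rcond=None)[0][1]
        # b path: y ~ 1 + x + m
        b = np.linalg.lstsq(np.column_stack([np.ones(n), xs, ms]),
                            ys, rcond=None)[0][2]
        estimates[i] = a * b
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return lo, hi  # a CI excluding 0 suggests a significant indirect effect

rng = np.random.default_rng(1)
x = rng.normal(size=200)
m = 0.5 * x + rng.normal(size=200)
y = 0.4 * m + 0.1 * x + rng.normal(size=200)
print(bootstrap_indirect_effect(x, m, y))
```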
Approaches to Mediation
While the concept of mediation as defined within psychology is theoretically appealing, the methods used to study mediation empirically have been challenged by statisticians and epidemiologists[][7][12] and interpreted formally.[6]
(1) Experimental-causal-chain design. An experimental-causal-chain design is used when the proposed mediator is experimentally manipulated. Such a design implies that one manipulates some controlled third variable that one has reason to believe could be the underlying mechanism of a given relationship.
(2) Measurement-of-mediation design. A measurement-of-mediation design can be conceptualized as a statistical approach. Such a design implies that one measures the proposed intervening variable and then uses statistical analyses to establish mediation. This approach does not involve manipulation of the hypothesized mediating variable; it involves only measurement.
See Spencer et al. (2005)[13] for a discussion of the approaches mentioned above.
Proponents of mediation analysis have offered counter-evidence to this disparagement. Specifically, the following counter-arguments have been put forward:[2]
(1) Temporal precedence. For example, if the independent variable precedes the dependent variable in time, this provides evidence suggesting a directional, and potentially causal, link from the independent variable to the dependent variable.
(2) Nonspuriousness and/or no confounds. For example, if one identifies other third variables and shows that they do not alter the relationship between the independent variable and the dependent variable, one has a stronger argument for the mediation effect. See other third variables below.
Mediation can be an extremely useful and powerful statistical technique; however, it must be used properly. It is important that the measures used to assess the mediator and the dependent variable are theoretically distinct, and that the independent variable and mediator cannot interact. Should there be an interaction between the independent variable and the mediator, one would have grounds to investigate moderation.
Mediator Variable
A mediator variable (or mediating variable, or intervening variable) in statistics is a variable that describes how, rather than when, effects will occur by accounting for the relationship between the independent and dependent variables. A mediating relationship is one in which the path relating A to C is mediated by a third variable (B). For example, a mediating variable explains the actual relationship between the following variables: most people will agree that older drivers (up to a certain point) are better drivers. Thus:
Aging → Better driving
But what is missing from this relationship is a mediating variable that is actually causing the improvement in driving: experience. The mediated relationship would look like the following:
Aging → Increased experience driving a car → Better driving
Mediating variables are often contrasted with moderating variables, which pinpoint the conditions under which an independent variable exerts its effects on a dependent variable.
Moderated Mediation
Mediation and moderation can co-occur in statistical models. It is possible to mediate moderation and to moderate mediation. Moderated mediation occurs when the effect of the treatment A on the mediator B, and/or the partial effect of B on C, depends on levels of another variable (D). Essentially, in moderated mediation, mediation is first established, and then one investigates whether the mediation effect that describes the relationship between the independent variable and the dependent variable is moderated by levels of another variable (i.e., a moderator). This definition has been outlined by Muller, Judd, and Yzerbyt (2005)[] and Preacher, Rucker, and Hayes (2007).[14]
Mediated Moderation
Mediated moderation is a variant of both moderation and mediation. Here there is initially overall moderation, and the direct effect of the moderator variable on the outcome is mediated either at the A path in the diagram, between the independent variable and the moderating variable, or at the B path, between the moderating variable and the dependent variable.
A simple statistical moderation model.
The main difference between mediated moderation and moderated mediation is that for the former there is initial moderation and this effect is mediated, whereas for the latter there is no moderation but the effect of either the treatment on the mediator (path A) is moderated or the effect of the mediator on the outcome (path B) is moderated.[] In order to establish mediated moderation, one must first establish moderation, meaning that the direction and/or the strength of the relationship between the independent and dependent variables (path C) differs depending on the level of a third variable (the moderator variable). Researchers next look for the presence of mediated moderation when they have a theoretical reason to believe that there is a fourth variable that acts as the mechanism or process that causes the relationship between the independent variable and the moderator (path A) or between the moderator and the dependent variable (path B).
Example
The following is a published example of mediated moderation in psychological research.[15] Participants were presented with an initial stimulus (a prime) that made them think of morality or made them think of might. They then participated in the Prisoner's Dilemma Game (PDG), in which participants pretend that they and their partner in crime have been arrested, and they must decide whether to remain loyal to their partner or to compete with their partner and cooperate with the authorities. The researchers found that prosocial individuals were affected by the morality and might primes, whereas proself individuals were not. Thus, social value orientation (proself vs. prosocial) moderated the relationship between the prime (independent variable: morality vs. might) and the behaviour chosen in the PDG (dependent variable: competitive vs. cooperative). The researchers next looked for the presence of a mediated moderation effect. Regression analyses revealed that the type of prime (morality vs. might) mediated the moderating relationship of participants' social value orientation on PDG behaviour. Prosocial participants who experienced the morality prime expected their partner to cooperate with them, so they chose to cooperate themselves. Prosocial participants who experienced the might prime expected their partner to compete with them, which made them more likely to compete with their partner and cooperate with the authorities. In contrast, participants with a proself social value orientation always acted competitively.
Models of Mediated Moderation
There are five possible models of mediated moderation, as illustrated in the diagrams below.[]
1. In the first model, the independent variable also mediates the relationship between the moderator and the dependent variable.
2. The second possible model of mediated moderation involves a new variable that mediates the relationship between the independent variable and the moderator (the A path).
3. The third model of mediated moderation involves a new mediator variable that mediates the relationship between the moderator and the dependent variable (the B path).
4. Mediated moderation can also occur when one mediating variable affects both the relationship between the independent variable and the moderator (the A path) and the relationship between the moderator and the dependent variable (the B path).
5. The fifth and final possible model of mediated moderation involves two new mediator variables, one mediating the A path and the other mediating the B path.
Fourth option: a fourth variable mediates both the A path and the B path. Fifth option: a fourth variable mediates the A path and a fifth variable mediates the B path.
Step 1: Overall moderation of the treatment effect.
To establish overall moderation, the β43 regression weight must be significant (the first step for establishing mediated moderation). Establishing moderated mediation instead requires that there be no overall moderation effect, so the β43 regression weight must not be significant.
Step 2: Moderation of the relationship between the independent variable and the mediator (path A).
If the β53 regression weight is significant, the moderator affects the relationship between the independent variable and the mediator.
Step 3: Moderation of both the relationship between the independent variable and the mediator (path A) and the relationship between the mediator and the dependent variable (path B).
If both β53 in step 2 and β64 in step 3 are significant, the moderator affects the relationship between the independent variable and the mediator (path A). If both β51 in step 2 and β65 in step 3 are significant, the moderator affects the relationship between the mediator and the dependent variable (path B). Either or both of the conditions above may be true.
References
Notes
[1] MacKinnon, D. P. (2008). Introduction to Statistical Mediation Analysis. New York: Erlbaum.
[2] Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Erlbaum.
[3] Baron, R. M., & Kenny, D. A. (1986). "The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations", Journal of Personality and Social Psychology, 51(6), 1173–1182.
[4] Howell, D. C. (2009). Statistical Methods for Psychology (7th ed.). Belmont, CA: Cengage Learning.
[5] Shrout, P. E., & Bolger, N. (2002). Mediation in experimental and nonexperimental studies: New procedures and recommendations. Psychological Methods, 7(4), 422–445.
[6] Pearl, J. (2001). "Direct and indirect effects" (http://ftp.cs.ucla.edu/pub/stat_ser/R273-U.pdf). Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 411–420.
[7] Kaufman, J. S., MacLehose, R. F., & Kaufman, S. (2004). A further critique of the analytic strategy of adjusting for covariates to identify biologic mediation. Epidemiologic Perspectives & Innovations, 1:4.
[11] "Mediation" (http://davidakenny.net/cm/mediate.htm). davidakenny.net. Retrieved April 25, 2012.
[12] Bullock, J. G., Green, D. P., & Ha, S. E. (2010). Yes, but what's the mechanism? (Don't expect an easy answer). Journal of Personality & Social Psychology, 98(4), 550–558.
[13] Spencer, S. J., Zanna, M. P., & Fong, G. T. (2005). Establishing a causal chain: Why experiments are often more effective than mediational analyses in examining psychological processes. Attitudes and Social Cognition, 89(6), 845–851.
[14] Preacher, K. J., Rucker, D. D., & Hayes, A. F. (2007). Assessing moderated mediation hypotheses: Strategies, methods, and prescriptions. Multivariate Behavioral Research, 42, 185–227.
Bibliography
Preacher, Kristopher J.; Hayes, Andrew F. (2004). "SPSS and SAS procedures for estimating indirect effects in simple mediation models" (http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html). Behavior Research Methods, Instruments, and Computers 36 (4): 717–731. doi:10.3758/BF03206553.
Preacher, Kristopher J.; Hayes, Andrew F. (2008). "Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models" (http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html). Behavior Research Methods 40 (3): 879–891. doi:10.3758/BRM.40.3.879. PMID 18697684.
Preacher, K. J.; Zyphur, M. J.; Zhang, Z. (2010). "A general multilevel SEM framework for assessing multilevel mediation". Psychological Methods 15 (3): 209–233. doi:10.1037/a0020141. PMID 20822249.
Baron, R. M.; Kenny, D. A. (1986). "The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations". Journal of Personality and Social Psychology 51 (6): 1173–1182.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). New York, NY: Academic Press.
Hayes, A. F. (2009). "Beyond Baron and Kenny: Statistical mediation analysis in the new millennium" (http://www.informaworld.com/smpp/ftinterface~db=all~content=a917285720~fulltext=713240930). Communication Monographs 76 (4): 408–420. doi:10.1080/03637750903310360.
Howell, D. C. (2009). Statistical Methods for Psychology (7th ed.). Belmont, CA: Cengage Learning.
MacKinnon, D. P.; Lockwood, C. M. (2003). "Advances in statistical methods for substance abuse prevention research" (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2843515). Prevention Science 4 (3): 155–171. doi:10.1023/A:1024649822872. PMC 2843515. PMID 12940467.
Preacher, K. J.; Kelley, K. (2011). "Effect size measures for mediation models: Quantitative strategies for communicating indirect effects". Psychological Methods 16 (2): 93–115. doi:10.1037/a0022658. PMID 21500915.
Rucker, D. D.; Preacher, K. J.; Tormala, Z. L.; Petty, R. E. (2011). "Mediation analysis in social psychology: Current practices and new recommendations". Social and Personality Psychology Compass 5/6: 359–371.
Sobel, M. E. (1982). "Asymptotic confidence intervals for indirect effects in structural equation models". Sociological Methodology 13: 290–312. doi:10.2307/270723.
Spencer, S. J.; Zanna, M. P.; Fong, G. T. (2005). "Establishing a causal chain: Why experiments are often more effective than mediational analyses in examining psychological processes". Attitudes and Social Cognition 89 (6): 845–851.
External links
Summary of mediation methods at PsychWiki (http://www.psychwiki.com/wiki/Mediation)
Example of Causal Mediation Using Propensity Scores (http://methodology.psu.edu/ra/causal/example), The Methodology Center, Penn State University
SPSS and SAS macros for observed variable moderation, mediation, and conditional process modeling (http://www.afhayes.com/introduction-to-mediation-moderation-and-conditional-process-analysis.html), Andrew F. Hayes, Ohio State University
Mental age
Mental age is a concept relating to intelligence, expressed as the age at which a child is performing intellectually: the mental age of a tested child is the average age at which normal children achieve that child's score.[1] However, a mental age result on an intelligence test does not mean that children function at their "mental age level" in all aspects of life. For instance, a gifted six-year-old child can still in some ways function as a three-year-old child.[2] Mental age was once considered a controversial concept.[3]
References
[1] http://www.apa.org/research/action/glossary.aspx#m
[2] L. K. Silverman (1997). The construct of asynchronous development. Peabody Journal of Education, Vol. 72, Issue 3/4.
[3] Thurstone, L. L. (1926). The Mental Age Concept (http://www.brocku.ca/MeadProject/Thurstone/Thurstone_1926.html). Psychological Review, 33, 268–278.
[4] http://users.ipfw.edu/abbott/120/IntelligenceTests.html
Mental chronometry
Mental chronometry is the use of response time in perceptual-motor tasks to infer the content, duration, and temporal sequencing of cognitive operations. Mental chronometry is one of the core paradigms of experimental and cognitive psychology, and has found application in various disciplines including cognitive psychophysiology/cognitive neuroscience and behavioral neuroscience to elucidate mechanisms underlying cognitive processing. Mental chronometry is studied using measurements of reaction time (RT). Reaction time is the elapsed time between the presentation of a sensory stimulus and the subsequent behavioral response. In psychometric psychology it is considered to be an index of speed of processing.[1] That is, it indicates how fast the thinker can execute the mental operations needed by the task at hand. In turn, speed of processing is considered an index of processing efficiency. The behavioral response is typically a button press but can also be an eye movement, a vocal response, or some other observable behavior.
Types
Response time is the sum of reaction time plus movement time. Usually the focus in research is on reaction time. There are four basic means of measuring it:
Simple reaction time is the time required for an observer to respond to the presence of a stimulus. For example, a subject might be asked to press a button as soon as a light or sound appears. Mean RT for college-age individuals is about 160 milliseconds to detect an auditory stimulus, and approximately 190 milliseconds to detect a visual stimulus.[2] The mean reaction times for sprinters at the Beijing Olympics were 166 ms for males and 189 ms for females, but in one out of 1,000 starts they can achieve 109 ms and 121 ms, respectively.[3] Interestingly, that study concluded that the longer female reaction times are an artifact of the measurement method used; a suitable lowering of the force threshold on the starting blocks for women would eliminate the sex difference.
Recognition or Go/No-Go reaction time tasks require that the subject press a button when one stimulus type appears and withhold a response when another stimulus type appears. For example, the subject may have to press the button when a green light appears and not respond when a blue light appears.
Choice reaction time (CRT) tasks require distinct responses for each possible class of stimulus. For example, the subject might be asked to press one button if a red light appears and a different button if a yellow light appears. The Jensen box is an example of an instrument designed to measure choice reaction time.
Discrimination reaction time involves comparing pairs of simultaneously presented visual displays and then pressing one of two buttons according to which display appears brighter, longer, heavier, or greater in magnitude on some dimension of interest.
Due to momentary attentional lapses, there is a considerable amount of variability in an individual's response time, which does not tend to follow a normal (Gaussian) distribution. To control for this, researchers typically require a subject to perform multiple trials, from which a measure of the 'typical' response time can be calculated. Taking the mean of the raw response times is rarely an effective method of characterizing the typical response time, and alternative approaches (such as modeling the entire response time distribution) are often more appropriate, as illustrated in the sketch below.[4]
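Because RT distributions are right-skewed, the mean is pulled upward by slow responses. A minimal sketch comparing summaries and fitting a full distribution, assuming SciPy is available; the ex-Gaussian (exponentially modified normal, scipy.stats.exponnorm) is one commonly used model, and the simulated values are illustrative:

```python
import numpy as np
from scipy.stats import exponnorm

rng = np.random.default_rng(0)
# Simulated RTs (seconds): a normal component plus an exponential
# tail, mimicking the right skew of real response times.
rts = rng.normal(0.30, 0.03, 1000) + rng.exponential(0.10, 1000)

print("mean:  ", rts.mean())      # pulled upward by the slow tail
print("median:", np.median(rts))  # more robust to skew

# Fit the full distribution instead of a single summary statistic.
K, loc, scale = exponnorm.fit(rts)
print("ex-Gaussian fit: mu=%.3f sigma=%.3f tau=%.3f"
      % (loc, scale, K * scale))  # tau = K * scale is the tail mean
```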
Donders' experiment
The first scientist to measure reaction time in the laboratory was Franciscus Donders (1869). Donders found that simple reaction time is shorter than recognition reaction time, and that choice reaction time is longer than both.[2] Donders also devised a subtraction method to analyze the time it took for mental operations to take place.[6] By subtracting simple reaction time from choice reaction time, for example, it is possible to calculate how much time is needed to make the connection. This method provides a way to investigate the cognitive processes underlying simple perceptual-motor tasks, and formed the basis of subsequent developments.[6]
Donders (1868): method of subtraction. Picture from the Historical Introduction to Cognitive Psychology webpage.
Although Donders' work paved the way for future research in mental chronometry tests, it was not without its drawbacks. His insertion method was based on the assumption that inserting a particular complicating requirement into an RT paradigm would not affect the other components of the test. This assumption, that the incremental effect on RT was strictly additive, was not able to hold up to later experimental tests, which showed that the insertions were able to interact with other portions of the RT paradigm. Despite this, Donders' theories are still of interest and his ideas are still used in certain areas of psychology, which now have the statistical tools to use them more accurately.[1]
Hick's Law
W. E. Hick (1952) devised a CRT experiment which presented a series of nine tests in which there are n equally possible choices. The experiment measured the subject's reaction time based on the number of possible choices during any given trial. Hick showed that the individual's reaction time increased by a constant amount as a function of available choices, or the "uncertainty" involved in which reaction stimulus would appear next. Uncertainty is measured in "bits", which are defined as the quantity of information that reduces uncertainty by half in information theory. In Hick's experiment, the reaction time is found to be a function of the binary logarithm of the number of available choices (n). This phenomenon is called "Hick's Law" and is said to be a measure of the "rate of gain of information". The law is usually expressed by the formula $RT = a + b \log_2(n)$, where $a$ and $b$ are constants representing the intercept and slope of the function, and $n$ is the number of alternatives.[7] The Jensen Box is a more recent application of Hick's Law.[1] Hick's Law has interesting modern applications in marketing, where restaurant menus and web interfaces (among other things) take advantage of its principles in striving to achieve speed and ease of use for the consumer.[8]
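The linear-in-bits form makes the law straightforward to fit by ordinary least squares on log2(n). A minimal sketch, with invented mean RTs for illustration:

```python
import numpy as np

# Simulated mean RTs (ms) for choice tasks with n alternatives,
# roughly following Hick's Law: RT = a + b * log2(n).
n_choices = np.array([1, 2, 4, 8])
mean_rt = np.array([190, 250, 310, 370])  # illustrative values

bits = np.log2(n_choices)
b, a = np.polyfit(bits, mean_rt, 1)  # slope (ms per bit) and intercept
print(f"intercept a = {a:.0f} ms, slope b = {b:.0f} ms per bit")

# Predicted RT for a 16-alternative task under the fitted law:
print("predicted RT for n=16:", a + b * np.log2(16), "ms")
```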
Sentence-picture verification
Mental chronometry has been used in identifying some of the processes associated with understanding a sentence. This type of research typically revolves around the differences in processing four types of sentences: true affirmative (TA), false affirmative (FA), false negative (FN), and true negative (TN). A picture can be presented with an associated sentence that falls into one of these four categories. The subject then decides whether the sentence matches the picture or does not. The type of sentence determines how many processes need to be performed before a decision can be made. According to the data from Clark and Chase (1972) and Just and Carpenter (1971), TA sentences are the simplest and take the least time, followed by FA, FN, and TN sentences.[13][14]
Other factors
Research has shown that reaction times may be improved by chewing gum: "The results showed that chewing gum was associated with greater alertness and a more positive mood. Reaction times were quicker in the gum condition, and this effect became bigger as the task became more difficult."[23]
Regions of the Brain Involved in a Number Comparison Task Derived from EEG and fMRI Studies. The regions represented correspond to those showing effects of notation used for the numbers (pink and hatched), distance from the test number (orange), choice of hand (red), and errors (purple). Picture from the article: Timing the Brain: Mental Chronometry as a Tool in Neuroscience.
In the 1960s, these methods were used extensively in humans: researchers recorded the electrical potentials in the human brain using scalp electrodes while reaction tasks were conducted using digital computers. They found a connection between the observed electrical potentials and the motor and sensory stages of information processing. For example, researchers found in the recorded scalp potentials that the frontal cortex was being activated in association with motor activity. These findings can be connected to Donders' idea of the subtractive method for the sensory and motor stages involved in reaction tasks. In the 1970s and early 1980s, the development of signal-processing tools for EEG translated into a revival of research using this technique to assess the timing and the speed of mental processes. For example, high-profile research showed how reaction time on a given trial correlated with the latency (delay between stimulus and response) of the P300 wave,[25] or how the time course of the EEG reflected the sequence of cognitive processes involved in perceptual processing.[26] Electrical event-related potentials were also used together with functional magnetic resonance imaging (fMRI) in a study in which subjects were asked to identify whether a presented digit was above or below five. According to Sternberg's additive theory, the stages involved in performing this task include encoding, comparing against the stored representation for five, selecting a response, and then checking for error in the response.[27] The fMRI image presents the specific locations where these stages occur in the brain while this simple mental chronometry task is performed. In the 1980s, neuroimaging experiments allowed researchers to detect the activity in localized brain areas by injecting radionuclides and using positron emission tomography (PET) to detect them. fMRI has also been used to detect the precise brain areas that are active during mental chronometry tasks. Many studies have shown that a small number of widely distributed brain areas are involved in performing these cognitive tasks.
References
[1] Jensen, A. R. (2006). Clocking the Mind: Mental Chronometry and Individual Differences. Amsterdam: Elsevier. (ISBN 978-0-08-044939-5)
[2] Kosinski, R. J. (2008). A literature review on reaction time, Clemson University. (http://biae.clemson.edu/bpc/bp/Lab/110/reaction.htm)
[4] Whelan, R. (2008). Effective analysis of reaction time data. The Psychological Record, 58, 475–482. (http://opensiuc.lib.siu.edu/cgi/viewcontent.cgi?article=1077&context=tpr)
[6] Donders, F. C. (1869). On the speed of mental processes. In W. G. Koster (Ed.), Attention and Performance II. Acta Psychologica, 30, 412–431. (Original work published in 1868.)
[7] Hick's Law at Encyclopedia.com (http://www.encyclopedia.com/doc/1O87-Hickslaw.html). Originally from Colman, A. (2001). A Dictionary of Psychology. Retrieved February 28, 2009.
[8] Lidwell, W., Holden, K., & Butler, J. (2003). Universal Principles of Design. Rockport, Gloucester, MA.
[12] Cooper, L. A., & Shepard, R. N. (1973). Chronometric studies of the rotation of mental images. New York: Academic Press.
[17] Posner, M. I. (1978). Chronometric Explorations of Mind. Hillsdale, NJ: Erlbaum.
[21] Demetriou, A., Mouyi, A., & Spanoudis, G. (2010). The development of mental processing. In W. F. Overton (Ed.), Biology, Cognition and Methods Across the Life-Span. Volume 1 of the Handbook of Life-Span Development (pp. 36–55), Editor-in-chief: R. M. Lerner. Hoboken, NJ: Wiley.
[23] Smith, A. (2009). Effects of chewing gum on mood, learning, memory and performance of an intelligence test. Nutritional Neuroscience, 12(2), 81.
Further reading
Luce, R. D. (1986). Response Times: Their Role in Inferring Elementary Mental Organization. New York: Oxford University Press. ISBN 0-19-503642-5.
Meyer, D. E.; Osman, A. M.; Irwin, D. E.; Yantis, S. (1988). "Modern mental chronometry". Biological Psychology 26 (1–3): 3–67. doi:10.1016/0301-0511(88)90013-0. PMID 3061480.
Townsend, J. T.; Ashby, F. G. (1984). Stochastic Modeling of Elementary Psychological Processes. Cambridge, UK: Cambridge University Press. ISBN 0-521-27433-8.
Weiss, V.; Weiss, H. (2003). "The golden mean as clock cycle of brain waves" (http://www.v-weiss.de/chaos.html). Chaos, Solitons and Fractals 18 (4): 643–652. doi:10.1016/S0960-0779(03)00026-2.
External links
Reaction Time Test (http://www.humanbenchmark.com/tests/reactiontime/index.php): Measuring Mental Chronometry on the Web
Historical Introduction to Cognitive Psychology (http://www.mtsu.edu/~sschmidt/Cognitive/intro/intro.html)
Timing the Brain: Mental Chronometry as a Tool in Neuroscience (http://biology.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pbio.0030051)
Sample Chronometric Test on the web (http://cognitivelabs.com/mydna_speedtestno.htm)
Moderated mediation
In statistics, moderation and mediation can occur together in the same model.[1] Moderated mediation, also known as conditional indirect effects,[2] occurs when the treatment effect of an independent variable A on an outcome variable C via a mediator variable B differs depending on levels of a moderator variable D. Specifically, either the effect of A on B, and/or the effect of B on C, depends on the level of D.
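One way to make the definition concrete: if the A → B path is allowed to depend linearly on D (an A×D interaction in the mediator model), the conditional indirect effect of A on C at a given level of D is (a1 + a3·D)·b. A minimal sketch on simulated data, assuming statsmodels and moderation of the A → B path only; all names and coefficients are illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
a_var = rng.normal(size=n)   # treatment A
d = rng.normal(size=n)       # moderator D
b_var = (0.3 + 0.4 * d) * a_var + rng.normal(size=n)  # A -> B depends on D
c_var = 0.5 * b_var + 0.1 * a_var + rng.normal(size=n)

# Mediator model with an A x D interaction
Xm = sm.add_constant(np.column_stack([a_var, d, a_var * d]))
med = sm.OLS(b_var, Xm).fit()
a1, a3 = med.params[1], med.params[3]

# Outcome model: C on B and A
Xy = sm.add_constant(np.column_stack([b_var, a_var]))
b = sm.OLS(c_var, Xy).fit().params[1]

# Conditional indirect effect of A on C at low/mean/high levels of D
for level in (-1.0, 0.0, 1.0):
    print(f"D = {level:+.0f}: indirect effect = {(a1 + a3 * level) * b:.3f}")
```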
References
[1] Muller, D., Judd, C. M., & Yzerbyt, V. Y. (2005). When moderation is mediated and mediation is moderated. Journal of Personality and Social Psychology, 89, 852–863.
[2] Preacher, K. J., Rucker, D. D., & Hayes, A. F. (2007). Addressing moderated mediation hypotheses: Theory, methods, and prescriptions. Multivariate Behavioral Research, 42, 185–227.
[3] Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
External links
SPSS and SAS macros for testing conditional indirect effects (http://www.afhayes.com/spss-sas-and-mplus-macros-and-code.html)
Moderation (statistics)
In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator.[] The effect of a moderating variable is characterized statistically as an interaction;[] that is, a qualitative (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation.[1]
Example
Moderation analysis in the behavioral sciences involves the use of linear multiple regression analysis or causal modelling.[] To quantify the effect of a moderating variable in multiple regression analyses, regressing random variable Y on X, an additional term is added to the model. This term is the interaction between X and the proposed moderating variable.[] Thus, for a response $Y$, a predictor $x_1$, and a moderating variable $x_2$:

$Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 (x_1 \times x_2) + \varepsilon$

In this case, the role of $x_2$ as a moderating variable is accomplished by evaluating $b_3$, the parameter estimate for the interaction term.[] See linear regression for discussion of statistical evaluation of parameter estimates in regression analyses.
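In practice the interaction term is usually added through a formula interface and b3 is read off the fitted model. A minimal sketch with statsmodels on simulated data; the variable names and effect sizes are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
# True model: the effect of x1 on y grows with x2 (moderation).
df["y"] = (1.0 + 0.5 * df.x1 + 0.2 * df.x2
           + 0.7 * df.x1 * df.x2 + rng.normal(size=n))

# 'x1 * x2' expands to x1 + x2 + x1:x2; the x1:x2 coefficient is b3.
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)            # the x1:x2 estimate should be near 0.7
print(fit.pvalues["x1:x2"])  # significance of the moderation effect
```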
To probe whether there is any significant difference between European American and East Asian participants in the experimental condition, we can simply run the analysis with the condition variable reverse-coded (0 = experimental, 1 = control), so that the coefficient for ethnicity represents the ethnicity effect on Y in the experimental condition. In a similar vein, if we want to see whether the treatment has an effect for East Asian participants, we can reverse-code the ethnicity variable (0 = East Asians, 1 = European Americans).
References
[1] Baron, R. M., & Kenny, D. A. (1986). "The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations", Journal of Personality and Social Psychology, 51(6), 1173–1182 (page 1174).
Hayes, A. F., & Matthes, J. (2009). "Computational procedures for probing interactions in OLS and logistic regression: SPSS and SAS implementations." Behavior Research Methods, Vol. 41, pp. 924–936.
Multidimensional scaling
Multidimensional scaling (MDS) is a set of related statistical techniques often used in information visualization for exploring similarities or dissimilarities in data. MDS is a special case of ordination. An MDS algorithm starts with a matrix of item-item similarities, then assigns a location to each item in N-dimensional space, where N is specified a priori. For sufficiently small N, the resulting locations may be displayed in a graph or visualized with 2D techniques such as scatter plots.
Types
MDS algorithms fall into a taxonomy, depending on the meaning of the input matrix:
Classical multidimensional scaling. Also known as Principal Coordinates Analysis, Torgerson Scaling or Torgerson–Gower scaling. Takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain.[]
Metric multidimensional scaling. A superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and input matrices of known distances with weights and so on. A useful loss function in this context is called stress, which is often minimized using a procedure called stress majorization.
Non-metric multidimensional scaling. Louis Guttman's smallest space analysis (SSA) is an example of a non-metric MDS procedure. In contrast to metric MDS, non-metric MDS finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space. The relationship is typically found using isotonic regression.
Generalized multidimensional scaling. An extension of metric multidimensional scaling, in which the target space is an arbitrary smooth non-Euclidean space. In cases where the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the minimum-distortion embedding of one surface into another.[]
Details
The data to be analyzed is a collection of $I$ objects (colors, faces, stocks, ...) on which a distance function is defined: $\delta_{i,j} :=$ the distance between the $i$-th and $j$-th objects. These distances are the entries of the dissimilarity matrix

$\Delta := \begin{pmatrix} \delta_{1,1} & \delta_{1,2} & \cdots & \delta_{1,I} \\ \delta_{2,1} & \delta_{2,2} & \cdots & \delta_{2,I} \\ \vdots & \vdots & & \vdots \\ \delta_{I,1} & \delta_{I,2} & \cdots & \delta_{I,I} \end{pmatrix}.$

The goal of MDS is, given $\Delta$, to find $I$ vectors $x_1, \ldots, x_I \in \mathbb{R}^N$ such that $\|x_i - x_j\| \approx \delta_{i,j}$ for all $i, j$, where $\|\cdot\|$ is a vector norm. In classical MDS, this norm is the Euclidean distance, but, in a broader sense, it may be a metric or arbitrary distance function.[1] In other words, MDS attempts to find an embedding from the $I$ objects into $\mathbb{R}^N$ such that distances are preserved. If the dimension $N$ is chosen to be 2 or 3, we may plot the vectors $x_i$ to obtain a visualization of the similarities between the $I$ objects. Note that the vectors $x_i$ are not unique: with the Euclidean distance, they may be arbitrarily translated, rotated, and reflected, since these transformations do not change the pairwise distances $\|x_i - x_j\|$. There are various approaches to determining the vectors $x_i$. Usually, MDS is formulated as an optimization problem, where $(x_1, \ldots, x_I)$ is found as a minimizer of some cost function, for example,

$\min_{x_1, \ldots, x_I} \sum_{i < j} \left( \|x_i - x_j\| - \delta_{i,j} \right)^2.$

A solution may then be found by numerical optimization techniques. For some particularly chosen cost functions, minimizers can be stated analytically in terms of matrix eigendecompositions.[citation needed]
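For classical MDS with Euclidean distances, the analytic solution is the Torgerson procedure: double-center the squared dissimilarities and take the top eigenvectors. A minimal NumPy sketch of that closed form, offered as a teaching illustration rather than a production implementation:

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS via eigendecomposition.

    D: (I x I) matrix of pairwise dissimilarities.
    Returns an (I x n_components) coordinate matrix whose Euclidean
    distances approximate D.
    """
    I = D.shape[0]
    J = np.eye(I) - np.ones((I, I)) / I   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(B)  # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    L = np.sqrt(np.maximum(eigvals[idx], 0))  # guard tiny negatives
    return eigvecs[:, idx] * L

# Four points on a line; the recovered coordinates preserve the spacing.
pts = [0.0, 1.0, 3.0, 6.0]
D = np.abs(np.subtract.outer(pts, pts))
X = classical_mds(D, n_components=1)
print(X.ravel())  # spacing matches 0, 1, 3, 6 up to shift/reflection
```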
Procedure
There are several steps in conducting MDS research:
1. Formulating the problem: What variables do you want to compare? How many variables do you want to compare? More than 20 is often considered cumbersome.[citation needed] Fewer than 8 (4 pairs) will not give valid results.[citation needed] What purpose is the study to be used for?
2. Obtaining input data: Respondents are asked a series of questions. For each product pair, they are asked to rate similarity (usually on a 7-point Likert scale from very similar to very dissimilar). The first question could be for Coke/Pepsi, for example, the next for Coke/Hires rootbeer, the next for Pepsi/Dr Pepper, the next for Dr Pepper/Hires rootbeer, etc. The number of questions is a function of the number of brands and can be calculated as $Q = N(N-1)/2$, where Q is the number of questions and N is the number of brands. This approach is referred to as the "Perception data: direct approach". There are two other approaches. There is the "Perception data: derived approach", in which products are decomposed into attributes that are rated on a semantic differential scale. The other is the "Preference data approach", in which respondents are asked their preference rather than similarity.
3. Running the MDS statistical program: Software for running the procedure is available in many statistical packages. Often there is a choice between Metric MDS (which deals with interval- or ratio-level data) and Nonmetric MDS (which deals with ordinal data).
4. Deciding the number of dimensions: The researcher must decide on the number of dimensions they want the computer to create. The more dimensions, the better the statistical fit, but the more difficult it is to interpret the results.
5. Mapping the results and defining the dimensions: The statistical program (or a related module) will map the results. The map will plot each product (usually in two-dimensional space). The proximity of products to each other indicates either how similar they are or how preferred they are, depending on which approach was used. How the dimensions of the embedding actually correspond to dimensions of system behavior, however, is not necessarily obvious. Here, a subjective judgment about the correspondence can be made (see perceptual mapping).
6. Testing the results for reliability and validity: Compute R-squared to determine what proportion of variance of the scaled data can be accounted for by the MDS procedure. An R-squared of 0.6 is considered the minimum acceptable level.[citation needed] An R-squared of 0.8 is considered good for metric scaling and 0.9 is considered good for non-metric scaling. Other possible tests are Kruskal's stress, split-data tests, data stability tests (i.e., eliminating one brand), and test-retest reliability.
7. Reporting the results comprehensively: Along with the mapping, at least the distance measure (e.g., Sorenson index, Jaccard index) and reliability (e.g., stress value) should be given. It is also very advisable to give the algorithm (e.g., Kruskal, Mather), which is often defined by the program used (sometimes replacing the algorithm report), whether a start configuration was given or chosen at random, the number of runs, the assessment of dimensionality, the Monte Carlo method results, the number of iterations, the assessment of stability, and the proportional variance of each axis (r-square).
Applications
Applications include scientific visualisation and data mining in fields such as cognitive science, information science, psychophysics, psychometrics, marketing and ecology. New applications arise in the scope of autonomous wireless nodes that populate a space or an area; MDS may serve as a real-time approach to monitoring and managing such populations. Furthermore, MDS has been used extensively in geostatistics for modeling the spatial variability of the patterns of an image, by representing them as points in a lower-dimensional space.[2]
Marketing
In marketing, MDS is a statistical technique for taking the preferences and perceptions of respondents and representing them on visual grids, called perceptual maps.
Implementations
cmdscale in R
NMS in PC-ORD, Multivariate Analysis of Ecological Data [3]
Orange, a free data mining software suite, module orngMDS [4]
ViSta [5] has implementations of MDS by Forrest W. Young. Interactive graphics allow exploring the results of MDS in detail.
usabiliTEST's Online Card Sorting [6] software uses MDS to plot the data collected from the participants of usability tests.
Bibliography
[1] Kruskal, J. B., & Wish, M. (1978). Multidimensional Scaling. Sage University Paper series on Quantitative Applications in the Social Sciences, 07-011. Beverly Hills and London: Sage Publications.
[2] Honarkhah, M., & Caers, J. (2010). Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling (http://dx.doi.org/10.1007/s11004-010-9276-7), Mathematical Geosciences, 42: 487–517.
[3] http://www.pcord.com
[4] http://www.ailab.si/orange/doc/modules/orngMDS.htm
[5] http://www.uv.es/visualstats/Book
[6] http://www.usabilitest.com/CardSorting
Cox, T. F., & Cox, M. A. A. (2001). Multidimensional Scaling. Chapman and Hall.
Coxon, Anthony P. M. (1982). The User's Guide to Multidimensional Scaling. With special reference to the MDS(X) library of Computer Programs. London: Heinemann Educational Books.
Green, P. (January 1975). "Marketing applications of MDS: Assessment and outlook". Journal of Marketing 39 (1): 24–31. doi:10.2307/1250799.
McCune, B., & Grace, J. B. (2002). Analysis of Ecological Communities. Gleneden Beach, Oregon: MjM Software Design. ISBN 0-9721290-0-6.
Torgerson, Warren S. (1958). Theory & Methods of Scaling. New York: Wiley. ISBN 0-89874-722-8.
External links
An elementary introduction to multidimensional scaling (http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm)
NewMDSX: Multidimensional Scaling Software (http://www.newmdsx.com/)
MDS page (http://www.granular.com/MDS/)
MDS in C++ (http://codingplayground.blogspot.com/2009/05/multidimension-scaling.html) by Antonio Gulli
The orngMDS module (http://orange.biolab.si/doc/modules/orngMDS.htm) for MDS from Orange (software)
Multiple mini interview
Introduction
Interviews have been used widely for different purposes, including assessment and recruitment. Candidate assessment is normally deemed successful when the scores generated by the measuring tool predict future outcomes of interest, such as job performance or job retention. Meta-analysis of the human resource literature has demonstrated a low to moderate ability of interviews to predict future job performance.[2] How well a candidate scores on one interview is only somewhat correlated with how well that candidate scores on the next interview. Marked shifts in scores are buffered when collecting many scores on the same candidate, with a greater buffering effect provided by multiple interviews than by multiple interviewers acting as a panel for one interview.[3] The score assigned by an interviewer in the first few minutes of an interview is rarely changed significantly over the course of the rest of the interview, an effect known as the halo effect. Therefore, even very short interviews within an MMI format provide similar ability to differentiate reproducibly between candidates.[4] The ability to reproducibly differentiate between candidates, also known as overall test reliability, is markedly higher for the MMI than for other interview formats.[1] This has translated into higher predictive validity, the MMI correlating with future performance much more highly than standard interviews.[5][6][7][8]
History
Aiming to enhance predictive correlations with future performance in medical school, post-graduate medical training, and future performance in practice, McMaster University began research and development of the MMI in 2001. The initial pilot was conducted on 18 graduate students volunteering as medical school candidates. High overall test reliability (0.81) led to a larger study conducted in 2002 on real medical school candidates, many of whom volunteered after their standard interview to stay for the MMI. Overall test reliability remained high,[1] and subsequent follow-up through medical school and on to the national licensure examination (Medical Council of Canada [9] Qualifying Examination Parts I and II) revealed the MMI to be the best predictor of subsequent clinical performance,[5][7] professionalism,[6] and ability to communicate with patients and successfully obtain national licensure.[7][8] Since its formal inception at the Michael G. DeGroote School of Medicine at McMaster University in 2004, the MMI has spread as an admissions test across medical schools, and to other disciplines. By 2008, the MMI was being used as an admissions test by the majority of medical schools in Canada, Australia and Israel, as well as by other medical schools in the United States and Brunei. This success led to the development of a McMaster spin-off company, APT Inc., to commercialize the MMI system. The MMI was branded as ProFitHR [10] and made available to both the academic and corporate sectors.[11] By 2009, the list of other disciplines using the MMI included schools for dentistry, pharmacy, midwifery, physiotherapy and occupational therapy, veterinary medicine, ultrasound technology, nuclear medicine technology, X-ray technology, medical laboratory technology, chiropody, dental hygiene, and postgraduate training programs in dentistry and medicine.
MMI Procedure
1. Interview stations: The domain(s) being assessed at any one station are variable, and normally reflect the objectives of the selecting institution. Examples of domains include the soft skills (ethics, professionalism, interpersonal relationships, the ability to manage, communicate, and collaborate) as well as performance of a task. An MMI interview station takes considerable time and effort to produce; it is composed of several parts, including the stem question, probing questions for the interviewer, and a scoring sheet.
2. Circuit(s) of stations: To reduce costs of the MMI significantly below those of most interviews,[12] the interview stations are kept short (eight minutes or less) and are conducted simultaneously in a circuit as a bell-ringer examination. The preferred number of stations depends to some extent on the characteristics of the candidate group being interviewed, though nine interviews per candidate represents a reasonable minimum.[3] The circuit of interview stations should be within sufficiently close quarters to allow candidates to move from interview room to interview room. Multiple parallel circuits can be run, each circuit with the same set of interview stations, depending upon physical plant limitations.
3. Interviewers: One interviewer per interview station is sufficient.[3] In a typical MMI, each interviewer stays in the same interview throughout, as candidates rotate through. The interviewer thus scores each candidate based upon the same interview scenario throughout the course of the test.
4. Candidates: Each candidate rotates through the circuit of interviews. For example, if each interview station is eight minutes, and there are nine interview stations, it will take the nine candidates being assessed on that circuit 72 minutes to complete the MMI. Each of the candidates begins at a different interview station, rotating to the next interview station at the ringing of the bell.
5. Administrators: Each circuit requires at least one administrator to ensure that the MMI is conducted fairly and on time.
References
[1] Eva KW, Reiter HI, Rosenfeld J, Norman GR. An admissions OSCE: the multiple mini-interview. Medical Education, 38: 314–326 (2004).
[2] Barrick MR, Mount MK. The Big 5 personality dimensions and job performance: a meta-analysis. Personnel Psychology 1991, 44: 1–26.
[3] Eva KW, Reiter HI, Rosenfeld J, Norman GR. The relationship between interviewer characteristics and ratings assigned during a Multiple Mini-Interview. Academic Medicine, 2004 Jun; 79(6): 602–609.
[4] Dodson M, Crotty B, Prideaux D, Carne R, Ward A, de Leeuw E. The multiple mini-interview: how long is long enough? Med Educ. 2009 Feb; 43(2): 168–174.
[5] Eva KW, Reiter HI, Rosenfeld J, Norman GR. The ability of the Multiple Mini-Interview to predict pre-clerkship performance in medical school. Academic Medicine, 2004 Oct; 79(10 Suppl): S40–42.
[6] Reiter HI, Eva KW, Rosenfeld J, Norman GR. Multiple Mini-Interview predicts for clinical clerkship performance, national licensure examination performance. Med Educ. 2007 Apr; 41(4): 378–384.
[7] Eva KW, Reiter HI, Trinh K, Wasi P, Rosenfeld J, Norman GR. Predictive validity of the multiple mini-interview for selecting medical trainees. Accepted for publication January 2009 in Medical Education.
[8] Hofmeister M, Lockyer J, Crutcher R. The multiple mini-interview for selection of international medical graduates into family medicine residency education. Med Educ. 2009 Jun; 43(6): 573–579.
[9] http://www.mcc.ca/
[10] http://www.profithr.com/
[11] www.ProFitHR.com
[12] Rosenfeld J, Eva KW, Reiter HI, Trinh K. A cost-efficiency comparison between the Multiple Mini-Interview and panel-based admissions interviews. Adv Health Sci Educ Theory Pract. 2008 Mar; 13(1): 43–58.
Multistage testing
Multistage testing is an algorithm-based approach to administering tests. It is very similar to computer-adaptive testing in that items are interactively selected for each examinee by the algorithm, but rather than selecting individual items, groups of items are selected, building the test in stages. These groups are called testlets or panels.[1] While multistage tests could theoretically be administered by a human, the extensive computations required (often using item response theory) mean that multistage tests are administered by computer. The number of stages or testlets can vary. If the testlets are relatively small, such as five items, ten or more could easily be used in a test. Some multistage tests are designed with the minimum of two stages (one stage would be a conventional fixed-form test).[2] In response to the increasing use of multistage testing, the scholarly journal Applied Measurement in Education published a special edition on the topic in 2006.[3]
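The routing logic can be made concrete with a toy two-stage sketch: an examinee's score on a routing testlet selects an easier or harder second-stage testlet. This is a simplified illustration of the idea only; the thresholds, difficulties, and scoring rule below are invented, and a real system would score and route using item response theory:

```python
from dataclasses import dataclass

@dataclass
class Testlet:
    name: str
    difficulty: float  # average item difficulty (illustrative)
    num_items: int

ROUTING = Testlet("routing", 0.0, 10)
EASY = Testlet("stage2-easy", -1.0, 10)
HARD = Testlet("stage2-hard", +1.0, 10)

def administer(testlet: Testlet, true_ability: float) -> int:
    """Placeholder scoring: more ability relative to difficulty
    yields more items correct (deterministic, for illustration)."""
    p_correct = 0.5 + 0.3 * (true_ability - testlet.difficulty)
    p_correct = min(max(p_correct, 0.0), 1.0)
    return round(testlet.num_items * p_correct)

def two_stage_test(true_ability: float):
    stage1 = administer(ROUTING, true_ability)
    # Route on the stage-1 score: 60% correct or more -> harder testlet.
    stage2_testlet = HARD if stage1 >= 0.6 * ROUTING.num_items else EASY
    stage2 = administer(stage2_testlet, true_ability)
    return stage1, stage2_testlet.name, stage2

print(two_stage_test(true_ability=1.2))   # routed to the hard testlet
print(two_stage_test(true_ability=-0.8))  # routed to the easy testlet
```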
References
[1] Luecht, R. M., & Nungester, R. J. (1998). "Some practical examples of computer-adaptive sequential testing." Journal of Educational Measurement, 35, 229–249.
[2] Castle, R. A. (1997). "The Relative Efficiency of Two-Stage Testing Versus Traditional Multiple Choice Testing Using Item Response Theory in Licensure." Unpublished doctoral dissertation. (http://dwb.unl.edu/Diss/RCastle/ReedCastleDiss.html)
[3] Applied Measurement in Education edition on multistage testing (http://www.leaonline.com/toc/ame/19/3)
Multitrait-multimethod matrix
The multitrait-multimethod (MTMM) matrix is an approach to examining construct validity developed by Campbell and Fiske (1959).[1] There are six major considerations when examining a construct's validity through the MTMM matrix:
1. Evaluation of convergent validity: Tests designed to measure the same construct should correlate highly amongst themselves.
2. Evaluation of discriminant (divergent) validity: The construct being measured by a test should not correlate highly with different constructs.
3. Trait-method unit: Each task or test used in measuring a construct is considered a trait-method unit, in that the variance contained in the measure is part trait and part method. Generally, researchers desire low method-specific variance and high trait variance.
4. Multitrait-multimethod: More than one trait and more than one method must be used to establish (a) discriminant validity and (b) the relative contributions of the trait or method-specific variance. This tenet is consistent with the ideas proposed in Platt's concept of strong inference (1964).[2]
5. Truly different methodology: When using multiple methods, one must consider how different the actual measures are. For instance, delivering two self-report measures is not using truly different methods, whereas using an interview scale or a psychosomatic reading would be.
6. Trait characteristics: Traits should be different enough to be distinct, but similar enough to be worth examining in the MTMM.
Multitrait
Multiple traits are used in this approach to examine (a) similar or (b) dissimilar traits, so as to establish convergent and discriminant validity amongst traits.
Multimethod
Similarly, multiple methods are used in this approach to examine the differential effects (or lack thereof) caused by method-specific variance.
Example
The example below provides a prototypical matrix and explains what the correlations between measures mean. The main diagonal is typically filled in with the reliability coefficient of each measure (e.g., coefficient alpha). Descriptions in brackets [] indicate what is expected when the validity of the constructs (e.g., depression or anxiety) and the validities of the measures are all high.
Test | BDI | HDIv | BAI | HAIv
BDI | (Reliability Coefficient) [close to 1.00] | | |
HDIv | Heteromethod-monotrait [highest of all except reliability] | (Reliability Coefficient) [close to 1.00] | |
BAI | Monomethod-heterotrait [low, less than monotrait] | Heteromethod-heterotrait [lowest of all] | (Reliability Coefficient) [close to 1.00] |
HAIv | Heteromethod-heterotrait [lowest of all] | Monomethod-heterotrait [low, less than monotrait] | Heteromethod-monotrait [highest of all except reliability] | (Reliability Coefficient) [close to 1.00]

(Here BDI and BAI denote the survey measures of depression and anxiety, and HDIv and HAIv the corresponding interview measures.)
In this example the first row and the first column display the trait being assessed (i.e., anxiety or depression) as well as the method of assessing this trait (i.e., interview or survey, as measured by fictitious measures). The term heteromethod indicates that in this cell the correlation between two separate methods is being reported. Monomethod indicates the opposite: the same method is being used (e.g., interview, interview). Heterotrait indicates that the cell is reporting two supposedly different traits. Monotrait indicates the opposite: the same trait is being used. In evaluating an actual matrix, one wishes to examine the proportion of variance shared amongst traits and methods, so as to establish a sense of how much method-specific variance is induced by the measurement method, and to see how unique the trait is compared to another trait.

That is, the trait should matter more than the specific method of measuring it. For example, if a person is measured as being highly depressed by one measure, then another type of measure should also indicate that the person is highly depressed. On the other hand, people who appear highly depressed on the Beck Depression Inventory should not necessarily get high anxiety scores on Beck's Anxiety Inventory. Since the inventories were written by the same person and are similar in style, there might be some correlation, but this similarity in method should not affect the scores much, so the correlations between these measures of different traits should be low.
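As an illustration of how such a matrix can be assembled in practice, the following Python sketch builds a correlation matrix from simulated scores on two traits crossed with two methods. The data, the noise levels, and the correlation between the latent traits are all hypothetical; a real MTMM analysis would use observed test scores (and often confirmatory factor analysis, as in reference [3]).

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
depression = rng.normal(size=n)                   # latent depression trait
anxiety = 0.3 * depression + rng.normal(size=n)   # distinct but related trait

scores = pd.DataFrame({
    "BDI":  depression + 0.5 * rng.normal(size=n),   # depression, survey method
    "HDIv": depression + 0.5 * rng.normal(size=n),   # depression, interview method
    "BAI":  anxiety + 0.5 * rng.normal(size=n),      # anxiety, survey method
    "HAIv": anxiety + 0.5 * rng.normal(size=n),      # anxiety, interview method
})

mtmm = scores.corr()  # the correlation matrix plays the role of the MTMM matrix
print(mtmm.round(2))
# In this simulation the monotrait cells (BDI-HDIv, BAI-HAIv) come out highest
# and the heterotrait cells lowest, the pattern the MTMM logic looks for.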
References
[1] Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
[2] Platt, J. R. (1964). Strong inference. Science, 146 (3642).
[3] Figueredo, A., Ferketich, S., & Knapp, T. (1991). Focus on psychometrics: More on MTMM: The role of confirmatory factor analysis. Nursing & Health, 14, 387-391.
[4] Sawilowsky, S. (2002). A quick distribution-free test for trend that contributes evidence of construct validity. Measurement and Evaluation in Counseling and Development, 35, 78-88.
[5] Cuzzocrea, J., & Sawilowsky, S. (2009). Robustness to non-independence and power of the I test for trend in construct validity. Journal of Modern Applied Statistical Methods, 8(1), 215-225.
Neo-Piagetian theories of cognitive development

Jean Piaget's theory of cognitive development has been criticized on many grounds. One criticism concerns the very nature of development itself: it is suggested that Piaget's theory does not explain why development from stage to stage occurs. The theory is also criticized for ignoring individual differences in cognitive development; that is, it does not account for the fact that some individuals move from stage to stage faster than others. Finally, another criticism concerns the nature of the stages themselves. Research shows that the functioning of a person at a given age may be so variable from domain to domain, such as the understanding of social, mathematical, and spatial concepts, that it is not possible to place the person in a single stage.[1] To address these weaknesses, a group of researchers known as neo-Piagetian theorists advanced models that integrate concepts from Piaget's theory with concepts from cognitive and differential psychology.[2][3][4][5]
…formal, early formal, and late formal thought require a mental power of 1, 2, 3, 4, 5, 6, and 7 mental units, respectively. Having a lesser degree of mental power than required by a task makes the solution of this task impossible, because the necessary relations cannot be represented and computed. Thus, each increase in mental power with age opens the way for the construction of concepts and skills up to the new level of capacity. Falling short of or exceeding the mental power that is typical of a given age results in slower or faster rates of development, respectively.
…been explained above. The ability to solve simple arithmetic problems where one term is missing, such as "3 + ? = 8" or "4 ? 2 = 8", also depends on system mappings, because all three known factors given must be considered simultaneously if the missing element or operation is to be specified. At the final level, multiple-system mappings can be constructed. At this level quaternary relations, or relations between binary operations, can be constructed. For example, problems with two unknowns (e.g., 2 ? 2 ? 4 = 4) or problems of proportionality can be solved. That is, at this level four dimensions can be considered at once. The four levels of structure mappings are thought to be attainable at the ages of 1, 3, 5, and 10 years, respectively, and they correspond, in Piaget's theory of cognitive development, to the sensorimotor, the preoperational, the concrete operational, and the formal operational stage, or to Case's sensorimotor, interrelational, dimensional, and vectorial stage, respectively.
Processing potentials
Mental functioning at any moment occurs under the constraints of the processing potentials that are available at a given age. Processing potentials are specified in terms of three dimensions: speed of processing, control of processing, and representational capacity. Speed of processing refers to the maximum speed at which a given mental act may be efficiently executed. It is measured in reference to the reaction time on very simple tasks, such as the time needed to recognize an object. Control of processing involves executive functions that enable the person to keep the mind focused on a goal, protect attention from being captured by irrelevant stimuli, shift focus in a timely fashion to other relevant information if required, and inhibit irrelevant or premature responses, so that a strategic plan of action can be made and sustained. Reaction time in situations where one must choose between two or more alternatives is one measure of control of processing; Stroop effect tasks are good measures of it. Representational capacity refers to the various aspects of mental power or working memory mentioned above.[13]
Figure 1: The general model of the architecture of the developing mind integrating concepts from the theories of Demetriou and Case.
…is unacceptable in human relations. Table 1 summarizes the core processes, mental operations, and concepts that are typical of each domain. The domain specificity of these systems implies that the mental processes differ from one system to the other. Compare, for instance, arithmetic operations in the quantitative system with mental rotation in the spatial system: the first requires the thinker to relate quantities, the other requires the transformation of the orientation of an object in space. Moreover, the different systems require different kinds of symbols to represent and operate on their objects. Compare, for instance, mathematical symbolism in the quantitative system with mental images in the spatial system. Obviously, these differences make it difficult to equate the concepts and operations across the various systems in the mental load they impose on representational capacity, as the models above assume. Case (1992) also recognized that different types of problem domains, such as the domains of social, mathematical, and spatial thought, may have a different kind of central conceptual structure. That is, concepts and executive control structures differ across domains in the semantic networks that they involve.[15] As a result, development over different concepts within domains may proceed in parallel, but it may be uneven across domains. In fact, Case and Demetriou worked together to unify their analysis of domains, suggesting that Demetriou's domains may be specified in terms of Case's central conceptual structures.[16]

Table 1: The three levels of organization of each specialized system of thought
Hypercognition
The third level includes functions and processes oriented to monitoring, representing, and regulating the environment-oriented systems. The input to this level is information arising from the functioning of processing potentials and the environment-oriented systems, for example, sensations, feelings, and conceptions caused by mental activity. The term hypercognition was used to refer to this level and denote the effects that it exerts on the other two levels of the mind. Hypercognition involves two central functions, namely working hypercognition and long-term hypercognition.

Working hypercognition is a strong directive-executive function that is responsible for setting and pursuing mental and behavioral goals until they are attained. This function involves processes enabling the person to: (1) set mental and behavioral goals; (2) plan their attainment; (3) evaluate each step's processing demands vis-à-vis the available potentials, knowledge, skills and strategies; (4) monitor planned activities vis-à-vis the goals; and (5) evaluate the outcome attained. These processes operate recursively, in such a way that goals and subgoals may be renewed according to the online evaluation of the system's distance from its ultimate objective. These regulatory functions operate under the current structural constraints of the mind that define the current processing potentials.[14][17] Recent research suggests that these processes participate in general intelligence together with processing potentials and the general inferential processes used by the specialized thought domains described above.[18]

Consciousness is an integral part of the hypercognitive system. The very process of setting mental goals, planning their attainment, monitoring action vis-à-vis both the goals and the plans, and regulating real or mental action requires a system that can remember and review and therefore know itself. Therefore, conscious awareness and all ensuing functions, such as a self-concept (i.e., awareness of one's own mental characteristics, functions, and mental states) and a theory of mind (i.e., awareness of others' mental functions and states), are part of the very construction of the system. In fact, long-term hypercognition gradually builds maps or models of mental functions which are continuously updated. These maps are generally accurate representations of the actual organization of cognitive processes in the domains mentioned above.[14][18][19] When needed, they can be used to guide problem solving and understanding in the future. Optimum performance at any time depends on the interaction between actual problem solving processes specific to a domain and our representations of them. The interaction between the two levels of the mind ensures flexibility of behavior, because the self-oriented level provides the possibility for representing alternative environment-oriented representations and actions, and thus provides the possibility for planning.[14][18]
Development
All of the processes mentioned above develop systematically with age. Speed of processing increases systematically from early childhood to middle age and then starts to decrease again. For instance, recognizing a very simple object takes about 750 milliseconds at the age of 6 years and only about 450 milliseconds in early adulthood. Control of processing also becomes more efficient and capable of allowing the person to focus on more complex information, hold attention for longer periods of time, and alternate between increasingly larger stacks of stimuli and responses while filtering out irrelevant information. For instance, recognizing a particular stimulus among conflicting information may take about 2000 milliseconds at the age of 6 years and only about 750 milliseconds in early adulthood.[20]

All components of working memory (e.g., executive functions, numerical, phonological and visuospatial storage) increase with age.[13][20] However, the exact capacity of working memory varies greatly depending upon the nature of the information. For example, in the spatial domain, capacity may vary from 3 units at the age of six to 5 units at the age of 12 years. In the domain of mathematical thought, it may vary from about 2 to about 4 units in the same age period. If executive operations are required, the capacity is extensively limited, varying from about 1 unit at 6 to about 3 units at 12 years of age. Demetriou proposed the functional shift model to account for these data.[19] This model presumes that when the mental units of a given level reach a maximum degree of complexity, the mind tends to reorganize these units at a higher level of representation or integration so as to make them more manageable. Having created a new mental unit, the mind prefers to work with this rather than the previous units, due to its functional advantages. An example in the verbal domain would be the shift from words to sentences, and in the quantitative domain from natural numbers to algebraic representations of numerical relations. The functional shift model explains how new units are created, leading to stage change in the fashion described by Case[7] and Halford.[21]

The specialized domains develop through the life span both in terms of general trends and in terms of the typical characteristics of each domain. In the age span from birth to middle adolescence, the changes are faster in all of the domains. With development, thought in each of the domains becomes able to deal with increasingly more representations. Moreover, representations become increasingly interconnected with each other, and they acquire their meaning from their interrelations rather than simply from their relations with concrete objects. As a result, concepts in each of the domains become increasingly defined in reference to rules and general principles, bridging more local concepts and creating new, broader, and more abstract concepts. Moreover, understanding and problem solving in each of the domains evolve from global and less integrated to differentiated, but better integrated, mental operations. As a result, planning and operation from alternatives become increasingly part of the person's functioning, as does the ability to efficiently monitor the problem-solving process. This offers flexibility in cognitive functioning and problem solving across the whole spectrum of specialized domains. Table 2 summarizes the development of the domains from early childhood to adolescence.
Table 2: Modal characteristics of the specialized domains with development

In the hypercognitive system, self-awareness and self-regulation, that is, the ability to regulate one's own cognitive activity, develop systematically with age. Specifically, with development, self-awareness of cognitive processes becomes more accurate and shifts from the external and superficial characteristics of problems (e.g., this is about numbers and this is about pictures) to the cognitive processes involved (e.g., the one requires addition and the other requires mental rotation). Moreover, self-representations: (i) involve more dimensions which are better integrated into increasingly more complex structures; (ii) move along a concrete (e.g., I am fast and strong) to abstract (e.g., I am able) continuum, so that they become increasingly more abstract and flexible; and (iii) become more accurate in regard to the actual characteristics and abilities to which they refer (i.e., persons know where they are cognitively strong and where they are weak). The knowledge available at each phase defines the kind of self-regulation that can be effected. Thus, self-regulation becomes increasingly focused, refined, efficient, and strategic. Practically this implies that our information processing capabilities come under increasing a priori control of our long-term hypercognitive maps and our self-definitions.[17] Moreover, as we move into middle age, intellectual development gradually shifts from the dominance of systems that are oriented to the processing of the environment (such as spatial and propositional reasoning) to systems that require social support and self-understanding and management (social understanding). Thus, the transition to mature adulthood makes persons intellectually stronger and more self-aware of their strengths.[22]

There are strong developmental relations between the various processes, such that changes at any level of organization of the mind open the way for changes in other levels. Specifically, changes in speed of processing open the way for changes in the various forms of control of processing. These, in turn, open the way for the enhancement of working memory capacity, which subsequently opens the way for development in inferential processes, and for the development of the various specialized domains through the reorganization of domain-specific skills, strategies, and knowledge and the acquisition of new ones.[20]

There are top-down effects as well. That is, general inference patterns, such as implication (if ... then inferences) or disjunction (either ... or inferences), are constructed by mapping domain-specific inference patterns onto each other through the hypercognitive process of metarepresentation. Metarepresentation is the primary top-down mechanism of cognitive change: it looks for, codifies, and typifies similarities between mental experiences (past or present) to enhance understanding and problem-solving efficiency. In logical terms, metarepresentation is analogical reasoning applied to mental experiences or operations, rather than to representations of environmental stimuli. For example, if ... then sentences are heard over many different occasions in everyday language: if you are a good child then I will give you a toy; if it rains and you stay out then you become wet; if the glass falls on the floor then it breaks into pieces; etc. When a child realizes that the sequencing of the if ... then connectives in language is associated with situations in which the event or thing specified by "if" always comes first and leads to the event or thing specified by "then", this child is actually formulating the inference schema of implication. With development, the schema becomes a reasoning frame for predictions and interpretations of actual events or conversations about them.[3]
…Gardner's theory of multiple intelligences, which underestimates the operation of common processes.[30]
…underlying intra- and inter-individual differences could be educationally useful, because it highlights why the same student is not an equally good learner in different domains, and why different students in the same classroom react differently to the same instructional materials. For instance, differences between same-age students in the same classroom in processing efficiency and working memory may differentiate these students in their understanding and mastering of the concepts or skills taught at a given moment. That is, students falling behind the demands would most probably have problems in grasping the concepts and skills taught. Thus, knowing the students' potentials in this regard would enable the teacher to develop individual examples of the target concepts and skills that cater to the needs of the different students, so that no one is left behind.

Also, differences in the developmental condition, experience, familiarity, or interest in respect to the various domains would most certainly cause differences in how students respond to teaching related to them. This is equally true for differences between students and for differences within the same student. In Case's terms, the central conceptual structures available in different domains would not necessarily match the complexity of executive control structures that are possible based on the students' processing and representational capacity. As a result, teaching would have to accommodate these differences if it is to lead each of the students to the optimum of their possibilities across all domains. Finally, identifying individual differences with regard to the various aspects of cognitive development could be the basis for the development of programs of individualized instruction, which may focus on the gifted student or which may be of a remedial nature.[35][37]

The discussion here about the educational implications of the neo-Piagetian theories of cognitive development, taken as a whole, suggests that these theories provide a frame for designing educational interventions that is more focused and specific than traditional theories of cognitive development, such as the theory of Piaget, or theories of intelligence, such as the theories discussed above. Of course, much research is still needed for the proper application of these theories to the various aspects of education.
References
[1] Greenberg, D. (1987). Chapter 19, Learning (http://books.google.co.il/books?id=es2nOuZE0rAC&pg=PA91), Free at Last, The Sudbury Valley School. The experience of Sudbury model schools shows that a great variety can be found in the minds of children, against Piaget's theory of universal steps in comprehension and general patterns in the acquisition of knowledge: "No two kids ever take the same path. Few are remotely similar. Each child is so unique, so exceptional" (Greenberg, 1987). Retrieved June 26, 2010.
[2] Demetriou, A. (1998). Cognitive development. In A. Demetriou, W. Doise, & K. F. M. van Lieshout (Eds.), Life-span developmental psychology (pp. 179-269). London: Wiley.
[3] Demetriou, A., Mouyi, A., & Spanoudis, G. (2010). The development of mental processing. In W. F. Overton (Ed.), Biology, cognition and methods across the life-span. Volume 1 of the Handbook of life-span development (pp. 306-343), Editor-in-chief: R. M. Lerner. Hoboken, NJ: Wiley.
[4] Demetriou, A. (2006). Neo-Piagetische Ansätze. In W. Schneider & F. Wilkening (Eds.), Theorien, Modelle, und Methoden der Entwicklungspsychologie. Volume of Enzyklopädie der Psychologie (pp. 191-263). Göttingen: Hogrefe-Verlag.
[5] Mora, S. (2007). Cognitive development: Neo-Piagetian perspectives. London: Psychology Press.
[6] Pascual-Leone, J. (1970). A mathematical model for the transition rule in Piaget's developmental stages. Acta Psychologica, 32, 301-345.
[7] Case, R. (1985). Intellectual development: Birth to adulthood. New York: Academic Press.
[8] Case, R., Okamoto, Y., Griffin, S., McKeough, A., Bleiker, C., Henderson, B., & Stephenson, K. M. (1996). The role of central conceptual structures in the development of children's thought. Monographs of the Society for Research in Child Development, 61 (1-2, Serial No. 246).
[9] Case, R. (1992). The mind's staircase: Exploring the conceptual underpinnings of children's thought and knowledge. Hillsdale, NJ: Erlbaum.
[10] Halford, G. S. (1993). Children's understanding: The development of mental models. Hillsdale, NJ: Erlbaum.
[11] Fischer, K. W. (1980). A theory of cognitive development: The control and construction of hierarchies of skills. Psychological Review, 87, 477-531.
[12] Vygotsky, L. S. (1962). Thought and language. Cambridge, MA: MIT Press.
[13] Demetriou, A., Christou, C., Spanoudis, G., & Platsidou, M. (2002). The development of mental processing: Efficiency, working memory, and thinking. Monographs of the Society for Research in Child Development, 67, Serial No. 268.
[14] Demetriou, A., & Kazi, S. (2001). Unity and modularity in the mind and the self: Studies on the relationships between self-awareness, personality, and intellectual development from childhood to adolescence. London: Routledge.
NOMINATE (scaling method)

W-NOMINATE coordinates of members of the 111th House of Representatives.
Inventors: Keith T. Poole[1] (University of Georgia[2]) and Howard Rosenthal[3] (New York University[4])
NOMINATE (an acronym for Nominal Three-Step Estimation) is a multidimensional scaling method developed by political scientists Keith T. Poole and Howard Rosenthal in the early 1980s to analyze preferential and choice data, such as legislative roll-call voting behavior.[5][6] As computing capabilities grew, Poole and Rosenthal developed multiple iterations of their NOMINATE procedure: the original D-NOMINATE method, W-NOMINATE, and most recently DW-NOMINATE (for dynamic, weighted NOMINATE). In 2009, Poole and Rosenthal were named the first recipients of the Society for Political Methodology's Best Statistical Software Award for their development of NOMINATE, a recognition conferred to "individual(s) for developing statistical software that makes a significant research contribution."[7]
Procedure
Though there are important technical differences between these types of NOMINATE scaling procedures,[8] all operate under the same fundamental assumptions. First, alternative choices can be projected onto a basic, low-dimensional (often two-dimensional) Euclidean space. Second, within that space, individuals have utility functions which are bell-shaped (normally distributed) and maximized at their ideal point. Because individuals also have symmetric, single-peaked utility functions centered on their ideal point, ideal points represent individuals' most preferred outcomes. That is, individuals most desire outcomes closest to their ideal point, and will choose or vote probabilistically for the closest outcome. Ideal points can be recovered from observed choices, with individuals exhibiting similar preferences placed more closely together than those behaving dissimilarly.

It is helpful to compare this procedure to producing maps based on driving distances between cities. For example, Los Angeles is about 1,800 miles from St. Louis; St. Louis is about 1,200 miles from Miami; and Miami is about 2,700 miles from Los Angeles. From this (dis)similarities data, any map of these three cities should place Miami far from Los Angeles, with St. Louis somewhere in between (though a bit closer to Miami than to Los Angeles). Just as cities like Los Angeles and San Francisco would be clustered on a map, NOMINATE places ideologically similar legislators (e.g., liberal Senators Barbara Boxer (D-Calif.) and Al Franken (D-Minn.)) closer to each other, and farther from dissimilar legislators (e.g., conservative Senator Tom Coburn (R-Okla.)), based on the degree of agreement between their roll call voting records. At the heart of the NOMINATE procedures (and other multidimensional scaling methods, such as Poole's Optimal Classification method) are
algorithms that arrange individuals and choices in low-dimensional (usually two-dimensional) space. Thus, NOMINATE scores provide "maps" of legislatures.[9]

Using NOMINATE procedures to study congressional roll call voting behavior from the First Congress to the present day, Poole and Rosenthal published Congress: A Political-Economic History of Roll Call Voting[10] in 1997 and the revised edition Ideology and Congress[11] in 2007. Both were landmark works for their development and application of sophisticated measurement and scaling methods in political science. These works also revolutionized the study of American politics and, in particular, Congress. Their methods provided political scientists, for the first time, with quantitative measures of Representatives' and Senators' ideology across chambers and across time.
Poole and Rosenthal demonstrate thatdespite the many complexities of congressional representation and politicsroll call voting in both the House and the Senate can be organized and explained by no more than two dimensions throughout the sweep of American history. The first dimension (horizontal or x-axis) is the familiar left-right (or liberal-conservative) spectrum on economic matters. The second dimension (vertical or y-axis) picks up attitudes on cross-cutting, salient issues of the day (which include or have included slavery, bimetallism, civil rights, regional, and social/lifestyle issues). For the most part, congressional voting is uni-dimensional, with most of the variation in voting patterns explained by placement along the liberal-conservative first dimension.
Interpreting scores
For illustrative purposes, consider the following plots, which use W-NOMINATE scores to scale members of Congress and use the probabilistic voting model (in which legislators farther from the cutting line between yea and nay outcomes become more likely to vote in the predicted manner) to illustrate some major congressional votes in the 1990s. Some of these votes, like the House's vote on President Clinton's welfare reform package (the Personal Responsibility and Work Opportunity Act of 1996), are best modeled through the use of the first (economic liberal-conservative) dimension. On the welfare reform vote, nearly all Republicans joined the moderate-conservative bloc of House Democrats in voting for the bill, while opposition was virtually confined to the most liberal Democrats in the House. The errors (those representatives on the wrong side of the cutting line which separates predicted yeas from predicted nays) are generally close to the cutting line, which is what we would expect: a legislator directly on the cutting line is indifferent between voting yea and nay on the measure. All members are shown on the left panel of the plot, while only errors are shown on the right panel.
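To make the spatial model concrete, here is a small Python sketch of the bell-shaped utility and a probabilistic choice rule of the kind described above. The ideal points, outcome locations, and noise scale are hypothetical; this illustrates the model, not the actual NOMINATE estimation code.

import math

def utility(ideal, outcome, width=1.0):
    # Bell-shaped (Gaussian) utility in the distance from the ideal point.
    d2 = sum((i - o) ** 2 for i, o in zip(ideal, outcome))
    return math.exp(-d2 / (2 * width ** 2))

def p_yea(ideal, yea, nay, noise=5.0):
    # Logistic choice rule: the larger the utility gap, the surer the vote.
    gap = utility(ideal, yea) - utility(ideal, nay)
    return 1 / (1 + math.exp(-noise * gap))

# A legislator near the yea outcome in two-dimensional space votes yea
# with high probability; one near the cutting line would be close to 0.5.
print(round(p_yea(ideal=(-0.4, 0.1), yea=(-0.5, 0.0), nay=(0.6, 0.0)), 2))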
Economic ideology also dominates the Senate vote on the Balanced Budget Amendment of 1995.
On other votes, however, a second dimension (which has recently come to represent attitudes on cultural and lifestyle issues) is important. For example, roll call votes on gun control routinely split party coalitions, with socially conservative "blue dog" Democrats joining most Republicans in opposing additional regulation and socially liberal Republicans joining most Democrats in supporting gun control. The addition of the second dimension accounts for these inter-party differences, and the cutting line is more horizontal than vertical (meaning the cleavage falls on the second dimension rather than the first on these votes). This pattern was evident in the 1991 House vote to require waiting periods on handguns.
Political polarization
Poole and Rosenthal (beginning with their 1984 article "The Polarization of American Politics"[12]) have also used NOMINATE data to show that, since the 1970s, party delegations in Congress have become ideologically homogeneous and distant from one another (a phenomenon known as "polarization"). Using DW-NOMINATE scores (which permit direct comparisons between members of different Congresses across time), political scientists have demonstrated the expansion of ideological divides in Congress, which has spurred intense partisanship between Republicans and Democrats in recent decades.[13][14][15][16][17][18][19] Contemporary political polarization has had important consequences for American public policy, as Poole and Rosenthal (with fellow political scientist Nolan McCarty) show in their book Polarized America: The Dance of Ideology and Unequal Riches.[20]

Figure: Political polarization in the United States House of Representatives.
Applications
NOMINATE has been used to test, refine, and develop wide-ranging theories and models of the United States Congress.[21][22] In Ideology and Congress (pp. 270-271), Poole and Rosenthal agree that their findings are consistent with the "party cartel" model that Cox and McCubbins present in their 1993 book Legislative Leviathan.[23] Keith Krehbiel utilizes NOMINATE scores to determine the ideological rank order of both chambers of Congress in developing his "pivotal politics" theory,[24] as do Gary Cox and Matthew McCubbins in their tests of whether parties in Congress meet the conditions of responsible party government (RPG).[25] NOMINATE scores are also used by popular media outlets like The New York Times and The Washington Post as a measure of the political ideology of political institutions and elected officials or candidates. Political blogger Nate Silver regularly uses DW-NOMINATE scores to gauge the ideological location of major political figures and institutions.[26][27][28][29] NOMINATE procedures and related roll call scaling techniques have also been applied to a number of other legislative bodies besides the United States Congress, including the United Nations General Assembly,[30] the European Parliament,[31] national assemblies in Latin America,[32] and the French Fourth Republic.[33] Poole and
Rosenthal note in Chapter 11 of Ideology and Congress that most of these analyses produce the finding that roll call voting is organized by only a few dimensions (usually two): "These findings suggest that the need to form parliamentary majorities limits dimensionality."[34]
References
[1] http://polisci.uga.edu/people/profile/dr_keith_poole
[2] http://www.uga.edu/
[3] http://politics.as.nyu.edu/object/HowardRosenthal
[4] http://www.nyu.edu/
[5] Poole, Keith T. and Howard Rosenthal. 1983. "A Spatial Model for Legislative Roll Call Analysis." GSIA Working Paper No. 5-83-84. http://voteview.com/Upside_Down-A_Spatial_Model_for_Legislative_Roll_Call_Analysis_1983.pdf
[6] Poole, Keith T. and Howard Rosenthal. "A Spatial Model for Legislative Roll Call Analysis." American Journal of Political Science, May 1985, 357-384.
[7] The Society for Political Methodology: Awards. http://polmeth.wustl.edu/about.php?page=awards
[8] Description of NOMINATE Data. http://www.voteview.com/page2a.htm
[10] Poole, Keith T. and Howard Rosenthal. 1997. Congress: A Political-Economic History of Roll Call Voting. New York: Oxford University Press.
[11] Poole, Keith T. and Howard Rosenthal. 2007. Ideology and Congress. New Brunswick, NJ: Transaction Publishers. http://www.transactionpub.com/title/Ideology-and%20Congress-978-1-4128-0608-4.html
[12] Poole, Keith T. and Howard Rosenthal. 1984. "The Polarization of American Politics." Journal of Politics 46: 1061-79. http://www.voteview.com/The_Polarization_of_American_Politics_1984.pdf
[13] Theriault, Sean M. 2008. Party Polarization in Congress. Cambridge: Cambridge University Press.
[14] Jacobson, Gary. 2010. A Divider, Not a Uniter: George W. Bush and the American People. New York: Pearson Longman.
[15] Abramowitz, Alan I. 2010. The Disappearing Center: Engaged Citizens, Polarization, and American Democracy. New Haven, CT: Yale University Press.
[16] Levendusky, Matthew. 2009. The Partisan Sort: How Liberals Became Democrats and Conservatives Became Republicans. Chicago: University of Chicago Press.
[17] Baldassarri, Delia, and Andrew Gelman. 2008. "Partisans without Constraint: Political Polarization and Trends in American Public Opinion." American Journal of Sociology 114(2): 408-46.
[18] Fiorina, Morris P., with Samuel J. Abrams and Jeremy C. Pope. 2005. Culture Wars? The Myth of Polarized America. New York: Pearson Longman.
[19] Hetherington, Marc J. 2001. "Resurgent Mass Partisanship: The Role of Elite Polarization." American Political Science Review 95: 619-631.
[20] McCarty, Nolan, Keith T. Poole and Howard Rosenthal. 2006. Polarized America: The Dance of Ideology and Unequal Riches. Cambridge, MA: MIT Press. http://www.voteview.com/polarizedamerica.asp
[21] Kiewiet, D. Roderick and Matthew D. McCubbins. 1991. The Logic of Delegation. Chicago: University of Chicago Press.
[22] Schickler, Eric. 2000. "Institutional Change in the House of Representatives, 1867-1998: A Test of Partisan and Ideological Power Balance Models." American Political Science Review 94: 269-288.
[23] Cox, Gary W. and Matthew D. McCubbins. 1993. Legislative Leviathan. Berkeley: University of California Press.
[24] Krehbiel, Keith. 1998. Pivotal Politics: A Theory of U.S. Lawmaking. Chicago: University of Chicago Press.
[25] Cox, Gary W. and Matthew D. McCubbins. 2005. Setting the Agenda: Responsible Party Government in the U.S. House of Representatives. New York: Cambridge University Press.
[30] Voeten, Erik. 2001. "Outside Options and the Logic of Security Council Action." American Political Science Review 95: 845-858.
[31] Hix, Simon, Abdul Noury, and Gerald Roland. 2006. "Dimensions of Politics in the European Parliament." American Journal of Political Science 50: 494-511.
[32] Morgenstern, Scott. 2004. Patterns of Legislative Politics: Roll-Call Voting in Latin America and the United States. New York: Cambridge University Press.
[33] Rosenthal, Howard and Erik Voeten. 2004. "Analyzing Roll Calls with Perfect Spatial Voting: France 1946-1958." American Journal of Political Science 48: 620-632.
[34] Poole and Rosenthal, Ideology and Congress, p. 295.
External links
"NOMINATE and American Political History: A Primer." A helpful, more extensive introduction to NOMINATE (http://www.voteview.com/nominate_and_political_history_primer.pdf) Jordan Ellenberg, "Growing Apart: The Mathematical Evidence for Congress' Growing Polarization," Slate Magazine, 26 December 2001 (http://www.slate.com/id/2060047) "NOMINATE: A Short Intellectual History" (by Keith T. Poole) (http://www.voteview.com/nominate.pdf) Voteview website, with NOMINATE scores (http://www.voteview.com) Voteview Blog (http://voteview.com/blog/) W-NOMINATE in R: Software and Examples (http://www.voteview.com/wnominate_in_R.htm) Optimal Classification (OC) in R: Software and Examples (http://www.voteview.com/OC_in_R.htm)
Non-response bias
Non-response bias occurs in statistical surveys if the answers of respondents differ from the potential answers of those who did not answer.
Example
If one selects a sample of 1,000 managers in a field and polls them about their workload, the managers with a high workload may not answer the survey because they do not have enough time to do so, and those with a low workload may decline to respond for fear that their supervisors or colleagues will perceive them as unnecessary (either immediately, if the survey is non-anonymous, or in the future, should their anonymity be compromised by collusion, leaks, insufficient procedural precautions, or data-security breaches). Non-response bias may therefore make the measured value for the workload too low, too high, or, if the effects of these biases happen to offset each other, "right for the wrong reasons."
Test
There are different ways to test for non-response bias. In e-mail surveys, some values are already known for all potential participants (e.g., age, branch of the firm) and can be compared to the values that prevail in the subgroup of those who answered. If there is no significant difference, this is an indicator that there might be no non-response bias. Alternatively, those who did not answer can be systematically phoned and asked a small number of the survey questions. If their answers do not differ significantly from those who answered the survey, there might be no non-response bias. This technique is sometimes called non-response follow-up. Generally speaking, the lower the response rate, the greater the likelihood that non-response bias is in play. A simple version of the first check is sketched below.
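As a minimal sketch of that first check, the following Python fragment compares respondents and non-respondents on a variable known for everyone in the sample frame; the ages and the 0.05 threshold are hypothetical.

from scipy import stats

# Ages known for the whole sample frame (hypothetical values).
respondent_ages = [34, 41, 29, 52, 47, 38, 45, 31, 50, 44]
nonrespondent_ages = [27, 55, 60, 33, 58, 49, 62, 36, 57, 53]

t_stat, p_value = stats.ttest_ind(respondent_ages, nonrespondent_ages)
if p_value < 0.05:
    print(f"Groups differ on age (p = {p_value:.3f}): non-response bias is plausible.")
else:
    print(f"No significant age difference (p = {p_value:.3f}).")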
Related terminology
Response bias is not the opposite of non-response bias; rather, it relates to a possible tendency of respondents to give answers (a) of which they believe the questioner, or society in general, might approve, or (b) that they perceive would help yield a result promoting some desired goal of their own.

A special issue of Public Opinion Quarterly (Volume 70, Issue 5) is devoted to "Nonresponse Bias in Household Surveys": http://poq.oxfordjournals.org/content/70/5.toc
Norm-referenced test
A norm-referenced test (NRT) is a type of test, assessment, or evaluation which yields an estimate of the position of the tested individual in a predefined population, with respect to the trait being measured. This estimate is derived from the analysis of test scores and possibly other relevant data from a sample drawn from the population.[1] That is, this type of test identifies whether the test taker performed better or worse than other test takers, but not whether the test taker knows more or less material than is necessary for a given purpose. The term normative assessment refers to the process of comparing one test-taker to his or her peers.[1]

Norm-referenced assessment can be contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-referenced assessment, the score shows whether the test taker performed well or poorly on a given task, but not how that performance compares to other test takers; in an ipsative system, test takers are compared to their own previous performance. A small sketch of norm-referenced scoring follows.
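To illustrate, here is a minimal Python sketch of norm-referenced scoring; the norm sample is hypothetical, and scipy's percentileofscore stands in for the score-to-percentile lookup tables that real norming studies produce.

from scipy.stats import percentileofscore

norm_sample = [12, 15, 18, 20, 21, 23, 25, 27, 30, 33]  # hypothetical norming scores

# Percentile rank of an examinee's raw score of 26 within the norm group.
print(percentileofscore(norm_sample, 26))  # -> 70.0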
Other types
As an alternative to normative testing, tests can be ipsative: the individual's assessment is compared to his or her own performance through time.[2][3] By contrast, a test is criterion-referenced when provision is made for translating the test score into a statement about the behavior to be expected of a person with that score. The same test can be used in both ways.[4] Robert Glaser originally coined the terms norm-referenced test and criterion-referenced test. Standards-based education reform is based on the belief that public education should establish what every student should know and be able to do.[5] Students should be tested against a fixed yardstick, rather than against each other or sorted into a mathematical bell curve.[6] By requiring that every student pass these new, higher standards, education officials believe that all students will achieve a diploma that prepares them for success in the 21st century.[7]
Common use
Most state achievement tests are criterion-referenced. In other words, a predetermined level of acceptable performance is developed, and students pass or fail depending on whether they achieve it. Tests that set goals for students based on the average student's performance are norm-referenced tests; tests that set goals for students based on a set standard (e.g., 80 words spelled correctly) are criterion-referenced tests. Many college entrance exams and nationally used school tests use norm-referenced tests. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale for Children (WISC) compare individual student performance to the performance of a normative sample. Test-takers cannot "fail" a norm-referenced test, as each test-taker receives a score that compares the individual to others who have taken the test, usually given as a percentile. This is useful when there is a wide range of acceptable scores that differs from college to college. By contrast, nearly two-thirds of US high school students will be required to pass a criterion-referenced high school graduation examination. One high fixed score is set at a level adequate for university admission, whether the high school graduate is college bound or not. Each state gives its own test and sets its own passing level, with states like Massachusetts showing very high pass rates, while in Washington State even average students are failing, as are 80 percent of some minority groups. This practice is opposed by many in the education community, such as Alfie Kohn, as unfair to groups and individuals who do not score as high as others.
A rank-based system only produces data that tell which students perform at an average level, which students do better, and which students do worse. This contradicts the fundamental belief, whether optimistic or simply unfounded, that all students will perform at one uniformly high level in a standards-based system if enough incentives and punishments are put into place. This difference in beliefs underlies the most significant differences between a traditional and a standards-based education system.
Examples
IQ tests are norm-referenced tests, because their goal is to see which test taker is more intelligent than the other test takers. Theater auditions and job interviews are norm-referenced tests, because their goal is to identify the best candidate compared to the other candidates, not to determine how many of the candidates meet a fixed list of standards.
References
[1] Assessment Guided Practices (https://fp.auburn.edu/rse/trans_media/08_Publications/06_Transition_in_Action/chap8.htm)
[2] Assessment (http://www.dmu.ac.uk/~jamesa/teaching/assessment.htm)
[3] PDF presentation (http://www.psychology.nottingham.ac.uk/staff/nfr/rolefunction.pdf)
[4] Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York: Harper & Row.
[5] Illinois Learning Standards (http://www.isbe.state.il.us/ils/)
[6] Fairtest.org: Times on Testing (http://www.fairtest.org/nattest/times) "criterion referenced" tests measure students against a fixed yardstick, not against each other.
[7] By the Numbers: Rising Student Achievement in Washington State, by Terry Bergeson (http://www.newhorizons.org/spneeds/improvement/bergeson.htm) "She continues her pledge ... to ensure all students achieve a diploma that prepares them for success in the 21st century."
[8] NCTM: News & Media: Assessment Issues (Newsbulletin April 2004) (http://www.nctm.org/news/assessment/2004_04nb.htm) "by definition, half of the nation's students are below grade level at any particular moment"
[9] National Children's Reading Foundation website (http://www.readingfoundation.org/about/about_us.asp)
[10] HOUSE BILL REPORT HB 2087 (http://www.leg.wa.gov/pub/billinfo/2001-02/house/2075-2099/2087_hbr.pdf) "A number of critics ... continue to assert that the mathematics WASL is not developmentally appropriate for fourth grade students."
[11] Prof. Don Orlich, Washington State University
[12] Panel lowers bar for passing parts of WASL, by Linda Shaw, Seattle Times, May 11, 2004 (http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=wasl11m&date=20040511) "A blue-ribbon panel voted unanimously yesterday to lower the passing bar in reading and math for the fourth- and seventh-grade exam, and in reading on the 10th-grade test"
[13] Study: Math in 7th-grade WASL is hard, by Linda Shaw, Seattle Times, December 06, 2002 (http://archives.seattletimes.nwsource.com/cgi-bin/texis.cgi/web/vortex/display?slug=mathtest06m&date=20021206&query=WASL+7th+grade) "Those of you who failed the math section ... last spring had a harder test than your counterparts in the fourth or 10th grades."
[14] New Jersey Department of Education (http://www.state.nj.us/njded/njpep/assessment/naep/index.html): "But we already have tests in New Jersey, why have another test? Our statewide test is an assessment that only New Jersey students take. No comparisons should be made to other states, or to the nation as a whole."
[15] Test-Based Accountability Systems (Rand) (http://www.rand.org/pubs/research_briefs/RB8017/index1.html) "NAEP data are particularly important ... Taken together, these trends suggest appreciable inflation of gains on KIRIS. ..."
[16] Relationship of the Washington Assessment of Student Learning (WASL) and Placement Tests Used at Community and Technical Colleges, by Dave Pavelchek, Paul Stern and Dennis Olson, Social & Economic Sciences Research Center, Puget Sound Office, WSU (http://www.transitionmathproject.org/assetts/docs/highlights/wasl_report.doc) "The average difficulty ratings for WASL test questions fall in the middle of the range of difficulty ratings for the college placement tests."
External links
A webpage (http://www.citrus.kcusd.com/instruction.htm) about instruction that discusses assessment
Normal curve equivalent

Normal curve equivalents are on an equal-interval scale (see [2] and [3] for examples). This is advantageous compared to percentile-rank scales, which suffer from the problem that the difference between any two scores is not the same as that between any other two scores (see below or percentile rank for more information). The major advantage of NCEs over percentile ranks is that NCEs can be legitimately averaged.[4]

The Rochester School Department webpage describes how NCE scores change: in a normally distributed population, if all students were to make exactly one year of progress after one year of instruction, then their NCE scores would remain exactly the same and their NCE gain would be zero, even though their raw scores (i.e., the number of questions they answered correctly) increased. Some students will make more than a year's progress in that time and will have a net gain in the NCE score, which means that those students have learned more, or at least have made more progress in the areas tested, than the general population. Other students, while making progress in their skills, may progress more slowly than the general population and will show a net loss in their NCE ranks.
Caution
Careful consideration is required when computing effect sizes using NCEs. NCEs differ from other scores, such as raw and scaled scores, in the magnitude of the effect sizes: comparison of NCEs typically results in smaller effect sizes, and using the typical ranges for other effect sizes may result in interpretation errors.[5]

Excel formula for conversion from percentile to NCE: =21.06*NORMSINV(PR/100)+50, where PR is the percentile value.
Excel formula for conversion from NCE to percentile: =100*NORMSDIST((NCE-50)/21.06), where NCE is the Normal Curve Equivalent value.
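The same conversions can be written in Python, as a sketch using scipy's normal distribution in place of Excel's NORMSINV and NORMSDIST:

from scipy.stats import norm

def percentile_to_nce(pr):
    # Mirrors =21.06*NORMSINV(PR/100)+50
    return 21.06 * norm.ppf(pr / 100) + 50

def nce_to_percentile(nce):
    # Mirrors =100*NORMSDIST((NCE-50)/21.06)
    return 100 * norm.cdf((nce - 50) / 21.06)

print(percentile_to_nce(50))             # -> 50.0
print(round(percentile_to_nce(99), 1))   # -> 99.0

Because the NCE scale is built from the normal distribution with a mean of 50 and a standard deviation of 21.06, the two scales agree exactly at percentiles 1, 50, and 99, as the printed values show.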
References
[1] Mertler, C. A. (2002). Using standardized test data to guide instruction and intervention. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation. (ERIC Document Reproduction Service (http://www.eric.ed.gov/) No. ED470589)
Normal curve equivalent (NCE): A normalized standardized score with a mean of 50 and a standard deviation of 21.06 resulting in a near equal interval scale from 0 to 99. The NCE was developed by RMC Research Corporation in 1976 to measure the effectiveness of the Title I Program across the United States and is often used to measure gains over time. (p. 3)
[2] http://www.rochesterschools.com/Webmaster/StaffHelp/rdgstudy/ncurve2.gif
[3] http://www.citrus.kcusd.com/gif/bellcurve.gif
[4] Rochester School Department webpage (http://www.rochesterschools.com/Webmaster/StaffHelp/rdgstudy/nce.html)
[5] McLean, J. E., O'Neal, M. R., & Barnette, J. J. (2000, November). Are all effect sizes created equal? Paper presented at the Annual Meeting of the Mid-South Educational Research Association, Bowling Green, KY. (ERIC Document Reproduction Service (http://www.eric.ed.gov/) No. ED448188)
External links
Norm Scale Calculator (http://www.psychometrica.de/normwertrechner_en.html), a utility for the transformation and visualization of norm scores.
Scholastic Testing Service (http://ststesting.com/explainit.html), a glossary of terms related to the bell or normal curve.
UCLA stats: How should I analyze percentile rank data (http://www.ats.ucla.edu/stat/stata/faq/prank.htm), describing how to convert percentile ranks to NCEs with Stata.
Objective test
An objective test is a psychological test that measures an individual's characteristics in a way that is independent of rater bias or the examiner's own beliefs, usually by administering a bank of questions that are marked and compared against exacting, completely standardized scoring mechanisms, much as examinations are administered. Objective tests are often contrasted with projective tests, which are sensitive to rater or examiner beliefs. Projective tests are based on Freudian psychology (psychoanalysis) and seek to expose the unconscious perceptions of people. Objective tests tend to have more validity than projective tests; however, they are still subject to the willingness of the subject to be open about his or her personality, and as such can sometimes represent the true personality of the subject poorly. Projective tests purportedly expose certain aspects of the personality of individuals that are impossible to measure by means of an objective test, and are much more reliable at uncovering "protected" or unconscious personality traits or features.

An objective test is built by following a rigorous protocol, which includes the following steps:
- Making decisions on nature, goal, target population, and power.
- Creating a bank of questions.
- Estimating the validity of the questions, by means of statistical procedures and/or the judgement of experts in the field.
- Designing a format of application (a clear, easy-to-answer questionnaire, or an interview, etc.).
- Detecting which questions are better in terms of discrimination, clarity, and ease of response, upon application to a pilot sample.
- Applying a revised questionnaire or interview to a sample.
- Using appropriate statistical procedures to establish norms for the test.
Online assessment
Online assessment is the process used to measure certain aspects of information for a set purpose where the assessment is delivered via a computer connected to a network. Most often the assessment is some type of educational test. Different types of online assessments contain elements of one or more of the following components, depending on the assessment's purpose: formative, diagnostic, or summative. Instant and detailed feedback, as well as flexibility of location and time, are just two of the many benefits associated with online assessments. There are many resources available that provide online assessments, some free of charge and others that charge fees or require a membership.
Purpose of assessments
Assessments are a vital part of determining student achievement. They are used to determine the knowledge gained by students and to determine if adjustments need to be made to either the teaching or learning process.[1]
Different types of online assessment serve different purposes:
- Practice testing: With the ever-increasing use of high-stakes testing in the educational arena, online practice tests are used to give students an edge. Students can take these types of assessments multiple times to familiarize themselves with the content and format of the assessment.
- Surveys: Online surveys may be used by educators to collect data and feedback on student attitudes, perceptions, or other types of information that might help improve the instruction.
- Evaluations: This type of survey allows facilitators to collect data and feedback on any type of situation where the course or experience needs justification or improvement.
- Performance testing: Users show what they know and what they can do. This type of testing is used to show technological proficiency, reading comprehension, math skills, etc. This assessment is also used to identify gaps in student learning.

New technologies, such as the Web, digital video, sound, animations, and interactivity, are providing tools that can make assessment design and implementation more efficient, timely, and sophisticated.
Academic Dishonesty
Academic dishonesty, commonly known as cheating, occurs at all levels of educational institutions. In traditional classrooms, students cheat in various forms: using hidden prepared notes that are not permitted, looking at another student's paper during an exam, copying homework from one another, or copying from a book, article, or other medium without properly citing the source. Individuals may be dishonest because of poor time-management skills, the pursuit of better grades, cultural behavior, or a misunderstanding of plagiarism.[5] Online classroom environments are no exception to the possibility of academic dishonesty, which can easily be seen from a student's perspective as an easy route to a passing grade. Appropriate assignment types, meetings, and projects can prevent academic dishonesty in the online classroom.[6]
Operational definition
An operational definition, also called a functional definition,[1][2] defines something (e.g. a variable, term, or object) in terms of the specific process or set of validation tests used to determine its presence and quantity. That is, one defines something in terms of the operations that count as measuring it.[3] The term was coined in the philosophy of science book The Logic of Modern Physics (1927) by Percy Williams Bridgman, and it is part of the process of operationalization. One might use definitions that rely on operations in order to avoid the troubles associated with attempting to define things in terms of some intrinsic essence.
[Image caption: The operational definition of a peanut butter sandwich might be simply "the result of putting peanut butter on a slice of bread with a butter knife and laying a second equally sized slice of bread on top".]
An example of an operational definition might be defining the weight of an object in terms of the numbers that appear when that object is placed on a weighing scale. The weight, then, is whatever results from following the (weight) measurement procedure, which should be repeatable by anyone. This is in contrast to operationalization that uses theoretical definitions.
Overview
Properties described in this manner must be sufficiently accessible that persons other than the definer may independently measure or test for them at will.[citation needed] An operational definition is generally designed to model a theoretical definition. At its most basic, an operational definition is a process for identifying an object by distinguishing it from its background of empirical experience. The binary version produces the result that the object either exists or does not exist in the experiential field to which it is applied. The classifier version discriminates between what is part of the object and what is not part of it. This is also discussed in terms of semantics, pattern recognition, and operational techniques, such as regression.
"Operationalize" means to put into operation. Operational definitions are also used to define system states in terms of a specific, publicly accessible process of preparation or validation testing which is repeatable at will. For example, 100 degrees Celsius may be crudely defined by describing the process of heating water at sea level until it is observed to boil. An item like a brick, or even a photograph of a brick, may be defined in terms of how it can be made. Likewise, iron may be defined in terms of the results of testing or measuring it in particular ways.
Vandervert (1980/1988) described in scientific detail a simple, everyday illustration of an operational definition: a cake recipe is an operational definition used in a specialized laboratory known as the household kitchen. Similarly, the saying "if it walks like a duck and quacks like a duck, it must be some kind of duck" may be regarded as involving a sort of measurement process or set of tests (see duck test).
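The binary and classifier versions described above can be illustrated, very loosely, as repeatable test procedures. The following toy Python sketch is my own illustration, not drawn from the text; the field of readings and the boiling test are invented.

def binary_version(field, test):
    """Binary form: does the object exist anywhere in the field?"""
    return any(test(x) for x in field)

def classifier_version(field, test):
    """Classifier form: which parts of the field belong to the object?"""
    return [x for x in field if test(x)]

readings = [12.0, 99.5, 100.1, 100.3, 37.0]   # degrees Celsius (invented)
is_boiling = lambda t: t >= 100.0             # crude sea-level test
print(binary_version(readings, is_boiling))       # True
print(classifier_version(readings, is_boiling))   # [100.1, 100.3]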
Application
Despite the controversial philosophical origins of the concept, particularly its close association with logical positivism, operational definitions have undisputed practical applications. This is especially so in the social and medical sciences, where operational definitions of key terms are used to preserve the unambiguous empirical testability of hypotheses and theories. Operational definitions are also important in the physical sciences.
Philosophy
Writing in the Stanford Encyclopedia of Philosophy, Richard Boyd says the following about operationalism:[4]
The idea originally arises in the operationalist philosophy of P. W. Bridgman and others. By 1914, Bridgman was dismayed by the abstraction and lack of clarity with which, he argued, many scientific concepts were expressed. Inspired by logical positivism and the phenomenalism of Ernst Mach, in 1914 he declared that the meaning of a theoretical term (or unobservable entity), such as electron density, lay in the operations, physical and mental, performed in its measurement. The goal was to eliminate all reference to theoretical entities by "rationally reconstructing" them in terms of the particular operations of laboratory procedures and experimentation. Hence, the term electron density could be analyzed into a statement of the following form:
(*) The electron density of an object, O, is given by the value, x, if and only if P applied to O yields the value x,
where P stands for an instrument that scientists take as a procedure for measuring electron density.
Operationalism, defined in this way, was rejected even by the logical positivists, due to inherent problems: defining terms operationally necessarily implied the analytic necessity of the definition. The analyticity of operational definitions like (*) is essential to the project of rational reconstruction. Operationalism is not, for example, the idea that electron density is defined as whatever magnitude instruments of the sort P reliably measure. On that conception, (*) would represent an empirical discovery about how to measure electron density, but, since electrons are unobservables, that is a realist conception, not an empiricist one. What the project of rational reconstruction requires is that (*) be true purely as a matter of linguistic stipulation about how the term "electron density" is to be used. Since (*) is supposed to be analytic, it is supposed to be unrevisable. There is supposed to be no such thing as discovering, about P, that some other instrument provides a more accurate value for electron density, or provides values for electron density under conditions where P does not function. Here again, thinking that there could be such an improvement in P with respect to electron density requires thinking of electron density as a real feature of the world which P (perhaps only approximately) measures. But that is the realist conception that operationalism is designed rationally to do away with!
In actual, and apparently reliable, scientific practice, changes in the instrumentation associated with theoretical terms are routine, and apparently crucial to the progress of science. According to a 'pure' operationalist conception, these sorts of modifications would not be methodologically acceptable, since each definition must be considered to identify a unique 'object' (or class of objects). In practice, however, an 'operationally defined' object is often taken to be that object which is determined by a constellation of different unique 'operational procedures.' Most logical empiricists were not willing to accept the conclusion that operational definitions must be unique (in contradiction to 'established' scientific practice). So they felt compelled to reject operationalism. In the end, it reduces to a reductio ad absurdum, since each measuring instrument must itself be operationally defined, in infinite regress... But this was also a failure of the logical positivist approach generally.
However, this rejection of operationalism as a general project destined ultimately to define all experiential phenomena uniquely did not mean that operational definitions ceased to have any practical use or that they could not be applied in particular cases.
Science
The special theory of relativity can be viewed as the introduction of operational definitions for simultaneity of events and of distance, that is, as providing the operations needed to define these terms. In quantum mechanics the notion of operational definitions is closely related to the idea of observables, that is, definitions based upon what can be measured. Operational definitions are at their most controversial in the fields of psychology and psychiatry, where intuitive concepts, such as intelligence, need to be operationally defined before they become amenable to scientific investigation, for example through processes such as IQ tests. Such definitions are used as a follow-up to a theoretical definition, in which the specific concept is defined as a measurable occurrence. John Stuart Mill pointed out the dangers of believing that anything that could be given a name must refer to a thing, and Stephen Jay Gould and others have criticized psychologists for doing just that. A committed operationalist would respond that speculation about the thing in itself, or noumenon, should be resisted as meaningless, and would comment only on phenomena using operationally defined terms and tables of operationally defined measurements. A behaviorist psychologist might (operationally) define intelligence as the score obtained on a specific IQ test (e.g., the Wechsler Adult Intelligence Scale) by a human subject, ignoring the theoretical underpinnings of the WAIS entirely. This WAIS measurement would be useful only to the extent it could be shown to be related to other operationally defined measurements, e.g., to the measured probability of graduation from university.[5] Operational definitions are the foundation of the diagnostic nomenclature of mental disorders (the classification of mental disorders) from the DSM-III onward.[6][7]
Business
On October 15, 1970, the West Gate Bridge in Melbourne, Australia, collapsed, killing 35 construction workers. The subsequent enquiry found that the failure arose because engineers had specified the supply of a quantity of flat steel plate. The word "flat" in this context lacked an operational definition, so there was no test for accepting or rejecting a particular shipment or for controlling quality. In his managerial and statistical writings, W. Edwards Deming placed great importance on the value of using operational definitions in all agreements in business. As he said:
"An operational definition is a procedure agreed upon for translation of a concept into measurement of some kind." - W. Edwards Deming
"There is no true value of any characteristic, state, or condition that is defined in terms of measurement or observation. Change of procedure for measurement (change of operational definition) or observation produces a new number." - W. Edwards Deming
General process
Operational, in a process context, can also denote a working method or a philosophy that focuses principally on cause-and-effect relationships (or stimulus/response, behavior, etc.) of specific interest to a particular domain at a particular point in time. As a working method, it does not consider issues related to a domain that are more general, such as the ontological. The term can be used strictly within the realm of the interactions of humans with advanced computational systems. In this sense, an AI system cannot be entirely operational if learning is involved (an issue that can be used to discuss strong versus weak AI). Given that one motive for the operational approach is stability, systems that relax the operational factor can be problematic, for several reasons, as the operational is a means to manage complexity. There will be differences in the nature of the operational as it pertains to degrees along the end-user computing axis.
For instance, a knowledge-based engineering system can enhance its operational aspect, and thereby its stability, through more involvement by the subject-matter expert (SME), thereby opening up issues of limits that are related to being human, in the sense that, many times, computational results have to be taken at face value due to several factors (hence the necessity of the duck test) that even an expert cannot overcome. The end proof may be the final results (a reasonable facsimile by simulation or artifact, a working design, etc.), which are not guaranteed to be repeatable, may have been costly to attain (in time and money), and so forth. Many domains with a numerics focus use limits logic to overcome the duck-test necessity, with varying degrees of success. Complex situations may require logic to be more non-monotonic than not, raising concerns related to the qualification, frame, and ramification problems.
Examples
Temperature
The thermodynamic definition of temperature, due to Nicolas Léonard Sadi Carnot, refers to heat "flowing" between "infinite reservoirs". This is all highly abstract and unsuited to the day-to-day world of science and trade. In order to make the idea concrete, temperature is defined in terms of operations with the gas thermometer. However, these are sophisticated and delicate instruments, adapted only to the national standardization laboratory. For day-to-day use, the International Temperature Scale of 1990 (ITS-90) is used, defining temperature in terms of characteristics of the several specific sensor types required to cover the full range. One such characteristic is the electrical resistance of a thermistor of specified construction, calibrated against operationally defined fixed points.
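As a rough illustration of this kind of sensor-based operational definition, the sketch below converts a thermistor resistance to a temperature with the Steinhart-Hart equation, a standard thermistor model. The calibration coefficients are hypothetical values of the sort obtained by calibrating one device against fixed points, not constants taken from ITS-90 or from the text.

import math

A, B, C = 1.129e-3, 2.341e-4, 8.775e-8   # hypothetical calibration constants

def thermistor_temperature_c(resistance_ohms: float) -> float:
    """Steinhart-Hart: 1/T = A + B*ln(R) + C*(ln R)**3, with T in kelvin."""
    ln_r = math.log(resistance_ohms)
    t_kelvin = 1.0 / (A + B * ln_r + C * ln_r ** 3)
    return t_kelvin - 273.15

print(round(thermistor_temperature_c(10000.0), 1))  # about 25.0 C here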
Electric current
Electric current is defined in terms of the force between two infinite parallel conductors, separated by a specified distance. This definition is too abstract for practical measurement, so a device known as a current balance is used to define the ampere operationally.
Mechanical hardness
Unlike temperature and electric current, there is no abstract physical concept of the hardness of a material. It is a slightly vague, subjective idea, somewhat like the idea of intelligence. In fact, it leads to three more specific ideas: 1. scratch hardness, measured on Mohs' scale; 2. indentation hardness; and 3. rebound, or dynamic, hardness, measured with a Shore scleroscope. Of these, indentation hardness itself leads to many operational definitions, the most important of which are: 1. the Brinell hardness test, using a 10 mm steel ball; 2. the Vickers hardness test, using a pyramidal diamond indenter; and 3. the Rockwell hardness test, using a diamond cone indenter. In all of these, a process is defined for loading the indenter, measuring the resulting indentation, and calculating a hardness number. Each of these three sequences of measurement operations produces numbers that are consistent with our subjective idea of hardness: the harder the material is to our informal perception, the greater the number it will achieve on the respective hardness scale. Furthermore, experimental results obtained using these measurement methods have shown that the hardness number can be used to predict the stress required to permanently deform steel, a characteristic that fits in well with our idea of resistance to permanent deformation. However, there is not always a simple relationship between the various hardness scales; Vickers and Rockwell hardness numbers exhibit qualitatively different behaviour when used to describe some materials and phenomena.
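As a worked instance of an indentation-hardness procedure, the sketch below applies the standard Brinell formula, HB = 2F / (pi * D * (D - sqrt(D^2 - d^2))), with the load F in kgf and the ball and indentation diameters D and d in mm. The sample numbers are illustrative only.

import math

def brinell_hardness(load_kgf: float, ball_mm: float, indent_mm: float) -> float:
    """Hardness number from load, ball diameter, and indentation diameter."""
    D, d = ball_mm, indent_mm
    return (2 * load_kgf) / (math.pi * D * (D - math.sqrt(D ** 2 - d ** 2)))

# A 3000 kgf load on a 10 mm ball leaving a 4 mm indentation:
print(round(brinell_hardness(3000, 10.0, 4.0), 1))  # about 228.8 HB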
Duck typing
In advanced modeling, with the requisite computational support such as knowledge-based engineering, mappings must be maintained between a real-world object, its abstracted counterparts as defined by the domain and its experts, and the computer models. Mismatches between domain models and their computational mirrors can raise issues that are apropos to this topic. Techniques that allow the flexible modeling required for many hard problems must resolve issues of identity, type, etc. which then lead to methods, such as duck typing.
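A minimal illustration of duck typing itself, in Python: the calling code never inspects an object's declared type, only whether it supports the operations required. The classes here are invented for the example.

class Duck:
    def quack(self):
        return "Quack"

class QuackingRobot:
    def quack(self):
        return "Synthesized quack"

def exercise(entity):
    # "If it quacks like a duck...": any object with a quack() method
    # passes the operational test, whatever its class.
    return entity.quack()

for thing in (Duck(), QuackingRobot()):
    print(exercise(thing))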
Weight: conceptually, a measurement of the gravitational force acting on an object; operationally, the result of measuring the object on a newton spring scale.
Further reading
Ballantyne, Paul F. History and Theory of Psychology Course, in Langfeld, H.S. (1945). Introduction to the Symposium on Operationism. Psyc. Rev. 32, 241-243. (http://www.comnet.ca/~pballan/operationism(1945).htm)
Bohm, D. (1996). On dialog. N.Y.: Routledge.
Boyd, Richard. On the Current Status of the Issue of Scientific Realism, in Erkenntnis. 19: 45-90.
Bridgman, P. W. (1959). The way things are. Cambridge: Harvard University Press.
Carnap, R. The Elimination of Metaphysics Through Logical Analysis of Language, in Ayer, A.J. 1959.
Churchland, Patricia (1986). Neurophilosophy: Toward a unified science of the mind/brain. MIT Press.
Churchland, Paul (1989). A Neurocomputational Perspective: The Nature of Mind and the Structure of Science. MIT Press.
Dennett, Daniel C. (1992). Consciousness Explained. Little, Brown & Co.
Depraz, N. (1999). "The phenomenological reduction as praxis." Journal of Consciousness Studies, 6(2-3), 95-110.
Hardcastle, G. L. (1995). "S.S. Stevens and the origins of operationism." Philosophy of Science, 62, 404-424.
Hermans, H. J. M. (1996). "Voicing the self: from information processing to dialogical interchange." Psychological Bulletin, 119(1), 31-50.
Hyman, Bronwen (U of Toronto) and Shephard, Alfred H. (U of Manitoba) (1980). "Zeitgeist: The Development of an Operational Definition." The Journal of Mind and Behavior, 1(2), pp. 227-246.
Leahy, Thomas H. (Virginia Commonwealth U) (1980). "The Myth of Operationism." Ibid., pp. 127-144.
Ribes-Inesta, Emilio (2003). "What Is Defined In Operational Definitions? The Case Of Operant Psychology." Behavior and Philosophy. (http://www.findarticles.com/p/articles/mi_qa3814/is_200301/ai_n9222880)
Roepstorff, A. & Jack, A. (2003). "Editorial introduction, Special Issue: Trusting the Subject? (Part 1)." Journal of Consciousness Studies, 10(9-10), v-xx.
Roepstorff, A. & Jack, A. (2004). "Trust or Interaction? Editorial introduction, Special Issue: Trusting the Subject? (Part 2)." Journal of Consciousness Studies, 11(7-8), v-xxii.
Stevens, S. S. (1963). Operationism and logical positivism, in M. H. Marx (Ed.), Theories in contemporary psychology (pp. 47-76). New York: MacMillan.
Thomson Wadsworth, eds. Learning Psychology: Operational Definitions Research Methods Workshops. (http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/res_methd/op_def/op_def_01.html)
Operationalization
In the social sciences and humanities, operationalization is the process of defining a fuzzy concept so as to make it clearly distinguishable or measurable and to understand it in terms of empirical observations. In a wider sense, it refers to the process of specifying the extension of a concept, describing what is and is not part of that concept. Operationalization often involves creating both operational definitions and theoretical definitions.
Theory
Early operationalism
[Image caption: An example of operationally defining "personal space".]
Operationalization is used to refer specifically to the scientific practice of operationally defining, where even the most basic concepts are defined through the operations by which we measure them. The practice comes from the philosophy of science book The Logic of Modern Physics (1927) by Percy Williams Bridgman, whose methodological position is called operationalism.[1] Bridgman's theory was criticized on the grounds that, because we measure "length" in various ways (e.g. it is impossible to use a measuring rod to measure the distance to the Moon), "length" logically is not one concept but many, each defined by the measuring operations used.[citation needed] Another example is the radius of a sphere, which obtains different values depending on the way it is measured (say, in metres and in millimetres). Bridgman said the concept is defined by the measurement, so the criticism is that we would end up with endless concepts, each defined by the procedures that measured it.[citation needed] Bridgman notes that in the theory of relativity we see how a concept like "duration" can split into multiple different concepts. As part of the process of refining a physical theory, it may be found that what was one concept is, in fact, two or more distinct concepts. However, Bridgman proposes that if we stick only to operationally defined concepts, this will never happen.
The practical 'operational definition' is generally understood as relating to the theoretical definitions that describe reality through the use of theory. The importance of careful operationalization can perhaps be seen more clearly in the development of general relativity. Einstein discovered that there were two operational definitions of "mass" being used by scientists: inertial, defined by applying a force and observing the acceleration, from Newton's second law of motion; and gravitational, defined by putting the object on a scale or balance. Previously, no one had paid any attention to the different operations used because they always produced the same results,[citation needed] but the key insight of Einstein was to posit the principle of equivalence, that the two operations would always produce the same result because they were equivalent at a deep level, and to work out the implications of that assumption; the result is the general theory of relativity. Thus, a breakthrough in science was achieved by disregarding different operational definitions of scientific measurements and realizing that they both described a single theoretical concept. Einstein's disagreement with the operationalist approach was criticized by Bridgman[2] as follows: "Einstein did not carry over into his general relativity theory the lessons and insights he himself has taught us in his special theory." (p. 335).
Anger example
For example, a researcher may wish to measure the concept "anger." Its presence, and the depth of the emotion, cannot be directly measured by an outside observer because anger is intangible. Rather, other measures are used by outside observers, such as facial expression, choice of vocabulary, loudness and tone of voice.
[Image caption: An example of an operationalization in an academic paper, tailored to use in the field of political science.]
If a researcher wants to measure the depth of "anger" in various persons, the most direct operation would be to ask them a question, such as "are you angry?" or "how angry are you?". This operation is problematic, however, because it depends upon the individual's own definition. Some people might be subjected to a mild annoyance and become slightly angry, but describe themselves as "extremely angry," whereas others might be subjected to a severe provocation and become very angry, but describe themselves as "slightly angry." In addition, in many circumstances it is impractical to ask subjects whether they are angry.
Since one of the measures of anger is loudness, the researcher can operationalize the concept of anger by measuring how loudly the subject speaks compared with his or her normal tone. However, this must assume that loudness is a uniform measure. Some subjects might respond verbally while others might respond physically, which makes anger, so measured, a non-operational variable.
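A minimal sketch of the loudness operationalization just described, assuming speech level is available in decibels: anger is scored relative to the subject's own baseline, which addresses between-person volume differences but not the verbal-versus-physical problem noted above. The numbers are invented.

def anger_score(sample_db: float, baseline_db: float) -> float:
    """Decibels above the subject's own normal speaking level (floored at 0)."""
    return max(0.0, sample_db - baseline_db)

print(anger_score(74.0, 60.0))  # 14.0 dB above baseline
print(anger_score(58.0, 60.0))  # 0.0: at or below normal tone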
Economics objections
One of the main critics of operationalism in social science argues that the original goal "was to eliminate the subjective mentalistic concepts that had dominated earlier psychological theory and to replace them with a more operationally meaningful account of human behavior," but that, as in economics, the supporters ultimately ended up "turning operationalism inside out".[3] "Instead of replacing 'metaphysical' terms such as 'desire' and 'purpose'," they "used it to legitimize them by giving them operational definitions." Thus in psychology, as in economics, the initial, quite radical operationalist ideas eventually came to serve as little more than a "reassurance fetish"[4] for mainstream methodological practice.[5]
Notes
[1] The basic operationalist thesis, which can be considered a variation on the positivist theme, was that all theoretical terms must be defined via the operations by which one measured them; see Crowther-Heyck, Hunter (2005), Herbert A. Simon: The Bounds of Reason in Modern America, JHU Press, p. 65 (http://books.google.com/books?id=LV1rnS9NBjkC&pg=PA65).
[2] P.W. Bridgman, "Einstein's Theories and the Operational Point of View," in P.A. Schilpp, ed., Albert Einstein: Philosopher-Scientist, Open Court, La Salle, Ill., Cambridge University Press, 1982, Vol. 2, pp. 335-354.
[3] Green (2001), "Operationism Again: What Did Bridgman Say? What Did Bridgman Need?" Theory and Psychology 11 (2001), p. 49.
[4] Koch, Sigmund (1992), "Psychology's Bridgman vs. Bridgman's Bridgman: An Essay in Reconstruction," Theory and Psychology 2(3) (1992), p. 275.
[5] Wade Hands (2004), "On operationalisms and economics" (December 2004) (http://www.redorbit.com/news/science/112364/on_operationalisms_and_economics/).
Bibliography
Bridgman, P.W. (1927). The Logic of Modern Physics.
Opinion poll
An opinion poll, sometimes simply referred to as a poll, is a survey of public opinion from a particular sample. Opinion polls are usually designed to represent the opinions of a population by putting a series of questions to the sample and then extrapolating generalities, in ratio or within confidence intervals, to the population.
History
The first known example of an opinion poll was a local straw poll conducted by The Harrisburg Pennsylvanian in 1824, showing Andrew Jackson leading John Quincy Adams by 335 votes to 169 in the contest for the United States presidency. Since Jackson won the popular vote in that state and the whole country, such straw votes gradually became more popular, but they remained local, usually city-wide, phenomena. In 1916, the Literary Digest embarked on a national survey (partly as a circulation-raising exercise) and correctly predicted Woodrow Wilson's election as president. Mailing out millions of postcards and simply counting the returns, the Digest correctly predicted the victories of Warren Harding in 1920, Calvin Coolidge in 1924, Herbert Hoover in 1928, and Franklin Roosevelt in 1932. Then, in 1936, its 2.3 million "voters" constituted a huge sample; however, they were generally more affluent Americans who tended to have Republican sympathies. The Literary Digest was ignorant of this new bias. The week before election day, it reported that Alf Landon was far more popular than Roosevelt. At the same time, George Gallup conducted a far smaller but more scientifically based survey, in which he polled a demographically representative sample. Gallup correctly predicted Roosevelt's landslide victory. The Literary Digest soon went out of business, while polling started to take off. Elmo Roper was another American pioneer in political forecasting using scientific polls. He predicted the reelection of President Franklin D. Roosevelt three times, in 1936, 1940, and 1944. Louis Harris had been in the field of public opinion since 1947, when he joined the Elmo Roper firm, later becoming a partner. In September 1938 Jean Stoetzel, after having met Gallup, created IFOP, the Institut Français d'Opinion Publique, as the first European survey institute, in Paris, and started political polls in summer 1939 with the question "Why die for Danzig?", looking for popular support for or dissent from this question asked by the appeasement politician and future collaborationist Marcel Déat. Gallup launched a subsidiary in the United Kingdom that, almost alone, correctly predicted Labour's victory in the 1945 general election, unlike virtually all other commentators, who expected a victory for the Conservative Party, led by Winston Churchill.
The Allied occupation powers helped to create survey institutes in all of the Western occupation zones of Germany in 1947 and 1948 in order to better steer denazification. By the 1950s, various types of polling had spread to most democracies.
Polls can also be used in the public relations field. In the early 1920s, public relations experts described their work as a two-way street: their job was to present the misinterpreted interests of large institutions to the public, and to gauge the typically ignored interests of the public through polls.
Benchmark polls
A benchmark poll is generally the first poll taken in a campaign. It is often taken before a candidate announces their bid for office but sometimes it happens immediately following that announcement after they have had some opportunity to raise funds. This is generally a short and simple survey of likely voters. A benchmark poll serves a number of purposes for a campaign, whether it is a political campaign or some other type of campaign. First, it gives the candidate a picture of where they stand with the electorate before any campaigning takes place. If the poll is done prior to announcing for office the candidate may use the poll to decide whether or not they should even run for office. Secondly, it shows them where their weaknesses and strengths are in two main areas. The first is the electorate. A benchmark poll shows them what types of voters they are sure to win, those who they are sure to lose, and everyone in-between those two extremes. This lets the campaign know which voters are persuadable so they can spend their limited resources in the most effective manner. Second, it can give them an idea of what messages, ideas, or slogans are the strongest with the electorate.[1]
Brushfire polls
Brushfire polls are polls taken during the period between the benchmark poll and tracking polls. The number of brushfire polls taken by a campaign is determined by how competitive the race is and how much money the campaign has to spend. These polls usually focus on likely voters, and the length of the survey varies with the number of messages being tested. Brushfire polls are used for a number of purposes. First, they let the candidate know whether they have made any progress on the ballot, how much progress has been made, and in what demographics they have been making or losing ground. Secondly, they are a way for the campaign to test a variety of messages, both positive and negative, on themselves and their opponent(s). This lets the campaign know what messages work best with certain demographics and what messages should be avoided. Campaigns often use these polls to test possible attack messages that their opponent may use, and potential responses to those attacks. The campaign can then spend some time preparing an effective response to any likely attacks. Thirdly, this kind of poll can be used by candidates or political parties to convince primary challengers to drop out of a race and support a stronger candidate.
Tracking polls
A tracking poll is a poll repeated at intervals, generally averaged over a trailing window. For example, a weekly tracking poll uses the data from the past week and discards older data. A caution is that estimating the trend is more difficult and error-prone than estimating the level: intuitively, if one estimates the change, the difference between two numbers X and Y, then one has to contend with the error in both X and Y. It is not enough to simply take the difference, as the change may be random noise. For details, see t-test. A rough guide is that if the change in measurement falls outside the margin of error, it is worth attention.
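The caution above can be made concrete. Assuming two independent samples and the usual normal approximation for a proportion, the combined margin of error of a change is the root sum of squares of the two individual margins; the sample sizes and proportions below are invented for illustration.

import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion, in points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

def change_is_signal(p1, n1, p2, n2):
    moe_combined = math.hypot(margin_of_error(p1, n1),
                              margin_of_error(p2, n2))
    change = 100 * (p2 - p1)
    return abs(change) > moe_combined, change, moe_combined

print(change_is_signal(0.48, 900, 0.51, 900))
# (False, 3.0, ~4.6): a 3-point move is still inside the combined margin.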
Nonresponse bias
Since some people do not answer calls from strangers or refuse to answer the poll, poll samples may not be representative samples of a population, due to non-response bias. Because of this selection bias, the characteristics of those who agree to be interviewed may be markedly different from those who decline. That is, the actual sample is a biased version of the universe the pollster wants to analyze. In these cases, bias introduces new errors, one way or the other, in addition to errors caused by sample size. Error due to bias does not become smaller with larger sample sizes, because taking a larger sample size simply repeats the same mistake on a larger scale. If the people who refuse to answer, or are never reached, have the same characteristics as the people who do answer, then the final results should be unbiased. If the people who do not answer have different opinions, then there is bias in the results. In terms of election polls, studies suggest that bias effects are small, but each polling firm has its own techniques for adjusting weights to minimize selection bias.
Response bias
Survey results may be affected by response bias, where the answers given by respondents do not reflect their true beliefs. This may be deliberately engineered by unscrupulous pollsters in order to generate a certain result or to please their clients, but more often it is a result of the detailed wording or ordering of questions (see below). Respondents may deliberately try to manipulate the outcome of a poll, e.g. by advocating a more extreme position than they actually hold in order to boost their side of the argument, or by giving rapid and ill-considered answers in order to hasten the end of their questioning. Respondents may also feel under social pressure not to give an unpopular answer. For example, respondents might be unwilling to admit to unpopular attitudes like racism or sexism, and thus polls might not reflect the true incidence of these attitudes in the population. In American political parlance, this phenomenon is often referred to as the Bradley effect. If the results of surveys are widely publicized, this effect may be magnified, a phenomenon commonly referred to as the spiral of silence.
Wording of questions
It is well established that the wording of questions, the order in which they are asked, and the number and form of the alternative answers offered can influence the results of polls. For instance, the public is more likely to indicate support for a person who is described by the operator as one of the "leading candidates". This support itself overrides subtle bias for one candidate, as does lumping some candidates into an "other" category, or vice versa. Thus comparisons between polls often boil down to the wording of the question. On some issues, question wording can result in quite pronounced differences between surveys.[6][7][8] This can also, however, be a result of legitimately conflicted feelings or evolving attitudes, rather than a poorly constructed survey.[9] A common technique to control for this bias is to rotate the order in which questions are asked. Many pollsters also split-sample, a technique that involves presenting two different versions of a question, each version to half the respondents. The most effective controls, used by attitude researchers, are: asking enough questions to cover all aspects of an issue and to control effects due to the form of the question (such as positive or negative wording), the adequacy of the number of questions being established quantitatively with psychometric measures such as reliability coefficients; and analyzing the results with psychometric techniques that synthesize the answers into a few reliable scores and detect ineffective questions. These controls are not widely used in the polling industry.
Coverage bias
Another source of error is the use of samples that are not representative of the population, as a consequence of the methodology used, as was the experience of the Literary Digest in 1936. For example, telephone sampling has a built-in error because in many times and places those with telephones have generally been richer than those without. In some places many people have only mobile telephones. Because pollsters cannot call mobile phones (it is unlawful in the United States to make unsolicited calls to phones where the phone's owner may be charged simply for taking a call), these individuals are typically excluded from polling samples. There is concern that, if the subset of the population without cell phones differs markedly from the rest of the population, these differences can skew the results of the poll. Polling organizations have developed many weighting techniques to help overcome these deficiencies, with varying degrees of success (a sketch of one such technique follows below). Studies of mobile phone users by the Pew Research Center in the US, in 2007, concluded that "cell-only respondents are different from landline respondents in important ways, (but) they were neither numerous enough nor different enough on the questions we examined to produce a significant change in overall general population survey estimates when included with the landline samples and weighted according to US Census parameters on basic demographic characteristics."
This issue was first identified in 2004, but came to prominence only during the 2008 US presidential election. In previous elections, the proportion of the general population using cell phones was small, but as this proportion has increased there is concern that polling only landlines is no longer representative of the general population. In 2003, only 2.9% of households were wireless (cellphone-only), compared to 12.8% in 2006. This results in "coverage error". Many polling organisations select their sample by dialling random telephone numbers; however, in 2008 there was a clear tendency for polls which included mobile phones in their samples to show a much larger lead for Obama than polls that did not.
The potential sources of bias are:
1. Some households use cellphones only and have no landline. This tends to include minorities and younger voters, and occurs more frequently in metropolitan areas. Men are more likely to be cellphone-only than women.
2. Some people may not be contactable by landline from Monday to Friday and may be contactable only by cellphone.
3. Some people use their landlines only to access the Internet, and answer calls only on their cellphones.
Some polling companies have attempted to get around the problem by including a "cellphone supplement". There are a number of problems with including cellphones in a telephone poll:
1. It is difficult to get cooperation from cellphone users, because in many parts of the US users are charged for both outgoing and incoming calls. That means that pollsters have had to offer financial compensation to gain cooperation.
2. US federal law prohibits the use of automated dialling devices to call cellphones (Telephone Consumer Protection Act of 1991). Numbers therefore have to be dialled by hand, which is more time-consuming and expensive for pollsters.
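The weighting mentioned above can be sketched in a few lines. This is a minimal post-stratification example, one of many techniques pollsters use; the strata, responses, and population shares are invented placeholders, not census figures.

sample = [("landline", 1), ("landline", 0), ("landline", 1),
          ("cell_only", 0)]                   # (stratum, supports candidate)
population_share = {"landline": 0.70, "cell_only": 0.30}  # assumed targets

counts = {}
for stratum, _ in sample:
    counts[stratum] = counts.get(stratum, 0) + 1
n = len(sample)

# Weight each respondent by population share / sample share of their stratum.
weights = {s: population_share[s] / (c / n) for s, c in counts.items()}

weighted = (sum(weights[s] * y for s, y in sample) /
            sum(weights[s] for s, _ in sample))
print(round(weighted, 3))  # 0.467, versus the unweighted 0.5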
An oft-quoted example of opinion polls succumbing to errors occurred during the UK general election of 1992. Despite the polling organizations using different methodologies, virtually all the polls taken before the vote, and to a lesser extent exit polls taken on voting day, showed a lead for the opposition Labour party, but the actual vote gave a clear victory to the ruling Conservative party. In their deliberations after this embarrassment, the pollsters advanced several ideas to account for their errors, including:
Late swing: voters who changed their minds shortly before voting tended to favour the Conservatives, so the error was not as great as it first appeared.
Nonresponse bias: Conservative voters were less likely to participate in surveys than in the past and were thus under-represented.
The Shy Tory Factor: the Conservatives had suffered a sustained period of unpopularity as a result of economic difficulties and a series of minor scandals, leading to a spiral of silence in which some Conservative supporters were reluctant to disclose their sincere intentions to pollsters.
The relative importance of these factors was, and remains, a matter of controversy, but since then the polling organizations have adjusted their methodologies and have achieved more accurate results in subsequent election campaigns.[citation needed]
Failures
The most widely publicized failure of opinion polling to date in the United States was the prediction that Thomas Dewey would defeat Harry S. Truman in the 1948 US presidential election. Major polling organizations, including Gallup and Roper, indicated a landslide victory for Dewey. In the United Kingdom, most polls failed to predict the Conservative election victories of 1970 and 1992, and Labour's victory in 1974. However, their figures at other elections have been generally accurate.
Influence
Effect on voters
By providing information about voting intentions, opinion polls can sometimes influence the behavior of electors, and in his book The Broken Compass, Peter Hitchens asserts that opinion polls are actually a device for influencing public opinion.[] The various theories about how this happens can be split into two groups: bandwagon/underdog effects, and strategic ("tactical") voting. A bandwagon effect occurs when the poll prompts voters to back the candidate shown to be winning in the poll. The idea that voters are susceptible to such effects is old, stemming at least from 1884; William Safire reported that the term was first used in a political cartoon in the magazine Puck in that year.[10] It has also remained persistent in spite of a lack of empirical corroboration until the late 20th century. George Gallup spent much effort in vain trying to discredit this theory in his time by presenting empirical research. A recent meta-study of scientific research on this topic indicates that from the 1980s onward the Bandwagon effect is found more often by researchers.[11] The opposite of the bandwagon effect is the underdog effect. It is often mentioned in the media. This occurs when people vote, out of sympathy, for the party perceived to be "losing" the elections. There is less empirical evidence for the existence of this effect than there is for the existence of the bandwagon effect.[11] The second category of theories on how polls directly affect voting is called strategic or tactical voting. This theory is based on the idea that voters view the act of voting as a means of selecting a government. Thus they will sometimes not choose the candidate they prefer on ground of ideology or sympathy, but another, less-preferred, candidate from strategic considerations. An example can be found in the United Kingdom general election, 1997. As he was then a Cabinet Minister, Michael Portillo's constituency of Enfield Southgate was believed to be a safe seat but opinion polls showed the Labour candidate Stephen Twigg steadily gaining support, which may have prompted undecided voters or supporters of other parties to support Twigg in order to remove Portillo. Another example is the boomerang effect where the likely supporters of the candidate shown to be winning feel that chances are slim and that their vote is not required, thus allowing another candidate to win. In addition, Mark Pickup in Cameron Anderson and Laura Stephenson's "Voting Behaviour in Canada" outlines three additional "behavioural" responses that voters may exhibit when faced with polling data. The first is known as a "cue taking" effect which holds that poll data is used as a "proxy" for information about the candidates or parties. Cue taking is "based on the psychological phenomenon of using heuristics to simplify a
complex decision" (243).[12] The second, first described by Petty and Cacioppo (1996), is known as "cognitive response" theory. This theory asserts that a voter's response to a poll may not be in line with their initial conception of the electoral reality. In response, the voter is likely to generate a "mental list" in which they create reasons for a party's loss or gain in the polls. This can reinforce or change their opinion of the candidate and thus affect voting behaviour. Third, the final possibility is a "behavioural response", which is similar to a cognitive response. The only salient difference is that a voter will go and seek new information to form their "mental list", thus becoming more informed of the election. This may then affect voting behaviour. These effects indicate how opinion polls can directly affect political choices of the electorate. But directly or indirectly, other effects can be surveyed and analyzed on all political parties. The form of media framing and party ideology shifts must also be taken into consideration. Opinion polling in some instances is a measure of cognitive bias, which is variably considered and handled appropriately in its various applications.
Effect on politicians
Starting in the 1980s, tracking polls and related technologies began having a notable impact on U.S. political leaders. According to Douglas Bailey, a Republican who had helped run Gerald Ford's 1976 presidential campaign, "It's no longer necessary for a political candidate to guess what an audience thinks. He can [find out] with a nightly tracking poll. So it's no longer likely that political leaders are going to lead. Instead, they're going to follow."
Regulation
Some jurisdictions around the world restrict the publication of the results of opinion polls, in order to prevent possibly erroneous results from affecting voters' decisions. For instance, in Canada it is prohibited to publish the results of opinion surveys that would identify specific political parties or candidates in the final three days before a poll closes. However, most Western democratic nations do not support an outright prohibition on the publication of pre-election opinion polls; most of them have no regulation, and some prohibit publication only in the final days or hours before the relevant poll closes. A survey by Canada's Royal Commission on Electoral Reform reported that the prohibition period for publication of survey results differed widely from country to country. Of the 20 countries examined, three prohibited publication during the entire period of campaigns, while others prohibited it for a shorter term, such as the polling period or the final 48 hours before a poll closes.
Footnotes
[1] Kenneth F. Warren (1992). In Defense of Public Opinion Polling. Westview Press. pp. 200-201.
[2] An estimate of the margin of error in percentage points can be gained by the formula 100 divided by the square root of the sample size.
[4] Lynch, Scott M. Introduction to Bayesian Statistics and Estimation for Social Scientists (2007).
[5] http://www.daytodaypolitics.com/polls/presidential_election_Obama_vs_McCain_2008.htm
[8] "Public Agenda Issue Guide: Abortion - Public View - Red Flags" (http://www.publicagenda.org/citizen/issueguides/abortion/publicview/redflags). Public Agenda.
[10] Safire, William. Safire's Political Dictionary, page 42. Random House, 1993.
[11] Irwin, Galen A. and Joop J. M. Van Holsteyn. Bandwagons, Underdogs, the Titanic and the Red Cross: The Influence of Public Opinion Polls on Voters (2000).
External references
Asher, Herbert. Polling and the Public: What Every Citizen Should Know, fourth edition. Washington, D.C.: CQ Press, 1998.
Bourdieu, Pierre. "Public Opinion does not exist," in Sociology in Question. London: Sage, 1995.
Bradburn, Norman M. and Seymour Sudman. Polls and Surveys: Understanding What They Tell Us (1988).
Cantril, Hadley. Gauging Public Opinion (1944).
Cantril, Hadley and Mildred Strunk, eds. Public Opinion, 1935-1946 (1951) (http://www.questia.com/PM.qst?a=o&d=98754501); a massive compilation of many public opinion polls from the US, UK, Canada, Australia, and elsewhere.
Converse, Jean M. Survey Research in the United States: Roots and Emergence 1890-1960 (1987); the standard history.
Crespi, Irving. Public Opinion, Polls, and Democracy (1989) (http://www.questia.com/PM.qst?a=o&d=8971691).
Gallup, George. Public Opinion in a Democracy (1939).
Gallup, Alec M., ed. The Gallup Poll Cumulative Index: Public Opinion, 1935-1997 (1999); lists 10,000+ questions, but no results.
Gallup, George Horace, ed. The Gallup Poll: Public Opinion, 1935-1971, 3 vols. (1972); summarizes the results of each poll.
Glynn, Carroll J., Susan Herbst, Garrett J. O'Keefe, and Robert Y. Shapiro. Public Opinion (1999) (http://www.questia.com/PM.qst?a=o&d=100501261); textbook.
Lavrakas, Paul J. et al., eds. Presidential Polls and the News Media (1995) (http://www.questia.com/PM.qst?a=o&d=28537852).
Moore, David W. The Superpollsters: How They Measure and Manipulate Public Opinion in America (1995) (http://www.questia.com/PM.qst?a=o&d=8540600).
Niemi, Richard G., John Mueller, and Tom W. Smith, eds. Trends in Public Opinion: A Compendium of Survey Data (1989) (http://www.questia.com/PM.qst?a=o&d=28621255).
Oskamp, Stuart and P. Wesley Schultz. Attitudes and Opinions (2004) (http://www.questia.com/PM.qst?a=o&d=104829752).
Robinson, Claude E. Straw Votes (1932).
Robinson, Matthew. Mobocracy: How the Media's Obsession with Polling Twists the News, Alters Elections, and Undermines Democracy (2002).
Rogers, Lindsay. The Pollsters: Public Opinion, Politics, and Democratic Leadership (1949) (http://www.questia.com/PM.qst?a=o&d=89021667).
Traugott, Michael W. The Voter's Guide to Election Polls (http://www.questia.com/PM.qst?a=o&d=71288534), 3rd ed. (2004).
Webster, James G., Patricia F. Phalen, and Lawrence W. Lichty. Ratings Analysis: The Theory and Practice of Audience Research. Lawrence Erlbaum Associates, 2000.
Young, Michael L. Dictionary of Polling: The Language of Contemporary Opinion Research (1992) (http://www.questia.com/PM.qst?a=o&d=59669912).
Additional sources:
Walden, Graham R. Survey Research Methodology, 1990-1999: An Annotated Bibliography. Bibliographies and Indexes in Law and Political Science Series. Westport, CT: Greenwood Press, Greenwood Publishing Group, Inc., 2002. xx, 432p.
Walden, Graham R. Public Opinion Polls and Survey Research: A Selective Annotated Bibliography of U.S. Guides and Studies from the 1980s. Public Affairs and Administrative Series, edited by James S. Bowman, vol. 24. New York, NY: Garland Publishing Inc., 1990. xxix, 360p.
Walden, Graham R. Polling and Survey Research Methods 1935-1979: An Annotated Bibliography. Bibliographies and Indexes in Law and Political Science Series, vol. 25. Westport, CT: Greenwood Publishing Group, Inc., 1996. xxx, 581p.
External links
Polls (http://ucblibraries.colorado.edu/govpubs/us/polls.htm), from UCB Libraries GovPubs.
The Pew Research Center (http://www.pewresearch.org), a nonpartisan "fact tank" providing information on the issues, attitudes, and trends shaping America and the world by conducting public opinion polling and social science research.
"Use Opinion Research To Build Strong Communication" (http://www.gcastrategies.com/books_articles/article_001_or.php), by Frank Noto.
Public Agenda for Citizens (http://www.publicagenda.org/), a nonpartisan, nonprofit group that tracks public opinion data in the United States.
National Council on Public Polls (http://www.ncpp.org/?q=home), an association of polling organizations in the United States devoted to setting high professional standards for surveys.
How Will America Vote (http://howwillamericavote.com), aggregates polling data with demographic sub-samples.
USA Election Polls (http://www.usaelectionpolls.com), tracks public opinion polls related to elections in the US.
Survey Analysis Tool (http://www.i-marvin.si), based on A. Berkopec, "HyperQuick algorithm for discrete hypergeometric distribution," Journal of Discrete Algorithms, Elsevier, 2006 (http://dx.doi.org/10.1016/j.jda.2006.01.001).
"Poll Position - Issue 010 - GOOD" (http://www.good.is/post/poll_position/), track record of pollsters for US presidential elections, in Good magazine, April 23, 2008.
Pairwise comparison
Pairwise comparison generally refers to any process of comparing entities in pairs to judge which of the two is preferred, or has a greater amount of some quantitative property. The method of pairwise comparison is used in the scientific study of preferences, attitudes, voting systems, social choice, public choice, and multiagent AI systems. In the psychology literature, it is often referred to as paired comparison. The prominent psychometrician L. L. Thurstone first introduced a scientific approach to using pairwise comparisons for measurement in 1927, which he referred to as the law of comparative judgment. Thurstone linked this approach to psychophysical theory developed by Ernst Heinrich Weber and Gustav Fechner, and demonstrated that the method can be used to order items along a dimension such as preference or importance using an interval-type scale.
Overview
If an individual or organization expresses a preference between two mutually distinct alternatives, this preference can be expressed as a pairwise comparison. If the two alternatives are x and y, the following are the possible pairwise comparisons:
The agent prefers x over y: "x > y" or "xPy".
The agent prefers y over x: "y > x" or "yPx".
The agent is indifferent between the two alternatives: "x = y" or "xIy".
Probabilistic models
In terms of modern psychometric theory, Thurstone's approach, called the law of comparative judgment, is more aptly regarded as a measurement model. The Bradley-Terry-Luce (BTL) model (Bradley & Terry, 1952; Luce, 1959) is often applied to pairwise comparison data to scale preferences. The BTL model is identical to Thurstone's model if the simple logistic function is used. Thurstone used the normal distribution in applications of the model. The simple logistic function varies by less than 0.01 from the cumulative normal ogive across the range, given an arbitrary scale factor.
In the BTL model, the probability that object j is judged to have more of an attribute than object i is
p_{ij} = \frac{\exp(\delta_j - \delta_i)}{1 + \exp(\delta_j - \delta_i)},
where \delta_i and \delta_j are the scale locations of the two objects; a scale location might represent the perceived quality of a product, or the perceived weight of an object. The BTL is very closely related to the Rasch model for measurement.
Thurstone used the method of pairwise comparisons as an approach to measuring the perceived intensity of physical stimuli, attitudes, preferences, choices, and values. He also studied the implications of the theory he developed for opinion polls and political voting (Thurstone, 1959).
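A small sketch of the model as given above: the logistic win probability, followed by a few lines that fit the strength parameters from hypothetical win counts using the standard minorization-maximization update for the BTL model. All data are invented.

import math

def btl_prob(delta_j: float, delta_i: float) -> float:
    """P(object j judged to have more of the attribute than object i)."""
    return 1.0 / (1.0 + math.exp(-(delta_j - delta_i)))

print(round(btl_prob(1.0, 0.0), 3))  # 0.731

# wins[i][j] = number of times i was preferred to j (hypothetical counts)
wins = [[0, 7, 9], [3, 0, 6], [1, 4, 0]]
n = len(wins)
p = [1.0] * n                        # strengths pi_i; delta_i = log(pi_i)
for _ in range(200):                 # MM updates, renormalized each pass
    p = [sum(wins[i]) /
         sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
             for j in range(n) if j != i)
         for i in range(n)]
    total = sum(p)
    p = [x / total for x in p]
print([round(x, 3) for x in p])      # estimated relative strengths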
Transitivity
For a given decision agent, if the information, objective, and alternatives used by the agent remain constant, then it is generally assumed that pairwise comparisons over those alternatives by the decision agent are transitive. Most agree upon what transitivity is, though there is debate about the transitivity of indifference. The rules of transitivity are as follows for a given decision agent:
If xPy and yPz, then xPz.
If xPy and yIz, then xPz.
If xIy and yPz, then xPz.
If xIy and yIz, then xIz.
This corresponds to (xPy or xIy) being a total preorder, P being the corresponding strict weak order, and I being the corresponding equivalence relation. A check of these four rules against recorded judgments is sketched below.
Probabilistic models require transitivity only within the bounds of errors of estimates of the scale locations of entities. Thus, decisions need not be deterministically transitive in order to apply probabilistic models. However, transitivity will generally hold for a large number of comparisons if models such as the BTL can be effectively applied. Using a transitivity test,[1] one can investigate whether a data set of pairwise comparisons contains a higher degree of transitivity than expected by chance.
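The promised check of the four rules, as a minimal Python sketch; 'P' marks strict preference, 'I' indifference, 'N' the reverse preference, and the sample relation is invented.

from itertools import permutations

def pref(j, a, b):
    """Judgment for the ordered pair (a, b): 'P' (a over b), 'I', or 'N'."""
    if (a, b) in j:
        return j[(a, b)]
    return {"P": "N", "I": "I", "N": "P"}[j[(b, a)]]

def is_transitive(items, j):
    for x, y, z in permutations(items, 3):
        if pref(j, x, y) in "PI" and pref(j, y, z) in "PI":
            # Rules 1-3: any strict preference in the premises forces xPz;
            # rule 4: two indifferences force xIz.
            need = "P" if "P" in (pref(j, x, y), pref(j, y, z)) else "I"
            if pref(j, x, z) != need:
                return False
    return True

judgments = {("a", "b"): "P", ("b", "c"): "I", ("a", "c"): "P"}
print(is_transitive(["a", "b", "c"], judgments))  # True: xPy, yIz, xPz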
Preference orders
If pairwise comparisons are in fact transitive with respect to the four rules mentioned, then pairwise comparisons for a list of alternatives (A1, A2, A3, ..., An−1, and An) can take the form:

A1 (> XOR =) A2 (> XOR =) A3 (> XOR =) ... (> XOR =) An−1 (> XOR =) An

For example, if there are three alternatives a, b, and c, then the possible preference orders without indifference are a > b > c, a > c > b, b > a > c, b > c > a, c > a > b, and c > b > a; allowing indifference adds orders such as a = b > c, for thirteen orders in all.
If the number of alternatives is n, and indifference is not allowed, then the number of possible preference orders for any given n-value is n!. If indifference is allowed, then the number of possible preference orders is the number of total preorders. It can be expressed as a function of n:

$$a(n) = \sum_{k=0}^{n} k! \left\{ {n \atop k} \right\}$$

where $\left\{ {n \atop k} \right\}$ is the Stirling number of the second kind.
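Both counts are easy to compute. The sketch below (an illustration, not part of the cited sources) evaluates n! for strict orders and the ordered Bell numbers for total preorders via the recurrence a(n) = sum over k of C(n, k) · a(n − k).

```python
from math import factorial, comb

def ordered_bell(n):
    """Number of total preorders of n alternatives (OEIS A000670)."""
    a = [1]
    for m in range(1, n + 1):
        a.append(sum(comb(m, k) * a[m - k] for k in range(1, m + 1)))
    return a[n]

for n in range(1, 5):
    print(n, factorial(n), ordered_bell(n))
# n = 3: 6 strict orders, 13 preference orders once indifference is allowed
```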
Applications
One important application of pairwise comparisons is the widely used Analytic Hierarchy Process, a structured technique for helping people deal with complex decisions. It uses pairwise comparisons of tangible and intangible factors to construct ratio scales that are useful in making important decisions.[1]
References
[1] Nikolić D (2012) Non-parametric detection of temporal order across pairwise measurements of time delays. Journal of Computational Neuroscience, 22(1), pp. 5–19. http://www.danko-nikolic.com/wp-content/uploads/2011/09/Nikolic-Transitivity-2007.pdf
" Sloane's A000142 : Factorial numbers (http://oeis.org/A000142)", The On-Line Encyclopedia of Integer Sequences. OEIS Foundation. " Sloane's A000670 : Number of preferential arrangements of n labeled elements (http://oeis.org/A000670)", The On-Line Encyclopedia of Integer Sequences. OEIS Foundation. Y. Chevaleyre, P.E. Dunne, U. Endriss, J. Lang, M. Lematre, N. Maudet, J. Padget, S. Phelps, J.A. Rodrguez-Aguilar, and P. Sousa. Issues in Multiagent Resource Allocation. Informatica, 30:331, 2006.
Further reading
How to Analyze Paired Comparison Data (http://www.ee.washington.edu/research/guptalab/publications/PairedComparisonTutorialTsukidaGuptaUWTechReport2011.pdf)
Bradley, R.A. and Terry, M.E. (1952). Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika, 39, 324–345.
David, H.A. (1988). The Method of Paired Comparisons. New York: Oxford University Press.
Luce, R.D. (1959). Individual Choice Behavior: A Theoretical Analysis. New York: J. Wiley.
Thurstone, L.L. (1927). A law of comparative judgement. Psychological Review, 34, 278–286.
Thurstone, L.L. (1929). The Measurement of Psychological Value. In T.V. Smith and W.K. Wright (Eds.), Essays in Philosophy by Seventeen Doctors of Philosophy of the University of Chicago. Chicago: Open Court.
Thurstone, L.L. (1959). The Measurement of Values. Chicago: The University of Chicago Press.
Pathfinder network
Several psychometric scaling methods start from proximity data and yield structures revealing the underlying organization of the data. Data clustering and multidimensional scaling are two such methods. Network scaling represents another method based on graph theory. Pathfinder networks are derived from proximities for pairs of entities. Proximities can be obtained from similarities, correlations, distances, conditional probabilities, or any other measure of the relationships among entities. The entities are often concepts of some sort, but they can be anything with a pattern of relationships. In the Pathfinder network, the entities correspond to the nodes of the generated network, and the links in the network are determined by the patterns of proximities. For example, if the proximities are similarities, links will generally connect nodes of high similarity. The links in the network will be undirected if the proximities are symmetrical for every pair of entities. Symmetrical proximities mean that the order of the entities is not important, so the proximity of i and j is the same as the proximity of j and i for all pairs i, j. If the proximities are not symmetrical for every pair, the links will be directed. One published example is an undirected Pathfinder network derived from average similarity ratings collected from a group of biology graduate students, who rated the similarity of all pairs of a set of biology terms.
Pathfinder uses two parameters. (1) The q parameter constrains the number of indirect proximities examined in generating the network. The q parameter is an integer value between 2 and n − 1, inclusive, where n is the number of nodes or items. (2) The r parameter defines the metric used for computing the distance of paths (cf. the Minkowski distance). The r parameter is a real number between 1 and infinity, inclusive. A network generated with particular values of q and r is called a PFnet(q, r). Both of the parameters have the effect of decreasing the number of links in the network as their values are increased. The network with the minimum number of links is obtained when q = n − 1 and r = ∞, i.e., PFnet(n − 1, ∞). With ordinal-scale data (see level of measurement), the r parameter should be infinity because the same PFnet would result from any positive monotonic transformation of the proximity data. Other values of r require data measured on a ratio scale. The q parameter can be varied to yield the desired number of links in the network.
Essentially, Pathfinder networks preserve the shortest possible paths given the data, so links are eliminated when they are not on shortest paths. The PFnet(n − 1, ∞) will be the minimum spanning tree for the links defined by the proximity data if a unique minimum spanning tree exists. In general, the PFnet(n − 1, ∞) includes all of the links in any minimum spanning tree. Pathfinder networks are used in the study of expertise, knowledge acquisition, knowledge engineering, citation patterns, information retrieval, and data visualization. The networks are potentially applicable to any problem addressed by network theory.
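The link-elimination rule can be sketched directly: compute the minimal path cost between every pair of nodes under the Minkowski r-metric, and keep a link only when no indirect path is cheaper. The Python below is a minimal illustration of PFnet(n − 1, r) on a toy distance matrix, assuming symmetric distances; it is not one of the published fast algorithms.

```python
import numpy as np

def pfnet(dist, r=np.inf):
    """Sketch of PFnet(q = n-1, r): a link survives only if no indirect
    path has a smaller Minkowski r-metric cost. `dist` is a symmetric
    matrix of distances (larger = less proximal)."""
    n = dist.shape[0]
    d = dist.astype(float).copy()
    for k in range(n):  # Floyd-Warshall relaxation with Minkowski path cost
        for i in range(n):
            for j in range(n):
                via = (max(d[i, k], d[k, j]) if np.isinf(r)
                       else (d[i, k] ** r + d[k, j] ** r) ** (1.0 / r))
                d[i, j] = min(d[i, j], via)
    return (dist <= d) & ~np.eye(n, dtype=bool)  # True where a link is kept

dist = np.array([[0., 1., 4.],
                 [1., 0., 2.],
                 [4., 2., 0.]])
print(pfnet(dist))  # the 0-2 link is pruned: path 0-1-2 costs max(1, 2) = 2 < 4
```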
References
Further information on Pathfinder networks and several examples of the application of PFnets to a variety of problems can be found in: Schvaneveldt, R. W. (Ed.) (1990). Pathfinder Associative Networks: Studies in Knowledge Organization. Norwood, NJ: Ablex. The book is out of print; a copy can be downloaded: pdf[1]
A shorter article summarizing Pathfinder networks: Schvaneveldt, R. W., Durso, F. T., & Dearholt, D. W. (1989). Network structures in proximity data. In G. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory, Vol. 24 (pp. 249–284). New York: Academic Press. pdf[2]
Three papers describing fast implementations of Pathfinder networks:
Guerrero-Bote, V.; Zapico-Alonso, F.; Espinosa-Calvo, M.; Gomez-Crisostomo, R.; Moya-Anegón, F. (2006). "Binary pathfinder: An improvement to the pathfinder algorithm". Information Processing and Management 42(6): 1484–1490. doi:10.1016/j.ipm.2006.03.015[3]
Quirin, A.; Cordón, O.; Santamaría, J.; Vargas-Quesada, B.; Moya-Anegón, F. (2008). "A new variant of the Pathfinder algorithm to generate large visual science maps in cubic time". Information Processing and Management 44(4): 1611–1623. doi:10.1016/j.ipm.2007.09.005[4]
Quirin, A.; Cordón, O.; Guerrero-Bote, V. P.; Vargas-Quesada, B.; Moya-Anegón, F. (2008). "A Quick MST-based Algorithm to Obtain Pathfinder Networks". Journal of the American Society for Information Science and Technology 59(12): 1912–1924. doi:10.1002/asi.20904[5]
(The algorithm of Quirin et al. is significantly faster, but can only be applied in cases where q = n − 1, while that of Guerrero-Bote et al. can be used for all cases.)
External links
Interlink[6]
Implementation of the MST-Pathfinder algorithm in C++[7]
References
[1] http://interlinkinc.net/PFBook.zip
[2] http://www.interlinkinc.net/Roger/Papers/Schvaneveldt_Durso_Dearholt_1989.pdf
[3] http://dx.doi.org/10.1016%2Fj.ipm.2006.03.015
[4] http://dx.doi.org/10.1016%2Fj.ipm.2007.09.005
[5] http://dx.doi.org/10.1002%2Fasi.20904
[6] http://www.interlinkinc.net
[7] http://aquirin.ovh.org/research/mstpathfinder.html
Perceptual mapping
Perceptual mapping is a diagrammatic technique used by marketers to attempt to visually display the perceptions of customers or potential customers. Typically the position of a product, product line, brand, or company is displayed relative to the competition. Perceptual maps can have any number of dimensions, but the most common is two. For example, one perceptual map might show consumer perceptions of various automobiles on the two dimensions of sportiness/conservativeness and classiness/affordability. In one such study, the sample of consumers felt Porsche was the sportiest and classiest of the cars included (top right corner of the map), and felt Plymouth was the most practical and conservative (bottom left corner).
Cars that are positioned close to each other are seen as similar on the relevant dimensions by the consumer. For example, consumers may see Buick, Chrysler, and Oldsmobile as similar: close competitors that form a competitive grouping. A company considering the introduction of a new model will look for an area on the map free from competitors. Some perceptual maps use different-sized circles to indicate the sales volume or market share of the various competing products. Displaying consumers' perceptions of related products is only half the story. Many perceptual maps also display consumers' ideal points, which reflect ideal combinations of the two dimensions as seen by a consumer. For example, a study of consumers' ideal points in the alcohol/spirits product space might plot each respondent's ideal combination of the two dimensions as a dot. Areas with a dense cluster of ideal points indicate a market segment; areas without ideal points are sometimes referred to as demand voids.
A company considering introducing a new product will look for areas with a high density of ideal points, and for areas without competitive rivals; this is best done by placing both the ideal points and the competing products on the same map. Some maps plot ideal vectors instead of ideal points. For example, a map might display various aspirin products on the dimensions of effectiveness and gentleness, together with ideal vectors; the slope of an ideal vector indicates the preferred ratio of the two dimensions for the consumers within a segment. One such study indicated that there is one segment that is more concerned with effectiveness than gentleness, and another segment that is more interested in gentleness than strength.
Perceptual maps need not come from a detailed study. There are also intuitive maps (also called judgmental maps or consensus maps) that are created by marketers based on their understanding of their industry. Management uses its best judgment. It is questionable how valuable this type of map is; often such maps just give the appearance of credibility to management's preconceptions. When detailed marketing research studies are done, methodological problems can arise, but at least the information is coming directly from the consumer. There is an assortment of statistical procedures that can be used to convert the raw data collected in a survey into a perceptual map. Preference regression will produce ideal vectors. Multidimensional scaling will produce either ideal points or competitor positions. Factor analysis, discriminant analysis, cluster analysis, and logit analysis can also be used. Some techniques are constructed from perceived differences between products, others from perceived similarities. Still others are constructed from cross-price elasticity of demand data from electronic scanners.
Person-fit analysis
Person-fit analysis is a technique for determining whether a person's results on a given test are valid. The purpose of a person-fit analysis is to detect item-score vectors that are unlikely given a hypothesized test theory model such as item response theory, or unlikely compared with the majority of item-score vectors in the sample. An item-score vector is a list of "scores" that a person gets on the items of a test, where "1" is often correct and "0" is incorrect. For example, if a person took a 10-item quiz and only got the first five correct, the vector would be {1111100000}. In individual decision-making in education, psychology, and personnel selection, it is critically important that test users can have confidence in the test scores used. The validity of individual test scores may be threatened when the examinee's answers are governed by factors other than the psychological trait of interest: factors that can range from something as benign as the examinee dozing off to concerted fraud efforts. Person-fit methods are used to detect item-score vectors where such external factors may be relevant and which, as a result, indicate invalid measurement. Unfortunately, person-fit statistics only tell whether a set of responses is likely or unlikely; they cannot prove anything. The results of the analysis might look like an examinee cheated, but there is no way to go back to when the test was administered and prove it. This limits the method's practical applicability on an individual scale. However, it might be useful on a larger scale: if most examinees at a certain test site or with a certain proctor have unlikely responses, an investigation might be warranted.
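One very simple screening statistic in this spirit is the count of Guttman errors: pairs of items where a harder item is answered correctly while an easier one is missed. The Python sketch below is only an illustration (the function and the assumed difficulty ordering are hypothetical), not a full person-fit statistic such as those in the references.

```python
def guttman_errors(scores, difficulty_order):
    """Count pairs where an easier item is wrong but a harder item is
    right; high counts flag improbable item-score vectors."""
    ordered = [scores[i] for i in difficulty_order]  # easiest to hardest
    m = len(ordered)
    return sum(1 for a in range(m) for b in range(a + 1, m)
               if ordered[a] == 0 and ordered[b] == 1)

# Ten-item vectors, items already ordered from easiest to hardest
print(guttman_errors([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], list(range(10))))  # 0
print(guttman_errors([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], list(range(10))))  # 25
```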
References
Emons, W.H.M., Sijtsma, K., & Meijer, R.R. (2005). Global, local and graphical person-fit analysis using person response functions. Psychological Methods, 10(1), 101-119. Emons, W.H.M., Glas, C.A.W., Meijer, R.R., & Sijtsma, K. (2003). Person fit in order-restricted latent class models. Applied Psychological Measurement, 27(6), 459-478. Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person-fit. Applied Psychological Measurement, 25, 107-135.
Phrase completions
Phrase completion scales are a type of psychometric scale used in questionnaires. Developed in response to the problems associated with Likert scales, phrase completions are concise, unidimensional measures that tap ordinal-level data in a manner that approximates interval-level data.
Level of measurement
The response categories represent an ordinal level of measurement. Ordinal level data, however, varies in terms of how closely it approximates interval level data. By using a numerical continuum as the response key instead of sentiments that reflect intensity of agreement, respondents may be able to quantify their responses in more equal units.
References
Hodge, D. R. & Gillespie, D. F. (2003). Phrase Completions: An alternative to Likert scales. Social Work Research, 27(1), 45-55. Hodge, D. R. & Gillespie, D. F. (2005). Phrase Completion Scales. In K. Kempf-Leonard (Editor). Encyclopedia of Social Measurement. (Vol. 3, pp. 53-62). San Diego: Academic Press. Hodge, D. R. & Gillespie, D. F. (2007). Phrase Completion Scales: A Better Measurement Approach than Likert Scales? Journal of Social Service Research, 33, (4), 1-12.
Point-biserial correlation coefficient

The point-biserial correlation coefficient (rpb) is the Pearson product-moment correlation coefficient rXY computed between a continuous variable X and a naturally dichotomous variable Y. It is calculated as
$$r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}$$

where $s_n$ is the standard deviation used when you have data for every member of the population:

$$s_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$

M1 being the mean value on the continuous variable X for all data points in group 1, and M0 the mean value on the continuous variable X for all data points in group 2. Further, n1 is the number of data points in group 1, n0 is the number of data points in group 2, and n is the total sample size. This formula is a computational formula that has been derived from the formula for rXY in order to reduce steps in the calculation; it is easier to compute than rXY. There is an equivalent formula that uses sn−1:

$$r_{pb} = \frac{M_1 - M_0}{s_{n-1}} \sqrt{\frac{n_1 n_0}{n(n-1)}}$$

where $s_{n-1}$ is the standard deviation used when you only have data for a sample of the population:

$$s_{n-1} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$

It's important to note that this is merely an equivalent formula. It is not a formula for use in the case where you only have sample data; there is no version of the formula for that case. The version of the formula using sn−1 is useful if you are calculating point-biserial correlation coefficients in a programming language or other development environment where you have a function available for calculating sn−1 but not for calculating sn. To clarify:

$$s_{n-1} = \sqrt{\frac{n}{n-1}}\; s_n$$

Glass and Hopkins' book Statistical Methods in Education and Psychology (3rd Edition)[1] contains a correct version of the point biserial formula. Also, the square of the point biserial correlation coefficient can be written:

$$r_{pb}^2 = \frac{(M_1 - M_0)^2\, n_1 n_0}{n^2 s_n^2}$$
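The computational formula can be checked against the Pearson correlation directly, since the two must agree. The Python sketch below (with invented data) computes rpb using sn and compares it with the Pearson r between X and the dichotomous Y.

```python
import numpy as np

def point_biserial(x, y):
    """Point-biserial correlation via the computational formula with
    the population standard deviation s_n."""
    x, y = np.asarray(x, float), np.asarray(y, int)
    n = len(x)
    m1, m0 = x[y == 1].mean(), x[y == 0].mean()
    n1, n0 = (y == 1).sum(), (y == 0).sum()
    s_n = x.std()  # numpy's default ddof=0 gives s_n
    return (m1 - m0) / s_n * np.sqrt(n1 * n0 / n**2)

x = [10., 12., 14., 9., 15., 11.]
y = [0, 1, 1, 0, 1, 0]
print(point_biserial(x, y), np.corrcoef(x, y)[0, 1])  # the two values agree
```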
We can test the null hypothesis that the correlation is zero in the population. A little algebra shows that the usual formula for assessing the significance of a correlation coefficient, when applied to rpb, is the same as the formula for an unpaired t-test, and so

$$t = r_{pb} \sqrt{\frac{n_1 + n_0 - 2}{1 - r_{pb}^2}}$$

follows Student's t-distribution with (n1 + n0 − 2) degrees of freedom when the null hypothesis is true. One disadvantage of the point biserial coefficient is that the further the distribution of Y is from 50/50, the more constrained will be the range of values which the coefficient can take. If X can be assumed to be normally distributed, a better descriptive index is given by the biserial coefficient

$$r_b = \frac{M_1 - M_0}{s_n} \cdot \frac{n_1 n_0}{n^2 u}$$

where u is the ordinate of the normal distribution with zero mean and unit variance at the point which divides the distribution into proportions n0/n and n1/n. As you might imagine, this is not the easiest thing in the world to calculate, and the biserial coefficient is not widely used in practice. A specific case of biserial correlation occurs where X is the sum of a number of dichotomous variables of which Y is one. An example of this is where X is a person's total score on a test composed of n dichotomously scored items. A statistic of interest (which is a discrimination index) is the correlation between responses to a given item and the corresponding total test scores. There are three computations in wide use,[2] all called the point-biserial correlation: (i) the Pearson correlation between item scores and total test scores including the item scores, (ii) the Pearson correlation between item scores and total test scores excluding the item scores, and (iii) a correlation adjusted for the bias caused by the inclusion of item scores in the test scores.

A slightly different version of the point biserial coefficient is the rank biserial, which occurs where the variable X consists of ranks while Y is dichotomous. We could calculate the coefficient in the same way as where X is continuous, but it would have the same disadvantage that the range of values it can take on becomes more constrained as the distribution of Y becomes more unequal. To get round this, we note that the coefficient will have its largest value where the smallest ranks are all opposite the 0s and the largest ranks are opposite the 1s. Its smallest value occurs where the reverse is the case. These values are respectively plus and minus (n1 + n0)/2. We can therefore use the reciprocal of this value to rescale the difference between the observed mean ranks onto the interval from plus one to minus one. The result is

$$r_{rb} = \frac{2(M_1 - M_0)}{n_1 + n_0}$$

where M1 and M0 are respectively the means of the ranks corresponding to the 1 and 0 scores of the dichotomous variable. This formula, which simplifies the calculation from the counting of agreements and inversions, is due to Gene V Glass (1966). It is possible to use this to test the null hypothesis of zero correlation in the population from which the sample was drawn. If rrb is calculated as above, then the smaller of

$$\frac{n_1 n_0 (1 + r_{rb})}{2} \quad \text{and} \quad \frac{n_1 n_0 (1 - r_{rb})}{2}$$

is distributed as Mann–Whitney U with sample sizes n1 and n0 when the null hypothesis is true.
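The rank-biserial coefficient and its link to the Mann–Whitney U are equally direct to illustrate. The Python sketch below (toy data only) computes rrb from mean ranks and then forms the smaller of the two quantities above, which equals the U statistic.

```python
import numpy as np
from scipy import stats

def rank_biserial(x, y):
    """Rank-biserial correlation: 2 * (M1 - M0) / (n1 + n0), with M1
    and M0 the mean ranks of the two groups (Glass, 1966)."""
    ranks = stats.rankdata(x)
    y = np.asarray(y, int)
    m1, m0 = ranks[y == 1].mean(), ranks[y == 0].mean()
    return 2 * (m1 - m0) / len(ranks)

x = [3., 1., 4., 2., 6., 5.]
y = [0, 0, 1, 0, 1, 1]
r = rank_biserial(x, y)
n1, n0 = 3, 3
u = min(n1 * n0 * (1 + r) / 2, n1 * n0 * (1 - r) / 2)
print(r, u)  # complete separation here gives r = 1.0 and U = 0
```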
External links
Point Biserial Coefficient [3] (Keith Calkins, 2005)
Notes
[3] http://www.andrews.edu/~calkins/math/edrm611/edrm13.htm#POINTB
Polychoric correlation
In statistics, polychoric correlation is a technique for estimating the correlation between two theorised normally distributed continuous latent variables, from two observed ordinal variables. Tetrachoric correlation is a special case of the polychoric correlation applicable when both observed variables are dichotomous. These names derive from the polychoric and tetrachoric series, mathematical expansions once, but no longer, used for estimation of these correlations.
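As a rough illustration of the underlying idea, a tetrachoric correlation can be estimated by maximum likelihood: fix the two dichotomisation thresholds from the table margins, then choose the latent correlation that best reproduces the observed cell counts under a bivariate normal. The Python sketch below makes exactly those assumptions; it is a minimal sketch, not a production implementation such as the packages listed below.

```python
import numpy as np
from scipy import stats, optimize

def tetrachoric(table):
    """ML tetrachoric correlation for a 2x2 table of counts, assuming
    two dichotomised standard-normal latent variables."""
    n = np.asarray(table, float)
    t1 = stats.norm.ppf(n[0].sum() / n.sum())     # row threshold
    t2 = stats.norm.ppf(n[:, 0].sum() / n.sum())  # column threshold

    def neg_loglik(rho):
        # cell probabilities implied by the latent bivariate normal
        p00 = stats.multivariate_normal(cov=[[1, rho], [rho, 1]]).cdf([t1, t2])
        p = np.array([[p00, stats.norm.cdf(t1) - p00],
                      [stats.norm.cdf(t2) - p00,
                       1 - stats.norm.cdf(t1) - stats.norm.cdf(t2) + p00]])
        return -(n * np.log(np.clip(p, 1e-12, 1))).sum()

    res = optimize.minimize_scalar(neg_loglik, bounds=(-0.99, 0.99),
                                   method="bounded")
    return res.x

print(tetrachoric([[40, 10], [10, 40]]))  # strong positive latent association
```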
Software
polycor package in R by John Fox[1]
psych package in R by William Revelle[2]
PRELIS
POLYCORR program[3]
An extensive list of software for computing the polychoric correlation, by John Uebersax[4]
References
Lee, S.-Y., Poon, W. Y., & Bentler, P. M. (1995). "A two-stage estimation of structural equation models with continuous and polytomous variables". British Journal of Mathematical and Statistical Psychology, 48, 339–358.
Bonett, D. G., & Price, R. M. (2005). "Inferential Methods for the Tetrachoric Correlation Coefficient". Journal of Educational and Behavioral Statistics, 30, 213.
External links
The Tetrachoric and Polychoric Correlation Coefficients [4]
References
[1] http://rss.acs.unt.edu/Rdoc/library/polycor/html/polychor.html
[2] http://cran.r-project.org/web/packages/psych/index.html
[3] http://www.john-uebersax.com/stat/xpc.htm
[4] http://www.john-uebersax.com/stat/tetra.htm
Polynomial conjoint measurement

Informally, the schema argues: a) single attributes are simple polynomials; b) if G1 and G2 are simple polynomials that are disjoint (i.e. have no attributes in common), then G1 + G2 and G1 · G2 are simple polynomials; and c) no polynomials are simple except as given by a) and b). Let A, P and U be single disjoint attributes. From Krantz's (1968) schema it follows that four classes of simple polynomials in three variables exist, which contain a total of eight simple polynomials:
Additive: A + P + U;
Distributive: (A + P)U, plus 2 others obtained by interchanging A, P and U;
Dual distributive: AP + U, plus 2 others as per above;
Multiplicative: APU.
Krantz's (1968) schema can be used to construct simple polynomials of greater numbers of attributes. For example, if D is a single variable disjoint to A, B, and C, then three classes of simple polynomials in four variables are A + B + C + D, D + (B + AC) and D + ABC. This procedure can be employed for any finite number of variables. A simple test is that a simple polynomial can be split into either a product or a sum of two smaller, disjoint simple polynomials.
These polynomials can be further split until single variables are obtained. An expression not amenable to splitting in this manner is not a simple polynomial (e.g. AB + BC + AC; Krantz & Tversky, 1971).
Axioms
Let A, P and U be non-empty and disjoint sets, and let "≽" be a simple order on A × P × U. Krantz et al. (1971) argued that the quadruple ⟨A × P × U, ≽⟩ is a polynomial conjoint system if and only if the following axioms hold.
WEAK ORDER. "≽" is a weak order upon A × P × U.
SINGLE CANCELLATION. "≽" satisfies single cancellation upon A if and only if (a, p, u) ≽ (b, p, u) holds for some p ∈ P and u ∈ U whenever it holds for all p and u. Single cancellation upon P and U is similarly defined.
DOUBLE CANCELLATION. "≽" satisfies double cancellation upon A × P if and only if, for all a, b, c ∈ A and p, q, r ∈ P, (a, q) ≽ (b, r) and (b, p) ≽ (c, q) imply (a, p) ≽ (c, r). The condition holds similarly upon A × U and P × U.
JOINT SINGLE CANCELLATION. "≽" satisfies joint single cancellation upon A × P if and only if (a, p, u) ≽ (b, q, u) holds for some u ∈ U whenever it holds for all u ∈ U. Joint independence is similarly defined for A × U and P × U.
DISTRIBUTIVE CANCELLATION and DUAL DISTRIBUTIVE CANCELLATION. Higher-order cancellation conditions upon A × P × U, required to hold for all relevant elements of A, P and U; these conditions distinguish the distributive and the dual distributive composition rules respectively.
SOLVABILITY. "≽" upon A × P × U is solvable if and only if, for every element of A × P × U and every p ∈ P and u ∈ U, there exists a ∈ A such that (a, p, u) is equivalent to that element, and similarly for P and U.
ARCHIMEDEAN CONDITION. Every strictly bounded standard sequence on any single attribute is finite.
Representation theorems
The quadruple ⟨A × P × U, ≽⟩ falls into one class of three-variable simple polynomials by virtue of the joint single cancellation axiom.
References
Krantz, D.H. (1968). A survey of measurement theory. In G.B. Danzig & A.F. Veinott (Eds.), Mathematics of the Decision Sciences, part 2 (pp. 314–350). Providence, RI: American Mathematical Society.
Krantz, D.H.; Luce, R.D.; Suppes, P. & Tversky, A. (1971). Foundations of Measurement, Vol. I: Additive and polynomial representations. New York: Academic Press.
Krantz, D.H. & Tversky, A. (1971). Conjoint measurement analysis of composition rules in psychology. Psychological Review, 78, 151–169.
Luce, R.D. & Tukey, J.W. (1964). Simultaneous conjoint measurement: a new scale type of fundamental measurement. Journal of Mathematical Psychology, 1, 1–27.
Tversky, A. (1967). A general theory of polynomial conjoint measurement. Journal of Mathematical Psychology, 4, 1–20.
Polytomous Rasch model
The model

Firstly, let X_{ni} = x ∈ {0, 1, ..., m} be a random variable denoting the integer score of person n on item i, where m is the maximum score for the item. In the polytomous Rasch "Partial Credit" model (Masters, 1982), the probability of the outcome X_{ni} = x is

$$\Pr\{X_{ni} = x\} = \frac{\exp\left(x\beta_n - \sum_{k=1}^{x} \delta_{ik}\right)}{\sum_{j=0}^{m} \exp\left(j\beta_n - \sum_{k=1}^{j} \delta_{ik}\right)}$$

where β_n is the location of person n on a latent continuum, δ_{ik} is the kth threshold of item i on that continuum, and the empty sum for j = 0 is taken to be zero. These equations are the same as

$$\Pr\{X_{ni} = x\} = \frac{\exp\left(x(\beta_n - \delta_i) - \sum_{k=1}^{x} \tau_k\right)}{\sum_{j=0}^{m} \exp\left(j(\beta_n - \delta_i) - \sum_{k=1}^{j} \tau_k\right)}$$

where the thresholds are resolved as δ_{ik} = δ_i + τ_k, in which τ_k is the kth threshold of the rating scale, which is in common to all the items.
Applied in a given empirical context, the model can be considered a mathematical hypothesis that the probability of a given outcome is a probabilistic function of these person and item parameters. The graph showing the relation between the probability of a given category and person location is referred to as a Category Probability Curve (CPC). An example of the CPCs for an item with five categories, scored from 0 to 4, is shown in Figure 1.

[Figure 1: Rasch category probability curves for an item with five ordered categories]

A given threshold partitions the continuum into regions above and below its location. The threshold corresponds with the location on a latent continuum at which it is equally likely a person will be classified into adjacent categories, and therefore equally likely to obtain one of two successive scores. The first threshold of item i, δ_{i1}, is the location on the continuum at which a person is equally likely to obtain a score of 0 or 1, the second threshold is the location at which a person is equally likely to obtain a score of 1 or 2, and so on. In the example shown in Figure 1, the threshold locations are −1.5, −0.5, 0.5, and 1.5 respectively. Respondents may obtain scores in many different ways. For example, where Likert response formats are employed, Strongly Disagree may be assigned 0, Disagree a 1, Agree a 2, and Strongly Agree a 3. In the context of assessment in educational psychology, successively higher integer scores may be awarded according to explicit criteria or descriptions which characterise increasing levels of attainment in a specific domain, such as reading comprehension. The common and central feature is that some process must result in classification of each individual into one of a set of ordered categories that collectively comprise an assessment item.
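The category probabilities behind such curves follow directly from the model equations above. The Python sketch below (illustrative only) evaluates them for the threshold values quoted for Figure 1.

```python
import numpy as np

def pcm_probs(beta, thresholds):
    """Category probabilities for one item under the partial credit
    parameterisation: Pr{X = x} is proportional to
    exp(sum over k <= x of (beta - delta_k))."""
    delta = np.asarray(thresholds, float)
    cum = np.concatenate(([0.0], np.cumsum(beta - delta)))  # x = 0..m
    p = np.exp(cum - cum.max())  # subtract max for numerical stability
    return p / p.sum()

# Thresholds as in Figure 1: -1.5, -0.5, 0.5, 1.5 (five categories, 0-4)
for beta in (-2.0, 0.0, 2.0):
    print(beta, np.round(pcm_probs(beta, [-1.5, -0.5, 0.5, 1.5]), 3))
# at beta = 0 the distribution is symmetric about the middle score
```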
Let Y_{nik} (k = 1, 2, ..., m) be a set of independent dichotomous random variables, one for each threshold of item i. Andrich (1978, 2005) shows that the polytomous Rasch model requires that these dichotomous responses conform with a latent Guttman response subspace:

{(y_1, y_2, ..., y_m) : y_k = 1 for k ≤ x, and y_k = 0 for k > x}

in which x ones are followed by m − x zeros. For example, in the case of two thresholds, the permissible patterns in this response subspace are:

{0, 0}, with implied integer score x = 0
{1, 0}, with implied integer score x = 1
{1, 1}, with implied integer score x = 2

where the integer score x implied by each pattern (and vice versa) is as shown. The reason this subspace is implied by the model is as follows. Let

$$p_{nik} = \Pr\{Y_{nik} = 1\} = \frac{\exp(\beta_n - \delta_{ik})}{1 + \exp(\beta_n - \delta_{ik})}$$

be the dichotomous Rasch model for threshold k of item i. Next, consider the following conditional probability in the case of two thresholds:

$$\Pr\{X_{ni} = 1 \mid (Y_{ni1}, Y_{ni2}) \in \{(1, 0), (0, 1)\}\} = \frac{p_{ni1}(1 - p_{ni2})}{p_{ni1}(1 - p_{ni2}) + (1 - p_{ni1})\, p_{ni2}}$$

From these equations, it can be seen that the probability in this example is conditional on response patterns of {1, 0} or {0, 1}. It is therefore evident that, in general, the response subspace defined earlier is intrinsic to the structure of the polytomous Rasch model. This restriction on the subspace is necessary to the justification for integer scoring of responses: i.e. such that the score is simply the count of ordered thresholds surpassed. Andrich (1978) showed that equal discrimination at each of the thresholds is also necessary to this justification. In the polytomous Rasch model, a score of x on a given item implies that an individual has simultaneously surpassed x thresholds below a certain region on the continuum, and failed to surpass the remaining m − x thresholds above that region. In order for this to be possible, the thresholds must be in their natural order, as shown in the example of Figure 1. Disordered threshold estimates indicate a failure to construct an assessment context in which classifications represented by successive scores reflect increasing levels of the latent trait. For example, consider a situation in which there are two thresholds, and in which the estimate of the second threshold is lower on the continuum than the estimate of the first threshold. If the locations are taken literally, classification of a person into category 1 implies that the person's location simultaneously surpasses the second threshold but fails to surpass the first threshold. In turn, this implies a response pattern {0, 1}, a pattern which does not belong to the subspace of patterns that is intrinsic to the structure of the model, as described above.
When threshold estimates are disordered, the estimates cannot therefore be taken literally; rather, the disordering in itself inherently indicates that the classifications do not satisfy criteria that must logically be satisfied in order to justify the use of successive integer scores as a basis for measurement. To emphasise this point, Andrich (2005) uses an example in which grades of fail, pass, credit, and distinction are awarded. These grades, or classifications, are usually intended to represent increasing levels of attainment. Consider a person A, whose location on the latent continuum is at the threshold between regions on the continuum at which a pass and credit are most likely to be awarded. Consider also another person B, whose location is at the threshold between the regions at which a credit and distinction are most likely to be awarded. In the example considered by Andrich (2005, p. 25), disordered thresholds would, if taken literally, imply that the location of person A (at the pass/credit threshold) is higher than that of person B (at the credit/distinction threshold). That is, taken literally, the disordered threshold locations would imply that a person would need to demonstrate a higher level of attainment to be at the pass/credit threshold than would be needed to be at the credit/distinction threshold. Clearly, this disagrees with the intent of such a grading system. The disordering of the thresholds would, therefore, indicate that the manner in which grades are being awarded is not in agreement with the intention of the grading system. That is, the disordering would indicate that the hypothesis implicit in the grading system (that grades represent ordered classifications of increasing performance) is not substantiated by the structure of the empirical data.
References
Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69–81.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D. (2005). The Rasch model explained. In Sivakumar Alagumalai, David D. Curtis, and Njora Hungi (Eds.), Applied Rasch Measurement: A book of exemplars. Springer-Kluwer. Chapter 3, 308–328.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
Wright, B.D. & Masters, G.N. (1982). Rating Scale Analysis. Chicago: MESA Press. (Available from the Institute for Objective Measurement.)
External links
Disordered thresholds and item information[1]
Category Disordering and Threshold Disordering[2]
Andrich on disordered thresholds and 'steps'[3]
Directory of Rasch Software - freeware and paid[4]
Institute for Objective Measurement[5]
Rasch analysis[6]
Rasch Model in Stata[7]
References
[1] http://www.rasch.org/rmt/rmt202a.htm
[2] http://www.rasch.org/rmt/rmt131a.htm
[3] http://www.rasch.org/rmt/rmt1239.htm
[4] http://www.rasch.org/software.htm
[5] http://www.rasch.org/
[6] http://www.rasch-analysis.com/
[7] http://www.stata.com/support/faqs/stat/rasch.html
Progress testing
Progress tests are longitudinal, feedback-oriented educational assessment tools for the evaluation of the development and sustainability of cognitive knowledge during a learning process. A progress test is a written knowledge exam (usually involving multiple choice questions) that is usually administered to all students in a program at the same time and at regular intervals (usually twice to four times yearly) throughout the entire academic program. The test samples the complete knowledge domain expected of new graduates on completion of their course, regardless of the year level of the student. The differences between students' knowledge levels show in the test scores; the further a student has progressed in the curriculum, the higher the scores. As a result, these scores provide a longitudinal, repeated-measures, curriculum-independent assessment of the objectives (in knowledge) of the entire programme.[1]
History
Since its inception in the late 1970s at both Maastricht University [1] and the University of Missouri–Kansas City [2] independently, the progress test of applied knowledge has been increasingly used in medical and health sciences programs across the globe. Progress tests are well established in both undergraduate and postgraduate medical education. They are used formatively and summatively.
Additionally, the longitudinal data can serve as a transparent quality assurance measure for program reviews by providing an evaluation of the extent to which a school is meeting its curriculum objectives.[1][10][25] The test also provides more reliable data for high-stakes assessment decisions by using measures of continuous learning rather than a one-shot method (Schuwirth, 2007). Inter-university progress testing collaborations provide a means of improving the cost-effectiveness of assessments by sharing a larger pool of items, item writers, reviewers, and administrators. The collaborative approach adopted by the Dutch and other consortia has enabled the progress test to become a benchmarking instrument by which to measure the quality of educational outcomes in knowledge. The success of the progress test in these ways has led to consideration of developing an international progress test.[25][26] The benefits for all main stakeholders in a medical or health sciences programme make the progress test an appealing tool in which to invest resources and time as part of an assessment regime. This attractiveness is demonstrated by its increasingly widespread use in individual medical education institutions and inter-faculty consortia around the world, and by its use for national and international benchmarking practices.
Advantages
Progress tests provide a rich source of information: the comprehensive nature in combination with the cross-sectional and longitudinal design offers a wealth of information both for individual learners and for curriculum evaluations.[1]
Progress testing fosters knowledge retention: the repeated testing of the same comprehensive domain of knowledge means that there is no point in testing facts that could be remembered only if studied the night before. Long-term knowledge and knowledge retention are fostered because item content remains relevant long after the knowledge has been learned.
Progress testing removes the need for resit examinations: every new test occasion is a renewed opportunity to demonstrate growth of knowledge.
Progress testing allows early detection of high achievers: some learners perform far beyond the expected level of their phase in training (e.g. they might have had relevant previous training) and, depending on their performance, individual and faster pathways through the curriculum could be offered.
Progress testing brings stability in assessment procedures: changes in curriculum or content have no consequence for the progress test, provided the end outcomes are unchanged.
Progress testing provides excellent benchmarking opportunities: progress tests are not limited to a single school nor to PBL curricula, and evaluations can easily be done to compare graduates and the effectiveness of different curriculum approaches.
Disadvantages
Naturally, there are disadvantages. The resources required for test development and scoring, and the need for a central organization, are two very important ones. Scoring,[27] psychometric procedures[28] for reducing test difficulty variation, and standard-setting procedures[29] are more complex in progress testing. Finally, progress tests do not work in heterogeneous programs with early specialization (as in many health sciences programs). In more homogeneous programs, such as most medical programs, they work well and pay off in driving learning and use of resources.
References
[1] van der Vleuten CPM, Verwijnen GM, Wijnen WHFW. 1996. Fifteen years of experience with progress testing in a problem-based learning curriculum. Medical Teacher 18(2):103–110.
[2] Arnold L, Willoughby TL. 1990. The quarterly profile examination. Academic Medicine 65(8):515–516.
[3] Swanson, D. B., Holtzman, K. Z., Butler, A., Langer, M. M., Nelson, M. V., Chow, J. W. M., et al. (2010). Collaboration across the pond: The multi-school progress testing project. Medical Teacher, 32, 480–485.
[4] Schuwirth, L., Bosman, G., Henning, R. H., Rinkel, R., & Wenink, A. C. G. (2010). Collaboration on progress testing in medical schools in the Netherlands. Medical Teacher, 32, 476–479.
[5] Nouns, Z. M., & Georg, W. (2010). Progress testing in German speaking countries. Medical Teacher, 32, 467–470.
[6] Aarts, R., Steidel, K., Manuel, B. A. F., & Driessen, E. W. (2010). Progress testing in resource-poor countries: A case from Mozambique. Medical Teacher, 32, 461–463.
[7] Al Alwan, I., Al-Moamary, M., Al-Attas, N., Al Kushi, A., ALBanyan, E., Zamakhshary, M., et al. (2011). The progress test as a diagnostic tool for a new PBL curriculum. Education for Health (December, Article No. 493).
[8] Mardiastuti, H. W., & Werdhani, R. A. (2011). Grade point average, progress test, and try-out tests as tools for curriculum evaluation and graduates' performance prediction at the national board examination. Journal of Medicine and Medical Sciences, 2(12), 1302–1305.
[9] Freeman, A., van der Vleuten, C., Nouns, Z., & Ricketts, C. (2010). Progress testing internationally. Medical Teacher, 32, 451–455.
[10] De Champlain, A., Cuddy, M. M., Scoles, P. V., Brown, M., Swanson, D. B., Holtzman, K., et al. (2010). Progress testing in clinical science education: Results of a pilot project between the National Board of Medical Examiners and a US medical school. Medical Teacher, 32, 503–508.
[11] International Foundations of Medicine (2011). Retrieved 20 July 2011, from http://www.nbme.org/Schools/iFoM/index.html
[12] Finucane, P., Flannery, D., Keane, D., & Norman, G. (2010). Cross-institutional progress testing: Feasibility and value to a new medical school. Medical Education, 44, 184–186.
[13] Albano, M. G., Cavallo, F., Hoogenboom, R., Magni, F., Majoor, G., Manenti, F., et al. (1996). An international comparison of knowledge levels of medical students: The Maastricht progress test. Medical Education, 30, 239–245.
[14] International Partnership for Progress Testing (2011). Retrieved 18 July 2011, from http://ipptx.org/
[15] Bennett, J., Freeman, A., Coombes, L., Kay, L., & Ricketts, C. (2010). Adaptation of medical progress testing to a dental setting. Medical Teacher, 32, 500–502.
[16] Boshuizen, H. P. A., van der Vleuten, C. P. M., Schmidt, H., & Machiels-Bongaerts, M. (1997). Measuring knowledge and clinical reasoning skills in a problem-based curriculum. Medical Education, 31, 115–121.
[17] Coombes, L., Ricketts, C., Freeman, A., & Stratford, J. (2010). Beyond assessment: Feedback for individuals and institutions based on the progress test. Medical Teacher, 32, 486–490.
[18] Dijksterhuis, M. G. K., Scheele, F., Schuwirth, L. W. T., Essed, G. G. M., & Nijhuis, J. G. (2009). Progress testing in postgraduate medical education. Medical Teacher, 31, e464–e468.
[19] Freeman, A., & Ricketts, C. (2010). Choosing and designing knowledge assessments: Experience at a new medical school. Medical Teacher, 32, 578–581.
[20] Schaap, L., Schmidt, H., & Verkoeijen, P. J. L. (2011). Assessing knowledge growth in a psychology curriculum: which students improve most? Assessment & Evaluation in Higher Education, 1–13.
[21] van der Vleuten, C. P. M., Verwijnen, G. M., & Wijnen, W. H. F. W. (1996). Fifteen years of experience with progress testing in a problem-based learning curriculum. Medical Teacher, 18(2), 103–109.
[22] van Diest, R., van Dalen, J., Bak, M., Schruers, K., van der Vleuten, C., Muijtjens, A. M. M., et al. (2004). Growth of knowledge in psychiatry and behavioural sciences in a problem-based learning curriculum. Medical Education, 38, 1295–1301.
[23] Verhoeven, B. H., Verwijnen, G. M., Scherpbier, A. J. J. A., & van der Vleuten, C. P. M. (2002). Growth of medical knowledge. Medical Education, 36, 711–717.
[24] Muijtjens, A. M. M., Timmermans, I., Donkers, J., Peperkamp, R., Medema, H., Cohen-Schotanus, J., et al. (2010). Flexible electronic feedback using the virtues of progress testing. Medical Teacher, 32, 491–495.
[25] Verhoeven, B. H., Snellen-Balendong, H. A. M., Hay, I. T., Boon, J. M., Van Der Linde, M. J., Blitz-Lindeque, J. J., et al. (2005). The versatility of progress testing assessed in an international context: a start for benchmarking global standardization? Medical Teacher, 27(6), 514–520.
[26] Schauber, S., & Nouns, Z. B. (2010). Using the cumulative deviation method for cross-institutional benchmarking in the Berlin progress test. Medical Teacher, 32, 471–475.
[27] Muijtjens AM, Mameren HV, Hoogenboom RJ, Evers JL, van der Vleuten CP. 1999. The effect of a don't know option on test scores: Number-right and formula scoring compared. Medical Education 33(4):267–275.
[28] Shen L. 2000. Progress testing for postgraduate medical education: A four year experiment of American College of Osteopathic Surgeons Resident Examinations. Advances in Health Sciences Education: Theory and Practice 5(2):117–129.
[29] Verhoeven BH, Snellen-Balendong HA, Hay IT, Boon JM, van der Linde MJ, Blitz-Lindeque JJ, Hoogenboom RJI, Verwijnen GM, Wijnen WHFW, Scherpbier AJJA, et al. 2005. The versatility of progress testing assessed in an international context: A start for benchmarking global standardization? Medical Teacher 27(6):514–520.
External links
Progress test Medicine, Universitätsmedizin Berlin (http://ptm.charite.de/en/)
Interuniversity Progress Test Medicine, the Netherlands (http://www.ivtg.nl/en/node/69)
Academic Medicine (http://journals.lww.com/academicmedicine/pages/default.aspx) (subscription)
Advances in Health Sciences Education (http://www.springer.com/education/journal/10459) (subscription)
Medical Education (http://www.mededuc.com/) (subscription)
Medical Teacher (http://www.medicalteacher.org/) (subscription)
Projective test
Projective tests are classified in the MeSH diagnostics vocabulary under D011386.[1]
In psychology, a projective test is a personality test designed to let a person respond to ambiguous stimuli, presumably revealing hidden emotions and internal conflicts. This is sometimes contrasted with a so-called "objective test", in which responses are analyzed according to a universal standard (for example, a multiple choice exam). The responses to projective tests are content-analyzed for meaning rather than being based on presuppositions about meaning, as is the case with objective tests. Projective tests have their origins in psychoanalytic psychology, which argues that humans have conscious and unconscious attitudes and motivations that are beyond or hidden from conscious awareness.
Theory
The general theoretical position behind projective tests is that whenever a specific question is asked, the response will be consciously formulated and socially determined. These responses do not reflect the respondent's unconscious or implicit attitudes or motivations. The respondent's deep-seated motivations may not be consciously recognized by the respondent, or the respondent may not be able to verbally express them in the form demanded by the questioner. Advocates of projective tests stress that the ambiguity of the stimuli presented within the tests allows subjects to express thoughts that originate on a deeper level than tapped by explicit questions. Projective tests lost some of their popularity during the 1980s and 1990s, in part because of the overall loss of popularity of the psychoanalytic method and theories. Despite this, they are still used quite frequently.
Projective Hypothesis
This holds that an individual puts structure on an ambiguous situation in a way that is consistent with their own conscious and unconscious needs. It is an indirect method: the testee is talking about something other than himself or herself. As a result, the approach:
reduces the temptation to fake;
does not depend as much on verbal abilities;
taps both conscious and unconscious traits.
The focus is a clinical perspective rather than a normative one, although norms have been developed over the years.[2]
Common variants
Rorschach
The best known and most frequently used projective test is the Rorschach inkblot test, in which a subject is shown a series of ten irregular but symmetrical inkblots and asked to explain what they see. The subject's responses are then analyzed in various ways, noting not only what was said, but the time taken to respond, which aspect of the drawing was focused on, and how single responses compared to other responses for the same drawing. For example, if someone consistently sees the images as threatening and frightening, the tester might infer that the subject may suffer from paranoia.
Draw-A-Person test
The Draw-A-Person test requires the subject to draw a person. The results are based on a psychodynamic interpretation of the details of the drawing, such as the size, shape and complexity of the facial features, clothing and background of the figure. As with other projective tests, the approach has very little demonstrated validity and there is evidence that therapists may attribute pathology to individuals who are merely poor artists. A similar class of techniques is kinetic family drawing.

Criticisms of Drawing Tests
Among the plausible but empirically untrue relations that have been claimed:
Large size = emotional expansiveness or acting out
Small size = emotional constriction, withdrawal, or timidity
Erasures around male buttocks; long eyelashes on males = homoeroticism
Overworked lines = tension, aggression
Distorted or omitted features = conflicts related to that feature
Large or elaborate eyes = paranoia[4]
Other variants include the TEMAS (Tell-Me-A-Story) test, developed for Hispanic children, and the Make-A-Picture Story test, in which respondents from age 6 up construct their own pictures from figures.[2]
Graphology
A lesser-known projective test is graphology or handwriting analysis. Clinicians who assess handwriting to derive tentative information about the writer's personality attend to and analyze the writing's organization on the page, movement style and use of distinct letterforms.[9]
Statistical debate
From the perspective of statistical validity, psychometrics and positivism, criticisms of projective tests and depth psychology tests usually centre on the well-known discrepancy between statistical validity and clinical validity:[10] the tests rely heavily on clinical judgement, lack statistical reliability and statistical validity, and many have no standardized criteria to which results may be compared; however, this is not always the case. These tests are used frequently, though the scientific evidence is sometimes debated. There have been many empirical studies based on projective tests (including the use of standardized norms and samples), particularly more established tests. The combination of criticism for lack of scientific evidence and continued popularity has been referred to as the "projective paradox". Responding to the statistical criticism of his projective test, Leopold Szondi said that his test actually discovers "fate and existential possibilities hidden in the inherited familial unconscious and the personal unconscious, even those hidden because never lived through or because have been rejected. Is any statistical method able to span, understand and integrate mathematically all these possibilities? I deny this categorically."[11]
Situation Variables
Age of examiner
Specific instructions
Subtle reinforcement cues
Setting and privacy[12]
Terminology
The terms "objective test" and "projective test" have recently come under criticism in the Journal of Personality Assessment. The more descriptive "rating scale or self-report measures" and "free response measures" are suggested, rather than the terms "objective tests" and "projective tests," respectively.[13]
Uses in marketing
Projective techniques, including TATs, are used in qualitative marketing research, for example to help identify potential associations between brand images and the emotions they may provoke. In advertising, projective tests are used to evaluate responses to advertisements. The tests have also been used in management to assess achievement motivation and other drives, in sociology to assess the adoption of innovations, and in anthropology to study cultural meaning. The application of responses is different in these disciplines than in psychology, because the responses of multiple respondents are grouped together for analysis by the organisation commissioning the research, rather than interpreting the meaning of the responses given by a single subject.
References
[1] http://www.nlm.nih.gov/cgi/mesh/2011/MB_cgi?field=uid&term=D011386
[2] Projective Methods for Personality Assessment. (n.d.). Retrieved November 21, 2012, from http://www.neiu.edu/~mecondon/proj-lec.htm
[3] Gamble, K. R. (1972). The Holtzman inkblot technique. Psychological Bulletin, 77(3), 172–194.
[4] Projective Tests. (n.d.). Retrieved November 21, 2012, from http://web.psych.ualberta.ca/~chrisw/L12ProjectiveTests/L12ProjectiveTests.pdf
[5] Piotrowski, Z. (1958). The Tomkins-Horn Picture Arrangement Test. The Journal of Nervous and Mental Disease, 126(1), 106.
[6] Merriam-Webster. (n.d.). Retrieved November 21, 2012, from http://www.merriam-webster.com/dictionary/word-association%20test
[7] Spiteri, S. P. (n.d.). "Word association testing and thesaurus construction." Retrieved November 21, 2012, from Dalhousie University, School of Library and Information Studies website: http://libres.curtin.edu.au/libres14n2/Spiteri_final.htm
[8] Schultz, D. P., & Schultz, S. E. (2000). The history of modern psychology. Seventh edition. Harcourt College Publishers.
[9] Poizner, Annette (2012). Clinical Graphology: An Interpretive Manual for Mental Health Practitioners. Springfield, Illinois: Charles C Thomas Publishers.
[10] Leopold Szondi (1960). Das zweite Buch: Lehrbuch der Experimentellen Triebdiagnostik. Huber, Bern und Stuttgart, 2nd edition. Ch. 27, from the Spanish translation, B) II Las condiciones estadisticas, p. 396.
[11] Szondi (1960). Das zweite Buch: Lehrbuch der Experimentellen Triebdiagnostik. Huber, Bern und Stuttgart, 2nd edition. Ch. 27, from the Spanish translation, B) II Las condiciones estadisticas, p. 396.
[12] Shatz, Phillip. (n.d.). "Projective personality testing: Psychological testing." Retrieved November 21, 2012, from Saint Joseph's University, Department of Psychology website: http://schatz.sju.edu/intro/1001lowfi/personality/projectiveppt/sld001.htm
[13] Meyer, Gregory J. and Kurtz, John E. (2006). 'Advancing Personality Assessment Terminology: Time to Retire "Objective" and "Projective" As Personality Test Descriptors', Journal of Personality Assessment, 87(3), 223–225.
Footnotes
Theodor W. Adorno, et al. (1964). The Authoritarian Personality. New York: John Wiley & Sons. Lawrence Soley & Aaron Lee Smith (2008). Projective Techniques for Social Science and Business Research. Milwaukee: The Southshore Press.
Prometric
Type: Subsidiary
Founded: 1990
Website: prometric.com[1]
Prometric is a U.S. company in the test administration industry. Prometric operates a test center network composed of over 10,000 sites in 160 countries. Many examinations are administered at Prometric sites including those from Nationwide Mortgage Licensing System and Registry, Microsoft, IBM, Apple, the Common Admission Test (CAT) of the IIMs, the European Personnel Selection Office, the Medical College Admission Test, USMLE, the Diplomate of National Board-Common Entrance Test of National Board of Examinations, the Uniform Certified Public Accountant Examination, Architect Registration Examination, and the USPTO registration examination. Prometric's corporate headquarters are located in Canton (Baltimore, Maryland) in the United States.
History
Prometric's computerized testing centers were originally founded by Drake International in 1990 under the name Drake Prometric.[2] In 1995, Drake Prometric L.P. was sold to Sylvan Learning in a cash and stock deal worth approximately $44.5 million.[3] The acquired business was renamed Sylvan Prometric, then sold to Thomson Corporation in 2000.[4] The Thomson Corporation announced its desire to sell Prometric in the fall of 2006, and Educational Testing Service announced its plans to acquire it.[5] On Monday, October 15, 2007, Educational Testing Service (ETS) closed its acquisition of Prometric from the Thomson Corporation.[6] Prometric is currently a wholly owned, independently operated subsidiary of ETS, allowing ETS to maintain non-profit status.
Business
Prometric sells a range of services, including test development, test delivery, and data management capabilities. Prometric delivers and administers tests to approximately 500 clients in the academic, professional, government, corporate and information technology markets. While there are 3,000 Prometric test centers across the world,[7] including every U.S. state and territory (except Wake Island), whether a particular test can be taken outside the U.S. depends on the testing provider. For example, despite the fact that Prometric test centers exist worldwide, some exams are only offered in the country where the client program exists. The locations where a test is offered, as well as specific testing procedures for the day of the exam, are dictated by the client. In 2009, the company was involved in a controversy due to widespread technical problems on one of India's MBA entrance exams, the Common Admission Test.[8] While Prometric claimed that the problems were due to common viruses,[9] this claim was disputed because the tests were not internet-based but were instead delivered over local area networks within India, where the viruses were already present.[10] Due to this controversy, Prometric allowed 8,000 students to re-sit the examination.[11]
International
In the Republic of Ireland, Prometric's local subsidiary is responsible for administering the Driver Theory Test.[12]
References
[1] http://www.prometric.com
[2] Drake International early years (http://celebratewithdrake.com/entrepreneurial)
[3] Sylvan to acquire test firm (http://articles.baltimoresun.com/1995-07-22/business/1995203052_1_sylvan-drake-financial-targets)
[4] Thomson Acquires Prometric (http://www.encyclopedia.com/doc/1G1-58958755.html)
[5] ETS news: ETS to Acquire Prometric (http://www.etsemea-customassessments.org/cas-en/media/press-releases/ets-to-acquire-thomson-prometric/)
[6] http://thomsonreuters.com/content/press_room/corp/corp_news/217831
[7] QAI India Ltd Announces A Partnership with Prometric (http://www.newswiretoday.com/news/38336/)
[8] Online CAT Puts Prometric in Mousetrap (http://news.ciol.com/News/News-Reports/Online-CAT-puts-Prometric-in-mousetrap/301109128324/0/)
[9] Times of India - Viruses Cause CAT Failure (http://timesofindia.indiatimes.com/india/IIM-A-names-2-viruses-that-caused-CAT-chaos/articleshow/5286411.cms)
[10] CAT Server Crash: Prometric's Virus Theory Rubbished (http://businesstechnology.in/tools/news/2009/11/30/CAT-server-crash-Prometric-s-virus-theory-rubbished.html)
[11] Retest for 8000 students (http://www.catiim.in/notice_17122009.html)
[12] http://www.theorytest.ie/
External links
Prometric website (http://www.prometric.com/)
Psychological statistics
Psychological statistics is the application of statistics to psychology. Some of the more common applications include:
1. psychometrics
2. learning theory
3. perception
4. human development
5. abnormal psychology
6. personality tests
7. psychological tests
Some of the more commonly used statistical tests in psychology are:

Parametric tests:
Student's t-test
analysis of variance (ANOVA)
ANCOVA (analysis of covariance)
MANOVA (multivariate analysis of variance)
regression analysis
linear regression
hierarchical linear modelling
correlation
Pearson product-moment correlation coefficient
Spearman's rank correlation coefficient

Non-parametric tests:
chi-square
Mann-Whitney U
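For illustration, here is a minimal sketch of how a few of these tests can be run in Python with the scipy.stats library; the score arrays below are invented for the example:

    import numpy as np
    from scipy import stats

    # Invented example data: test scores for two groups of participants.
    group_a = np.array([12, 15, 14, 10, 13, 16, 11, 14])
    group_b = np.array([9, 11, 10, 8, 12, 10, 9, 11])

    # Student's t-test (parametric): do the group means differ?
    t_stat, t_p = stats.ttest_ind(group_a, group_b)

    # Pearson product-moment correlation between two paired variables.
    r, r_p = stats.pearsonr(group_a, group_b)

    # Spearman's rank correlation (non-parametric alternative).
    rho, rho_p = stats.spearmanr(group_a, group_b)

    # Mann-Whitney U (non-parametric test of location).
    u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

    print(t_stat, t_p, r, rho, u_stat)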
External links
Charles McCreery's tutorials on chi-square, probability and Bayes' theorem for Oxford University psychology students [1]
Matthew Rockloff's tutorials on t-tests, correlation and ANOVA [2]
References
[1] http://www.celiagreen.com/charlesmccreery.html
[2] http://psychologyaustralia.homestead.com/index.htm
Psychometric function
A psychometric function describes the relationship between a parameter of a physical stimulus and the subjective responses of the subject. The psychometric function is a special case of the generalized linear model (GLM): the probability of a response is related to a linear combination of predictors by means of a sigmoid link function (e.g. probit, logit, etc.). Depending on the number of alternative choices, psychophysical experimental paradigms are classified as simple forced choice (also known as the yes-no task), two-alternative forced choice (2AFC), and n-alternative forced choice. The number of alternatives in the experiment determines the lower asymptote of the function.

Two different types of psychometric plots are in common use. The first plots the percentage of correct responses (or a similar value) on the y-axis and the physical parameter on the x-axis. If the stimulus parameter is very far towards one end of its possible range, the person will always be able to respond correctly. Towards the other end of the range, the person never perceives the stimulus properly and therefore the probability of correct responses is at chance level. In between, there is a transition range where the subject has an above-chance rate of correct responses but does not always respond correctly. The inflection point of the sigmoid function, or the point at which the function reaches the middle between the chance level and 100%, is usually taken as the sensory threshold.

The second type plots the proportion of "yes" responses on the y-axis, and therefore has a sigmoidal shape covering the range [0, 1], rather than merely [0.5, 1], as we move from the subject being certain that the stimulus was not of the particular type requested to certainty that it was. This second way of plotting psychometric functions is often preferable, as it is more easily amenable to principled quantitative analysis using tools such as probit analysis (fitting of cumulative Gaussian distributions). However, it also has important drawbacks. First, the threshold estimation is based only on p(yes), that is, on "hits" in signal detection theory terminology. Second, and consequently, it is not bias-free or criterion-free. Third, the threshold is identified with p(yes) = .5, which is just a conventional and arbitrary choice.
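As a sketch of these ideas, the following Python function implements a generic logistic psychometric function whose lower asymptote is set by the number of alternatives (1/n for an n-alternative forced choice task); the parameter names are our own invention, not from any particular library:

    import numpy as np

    def psychometric(x, threshold, slope, n_alternatives=2, lapse=0.0):
        """Probability of a correct response at stimulus level x.

        The lower asymptote is the chance level 1/n_alternatives;
        'threshold' is the stimulus level at the curve's midpoint and
        'slope' controls how steep the transition range is.
        """
        chance = 1.0 / n_alternatives
        core = 1.0 / (1.0 + np.exp(-slope * (x - threshold)))  # sigmoid link
        return chance + (1.0 - chance - lapse) * core

    # Example: in a 2AFC task performance rises from 0.5 towards 1.0.
    levels = np.linspace(-3, 3, 7)
    print(psychometric(levels, threshold=0.0, slope=2.0))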
A common example is visual acuity testing with an eye chart. The person sees symbols of different sizes (the size is the relevant physical stimulus parameter) and has to identify each symbol. Usually, there is one line on the chart at which a subject can identify some, but not all, of the symbols. This corresponds to the transition range of the psychometric function, and the sensory threshold corresponds to visual acuity. (Strictly speaking, a typical optometric measurement does not exactly yield the sensory threshold, due to biases in the standard procedure.)
Psychometrics of racism
Psychometrics of racism is an emerging field that aims to measure the incidence and impacts of racism on the psychological well-being of people of all races. At present, there are few instruments that attempt to capture the experience of racism in all of its complexity.[1]
Self-report inventories
The Schedule of Racist Events (SRE) is a questionnaire for assessing the frequency of racial discrimination in the lives of African Americans, created in 1998 by Hope Landrine and Elizabeth A. Klonoff. The SRE is an 18-item self-report inventory that assesses the frequency of specific racist events in the past year and over one's entire life, and measures the extent to which this discrimination was stressful.[2]

Other psychometric tools for assessing the impacts of racism include:[3]
The Racism Reaction Scale (RRS)
Perceived Racism Scale (PRS)
Index of Race-Related Stress (IRRS)
Racism and Life Experience Scale-Brief Version (RaLES-B)
Telephone-Administered Perceived Racism Scale (TPRS)[4]
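Purely as an illustration of how a self-report inventory of this general kind might be scored, here is a small Python sketch; the item structure, field names, and response ranges are invented for the example and are not the published SRE scoring key:

    # Hypothetical scoring sketch: each of 18 items carries a frequency
    # rating for the past year, a rating for one's entire life, and a
    # stressfulness appraisal; subscale scores are simple sums.
    def score_inventory(responses):
        recent = sum(item["past_year"] for item in responses)
        lifetime = sum(item["entire_life"] for item in responses)
        stress = sum(item["stressful"] for item in responses)
        return {"recent": recent, "lifetime": lifetime,
                "appraised_stress": stress}

    example = [{"past_year": 2, "entire_life": 4, "stressful": 3}] * 18
    print(score_inventory(example))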
Physiological metrics
In a summary of recent research, Jules P. Harrell, Sadiki Hall, and James Taliaferro describe how a growing body of research has explored the impact of encounters with racism or discrimination on physiological activity. Several of the studies suggest that higher blood pressure levels are associated with a tendency not to recall or report occurrences identified as racist and discriminatory; in other words, failing to recognize instances of racism is associated with higher blood pressure in the person experiencing the racist event. Investigators have also reported that physiological arousal is associated with laboratory analogues of ethnic discrimination and mistreatment.[5]
References
[1] The perceived racism scale: a multidimensional assessment of the experience of white racism among African Americans. (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8882844&dopt=Citation)
[2] The Schedule of Racist Events: A Measure of Racial Discrimination and a Study of Its Negative Physical and Mental Health Consequences. (http://eric.ed.gov/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel=RecordDetails&ERICExtSearch_SearchValue_0=EJ528856&ERICExtSearch_SearchType_0=eric_accno&objectId=0900000b8002502e)
[3] Assessing the Stressful Effects of Racism: A Review of Instrumentation (http://jbp.sagepub.com/cgi/content/abstract/24/3/269)
[4] Development and Reliability of a Telephone-Administered Perceived Racism Scale (TPRS): A Tool for Epidemiological Use (http://apt.allenpress.com/aptonline/?request=get-abstract&issn=1049-510X&volume=011&issue=02&page=0251)
[5] Physiological Responses to Racism and Discrimination: An Assessment of the Evidence (http://www.ajph.org/cgi/content/abstract/93/2/243)
5. Report writing and presentation.

A brief discussion of these steps:
1. Problem audit and problem definition - What is the problem? What are the various aspects of the problem? What information is needed?
2. Conceptualization and operationalization - How exactly do we define the concepts involved? How do we translate these concepts into observable and measurable behaviours?
3. Hypothesis specification - What claim(s) do we want to test?
4. Research design specification - What type of methodology to use? Examples: questionnaire, survey.
5. Question specification - What questions to ask? In what order?
6. Scale specification - How will preferences be rated?
7. Sampling design specification - What is the total population? What sample size is necessary for this population? What sampling method to use? Examples: probability sampling (cluster sampling, stratified sampling, simple random sampling, multistage sampling, systematic sampling) and nonprobability sampling (convenience sampling, judgement sampling, purposive sampling, quota sampling, snowball sampling, etc.). A sketch contrasting two of these methods follows this list.
8. Data collection - Use mail, telephone, internet, or mall intercepts.
9. Codification and re-specification - Make adjustments to the raw data so they are compatible with statistical techniques and with the objectives of the research. Examples: assigning numbers, consistency checks, substitutions, deletions, weighting, dummy variables, scale transformations, scale standardization.
10. Statistical analysis - Perform various descriptive and inferential techniques (see below) on the raw data. Make inferences from the sample to the whole population. Test the results for statistical significance.
11. Interpret and integrate findings - What do the results mean? What conclusions can be drawn? How do these findings relate to similar research?
12. Write the research report - The report usually has headings such as: 1) executive summary; 2) objectives; 3) methodology; 4) main findings; 5) detailed charts and diagrams. Present the report to the client in a 10-minute presentation, and be prepared for questions.

The design step may involve a pilot study in order to discover any hidden issues. The codification and analysis steps are typically performed by computer, using statistical software. The data collection steps can in some instances be automated, but often require significant manpower to undertake. Interpretation is a skill mastered only by experience.
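As an illustration of the sampling methods named in step 7, here is a minimal Python sketch contrasting simple random sampling with proportionate stratified sampling; the population and its strata are invented for the example:

    import random

    # Toy sampling frame: each record is (id, stratum).
    population = [(i, "urban" if i % 3 else "rural") for i in range(1, 1001)]

    # Simple random sampling: every unit has an equal chance of selection.
    srs = random.sample(population, k=50)

    # Proportionate stratified sampling: sample within each stratum in
    # proportion to its share of the population.
    strata = {}
    for unit in population:
        strata.setdefault(unit[1], []).append(unit)
    stratified = []
    for name, units in strata.items():
        share = round(50 * len(units) / len(population))
        stratified.extend(random.sample(units, k=share))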
Statistical analysis
The data acquired for quantitative marketing research can be analysed by almost any of the range of techniques of statistical analysis, which can be broadly divided into descriptive statistics and statistical inference. An important set of techniques is that related to statistical surveys. In any instance, an appropriate type of statistical analysis should take account of the various types of error that may arise, as outlined below.
Types of errors
Random sampling errors:
sample too small
sample not representative
inappropriate sampling method used
random errors

Research design errors:
bias introduced
measurement error
data analysis error
sampling frame error
population definition error
scaling error
question construction error

Interviewer errors:
recording errors
cheating errors
questioning errors
respondent selection error

Respondent errors:
non-response error
inability error
falsification error

Hypothesis errors:
type I error (also called alpha error) - the study results lead to the rejection of the null hypothesis even though it is actually true
type II error (also called beta error) - the study results lead to the acceptance (non-rejection) of the null hypothesis even though it is actually false
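To make the alpha error concrete, the following small simulation (parameters invented) draws both samples from the same population, so the null hypothesis is true and every rejection is a type I error; the long-run rejection rate should approximate alpha:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, rejections, trials = 0.05, 0, 2000

    # Both samples come from the SAME population, so the null is true.
    for _ in range(trials):
        a = rng.normal(0, 1, 30)
        b = rng.normal(0, 1, 30)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1  # a type I (alpha) error

    print(rejections / trials)  # roughly 0.05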
Quantitative psychology
The American Psychological Association defines quantitative psychology as "the study of methods and techniques for the measurement of human attributes, the statistical and mathematical modeling of psychological processes, the design of research studies, and the analysis of psychological data".[1] Quantitative psychology specializes in the measurement, methodology, research design, and analysis relevant to data in the social sciences.[2][3]

Research in quantitative psychology develops psychological theory in relation to mathematics and statistics. Psychological research requires the elaboration of existing methods and the development of new concepts, so that quantitative psychology involves more than mere "applications" of statistics and mathematics.[1]

Quantitative psychology has two major subfields, psychometrics and mathematical psychology. Research in psychometrics develops methods of practice and analysis of psychological measurement, for example, developing a questionnaire to test memory and methods of analyzing data from that questionnaire.[4] Research in mathematical psychology develops novel mathematical models that describe psychological processes.[5]

Quantitative psychology is served by several scientific organizations. These include the Psychometric Society, Division 5 of the American Psychological Association (Evaluation, Measurement and Statistics), the Society of Multivariate Experimental Psychology, and the European Society for Methodology. Associated disciplines include statistics, mathematics, educational measurement, educational statistics, sociology, and political science. Several scholarly journals reflect the efforts of scientists in these areas, notably Psychometrika, Multivariate Behavioral Research, Structural Equation Modeling and Psychological Methods.

In August 2005, the APA expressed the need for more quantitative psychologists in the industry: for every PhD awarded in the subject, there were about 2.5 quantitative psychologist position openings.[6] Currently, 23 American universities offer Ph.D. programs in quantitative psychology within their psychology departments (and additional universities offer programs that focus on but do not necessarily encompass the field).[7] There is also a comparable
References
[1] Quantitative Psychology (http://www.apa.org/research/tools/quantitative/index.aspx)
[2] Quantitative Psychology, UCLA Psychology Department: Home (http://www.psych.ucla.edu/graduate/areas-of-study/quantitative-psychology)
[3] Quantitative Psychology For Measuring The Human Attributes (http://www.psychoid.net/quantitative-psychology-for-measuring-the.html)
[4] Psychometrics
[5] Mathematical Psychology
[6] Report of the Task Force for Increasing the Number of Quantitative Psychologists (http://www.apa.org/research/tools/quantitative/quant-task-force-report.pdf), page 1. American Psychological Association. Retrieved February 15, 2012.
[7] Introduction to Quantitative Psychology (http://www.apa.org/research/tools/quantitative/index.aspx#review), page 2. American Psychological Association. Retrieved February 15, 2012.
[8] Graduate Studies in Psychology (http://www.apa.org/pubs/books/4270096.aspx)
External links
APA Division 5: Evaluation, Measurement and Statistics (http://www.apa.org/divisions/div5/) The Psychometric Society (http://www.psychometrika.org/) The Society of Multivariate Experimental Psychology (http://www.smep.org/) The European Society for Methodology (http://www.smabs.org/) Society for Mathematical Psychology (http://www.cogs.indiana.edu/socmathpsych/)
Questionnaire construction
A questionnaire is a series of questions asked to individuals to obtain statistically useful information about a given topic.[1] When properly constructed and responsibly administered, questionnaires become a vital instrument by which statements can be made about specific groups or people or entire populations. Questionnaires are frequently used in quantitative marketing research and social research. They are a valuable method of collecting a wide range of information from a large number of individuals, often referred to as respondents. Adequate questionnaire construction is critical to the success of a survey. Inappropriate questions, incorrect ordering of questions, incorrect scaling, or bad questionnaire format can make the survey valueless, as it may not accurately reflect the views and opinions of the participants. A useful method for checking a questionnaire and making sure it is accurately capturing the intended information is to pretest among a smaller subset of target respondents.
Unneeded questions are an expense to the researcher and an unwelcome imposition on the respondents. All questions should contribute to the objective(s) of the research. If you "research backwards" and determine what you want to say in the report (e.g., Package A is preferred by X% of the sample vs. Package B, and by Y% compared to Package C), then even though you don't know the exact answers yet, you will be certain to ask all the questions you need, and only the ones you need, in such a way (metrics) that you can write your report.
The topics should fit the respondents' frame of reference, since their background may affect their interpretation of the questions. Respondents should have enough information or expertise to answer the questions truthfully.
The type of scale, index, or typology to be used must be determined. The level of measurement determines what you can do with and conclude from the data. If the response option is yes/no, then you will only know how many or what percent of your sample answered yes or no; you cannot conclude what the average respondent answered.
The types of questions (closed, multiple-choice, open) should fit the statistical data analysis techniques available and the goals of the research.
Questions and prepared responses to choose from should be neutral as to intended outcome. A biased question or questionnaire encourages respondents to answer one way rather than another.[2] Even questions without bias may leave respondents with expectations.
The order or "natural" grouping of questions is often relevant; prior questions may bias later questions.
The wording should be kept simple: no technical or specialized words. The meaning should be clear. Ambiguous words, equivocal sentence structures, and negatives may cause misunderstanding, possibly invalidating questionnaire results. Double negatives should be reworded as positives.
If a survey question actually contains more than one issue, the researcher will not know which one the respondent is answering. Care should be taken to ask one question at a time.
The list of possible responses should be collectively exhaustive; respondents should not find themselves with no category that fits their situation. One solution is to use a final category for "other ________".
The possible responses should also be mutually exclusive; categories should not overlap. Respondents should not find themselves in more than one category, for example in both the "married" category and the "single" category (there may be a need for separate questions on marital status and living situation).
Writing style should be conversational, yet concise, accurate, and appropriate to the target audience.
Many people will not answer personal or intimate questions. For this reason, questions about age, income, marital status, etc. are generally placed at the end of the survey. This way, even if the respondent refuses to answer these "personal" questions, he or she will have already answered the research questions.
"Loaded" questions evoke emotional responses and may skew results.
Presentation of the questions on the page (or computer screen) and the use of white space, colors, pictures, charts, or other graphics may affect the respondent's interest or distract from the questions. Numbering of questions may be helpful.
Questionnaires can be administered by research staff, by volunteers, or self-administered by the respondents. Clear, detailed instructions are needed in each case, matching the needs of each audience.
Methods of collection
Benefits and cautions by method:

Mail: Low cost-per-response. Mail is subject to postal delays, which can be substantial when posting to remote areas, or during unpredictable events such as natural disasters. Survey participants can choose to remain anonymous. It is not labour intensive.

Telephone: Questionnaires can be conducted swiftly. Rapport with respondents; high response rate. Be careful that the sampling frame (i.e., where the phone numbers come from) does not skew the sample. For example, if the numbers are selected from a phone book, the sample necessarily excludes people who only have a mobile phone, those who requested an unpublished phone number, and individuals who have recently moved to the area, because none of these people will be in the book. Telephone interviews are more prone to social desirability biases than other modes, so they are generally not suitable for sensitive topics.[3][4]

Electronic: This method has a low ongoing cost, and on most surveys costs nothing for the participants and little for the surveyors. However, initial set-up costs can be high for a customised design, due to the effort required in developing the back-end system or programming the questionnaire itself. Questionnaires can be conducted swiftly, without postal delays. Survey participants can choose to remain anonymous, though they risk being tracked through cookies, unique links, and other technology. It is not labour intensive. Questions can be more detailed, as opposed to the limits of paper or telephones. This method works well if the survey contains several branching questions. Help or instructions can be dynamically displayed with the question as needed, and automatic sequencing means the computer can determine the next question, rather than relying on respondents to correctly follow skip instructions. Not all of the sample may be able to access the electronic form, and therefore results may not be representative of the target population.

Personally administered: Questions can be more detailed and obtain a lot of comprehensive information, as opposed to the limits of paper or telephones. However, respondents are often limited to their working memory: specially designed visual cues (such as prompt cards) may help in some cases. Rapport with respondents is generally higher than in other modes, and the response rate is typically higher as well. It can be extremely expensive and time-consuming to train and maintain an interviewer panel, and each interview has a marginal cost associated with collecting the data. The sample is usually a convenience (rather than a statistical or representative) sample, so the results cannot be generalized; however, use of rigorous selection methods (e.g. those used by national statistical organisations) can result in a much more representative sample.
Types of questions
1. Contingency questions - A question that is answered only if the respondent gives a particular response to a previous question. This avoids asking questions of people to whom they do not apply (for example, asking men if they have ever been pregnant); a sketch of how such branching can be encoded appears after this list.
2. Matrix questions - Identical response categories are assigned to multiple questions. The questions are placed one under the other, forming a matrix with response categories along the top and a list of questions down the side. This is an efficient use of page space and of respondents' time.
3. Closed-ended questions - Respondents' answers are limited to a fixed set of responses. Most scales are closed-ended. Types of closed-ended questions include:
   Yes/no questions - The respondent answers with a "yes" or a "no".
   Multiple choice - The respondent has several options from which to choose.
   Scaled questions - Responses are graded on a continuum (for example: rate the appearance of the product on a scale from 1 to 10, with 10 being the most preferred appearance). Examples of types of scales include the Likert scale, semantic differential scale, and rank-order scale (see scale for a complete list of scaling techniques).
4. Open-ended questions - No options or predefined categories are suggested. The respondent supplies his or her own answer without being constrained by a fixed set of possible responses. Types of open-ended questions include:
   Completely unstructured - For example, "What is your opinion on questionnaires?"
   Word association - Words are presented and the respondent mentions the first word that comes to mind.
   Sentence completion - Respondents complete an incomplete sentence. For example, "The most important consideration in my decision to buy a new house is . . ."
   Story completion - Respondents complete an incomplete story.
   Picture completion - Respondents fill in an empty conversation balloon.
   Thematic apperception test - Respondents explain a picture or make up a story about what they think is happening in the picture.
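As an illustration of how contingency (branching) questions can be encoded, here is a minimal Python sketch; the question texts, keys, and branching rules are invented for the example:

    # Each question may name a follow-up that is asked only when the
    # respondent gives a particular answer (a contingency question).
    questions = {
        "employed": {"text": "Are you currently employed?",
                     "branch": {"yes": "hours"}},
        "hours": {"text": "How many hours per week do you work?",
                  "branch": {}},
    }

    def administer(order, answers):
        asked = []
        for key in order:
            asked.append(key)
            follow_up = questions[key]["branch"].get(answers.get(key))
            if follow_up:
                asked.append(follow_up)
        return asked

    print(administer(["employed"], {"employed": "yes"}))  # asks both
    print(administer(["employed"], {"employed": "no"}))   # skips "hours"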
Question sequence
Questions should flow logically from one to the next, and the researcher must ensure that the answer to a question is not influenced by previous questions. In general:
Questions should flow from the more general to the more specific.
Questions should flow from the least sensitive to the most sensitive.
Questions should flow from factual and behavioral questions to attitudinal and opinion questions.
Questions should flow from unaided to aided questions.
According to the three-stage theory (also called the sandwich theory), initial questions should be screening and rapport questions. In the second stage you ask all the product-specific questions, and in the last stage you ask demographic questions.
References
[1] Merriam-Webster's Online Dictionary, s.v. "questionnaire," http://www.merriam-webster.com/dictionary/questionnaire (accessed May 21, 2008)
[2] Timothy R. Graeff, 2005. "Response Bias," Encyclopedia of Social Measurement, pp. 411-418. ScienceDirect. (http://www.sciencedirect.com/science/article/pii/B0123693985000372)
[3] Frauke Kreuter, Stanley Presser, and Roger Tourangeau, 2008. "Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity", Public Opinion Quarterly, 72(5): 847-865. First published online January 26, 2009.
[4] Allyson L. Holbrook, Melanie C. Green and Jon A. Krosnick, 2003. "Telephone versus Face-to-Face Interviewing of National Probability Samples with Long Questionnaires: Comparisons of Respondent Satisficing and Social Desirability Response Bias". Public Opinion Quarterly, 67(1): 79-125.
External links
How to ask questions for better survey response (http://www.sensorpro.net/SurveyGuidelines.pdf) (SensorPro)
Rasch model
Rasch models are used for analyzing categorical data from assessments to measure variables such as abilities, attitudes, and personality traits. For example, they may be used to estimate a student's reading ability from answers to questions on a reading assessment, or the extremity of a person's attitude to capital punishment from responses on a questionnaire. Rasch models are particularly used in psychometrics, the field concerned with the theory and technique of psychological and educational measurement. In addition, they are increasingly being used in other areas, including the health profession and market research because of their general applicability. The mathematical theory underlying Rasch models is a special case of item response theory and, more generally, a special case of a generalized linear model. However, there are important differences in the interpretation of the model parameters and its philosophical implications [1] that separate proponents of the Rasch model from the item response modeling tradition. A central aspect of this divide relates to the role of specific objectivity [2], a defining property of the Rasch model according to Georg Rasch, as a requirement for successful measurement. Application of the models provides diagnostic information regarding how well the criterion is met. Application of the models can also provide information about how well items or questions on assessments work to measure the ability or trait. Prominent advocates of Rasch models include Benjamin Drake Wright, David Andrich and Erling Andersen.
Overview
The Rasch model for measurement
In the Rasch model, the probability of a specified response (e.g. right/wrong answer) is modeled as a function of person and item parameters. Specifically, in the simple Rasch model, the probability of a correct response is modeled as a logistic function of the difference between the person and item parameter. The mathematical form of the model is provided later in this article. In most contexts, the parameters of the model pertain to the level of a quantitative trait possessed by a person or item. For example, in educational tests, item parameters pertain to the difficulty of items while person parameters pertain to the ability or attainment level of people who are assessed. The higher a person's ability relative to the difficulty of an item, the higher the probability of a correct response on that item. When a person's location on the latent trait is equal to the difficulty of the item, there is by definition a 0.5 probability of a correct response in the Rasch model.

The purpose of applying the model is to obtain measurements from categorical response data. Estimation methods are used to obtain estimates from matrices of response data based on the model (Linacre, 1999). A Rasch model is a model in one sense in that it represents the structure which data should exhibit in order to obtain measurements from the data; i.e. it provides a criterion for successful measurement. Beyond data, Rasch's equations model relationships we expect to obtain in the real world. For instance, education is intended to prepare children for the entire range of challenges they will face in life, and not just those that appear in textbooks or on tests. By requiring measures to remain the same (invariant) across different tests measuring the same thing, Rasch models make it possible to test the hypothesis that the particular challenges posed in a curriculum and on a test coherently represent the infinite population of all possible challenges in that domain. A Rasch model is therefore a model in the sense of an ideal or standard that provides a heuristic fiction serving as a useful organizing principle even when it is never actually observed in practice.

The perspective or paradigm underpinning the Rasch model is distinctly different from the perspective underpinning statistical modelling. Models are most often used with the intention of describing a set of data. Parameters are modified and accepted or rejected based on how well they fit the data. In contrast, when the Rasch model is employed, the objective is to obtain data which fit the model (Andrich, 2004; Wright, 1984, 1999). The rationale for this perspective is that the Rasch model embodies requirements which must be met in order to obtain measurement, in the sense that measurement is generally understood in the physical sciences.

A useful analogy for understanding this rationale is to consider objects measured on a weighing scale. Suppose the weight of an object A is measured as being substantially greater than the weight of an object B on one occasion, then immediately afterward the weight of object B is measured as being substantially greater than the weight of object A. A property we require of measurements is that the resulting comparison between objects should be the same, or invariant, irrespective of other factors. This key requirement is embodied within the formal structure of the Rasch model. Consequently, the Rasch model is not altered to suit data.
Instead, the method of assessment should be changed so that this requirement is met, in the same way that a weighing scale should be rectified if it gives different comparisons between objects upon separate measurements of the objects. Data analysed using the model are usually responses to conventional items on tests, such as educational tests with right/wrong answers. However, the model is a general one, and can be applied wherever discrete data are obtained with the intention of measuring a quantitative attribute or trait.
Scaling
When all test-takers have an opportunity to attempt all items on a single test, each total score on the test maps to a unique estimate of ability, and the greater the total, the greater the ability estimate. Total scores do not have a linear relationship with ability estimates. Rather, the relationship is non-linear, as shown in Figure 1, in which the total score is shown on the vertical axis and the corresponding person location estimate on the horizontal axis.

(Figure 1: Test characteristic curve showing the relationship between total score on a test and person location estimate.)

For the particular test on which the test characteristic curve (TCC) shown in Figure 1 is based, the relationship is approximately linear throughout the range of total scores from about 10 to 33. The shape of the TCC is generally somewhat sigmoid, as in this example. However, the precise relationship between total scores and person location estimates depends on the distribution of items on the test. The TCC is steeper in ranges on the continuum in which there are a number of items, such as in the range on either side of 0 in Figures 1 and 2.

In applying the Rasch model, item locations are often scaled first, based on methods such as those described below. This part of the process of scaling is often referred to as item calibration. In educational tests, the smaller the proportion of correct responses, the higher the difficulty of an item and hence the higher the item's scale location. Once item locations are scaled, the person locations are measured on the scale. As a result, person and item locations are estimated on a single scale as shown in Figure 2.
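To make the relationship concrete: the expected total score at a given person location is simply the sum of the model probabilities of a correct response across the items. A minimal Python sketch, with item difficulties invented for the example:

    import numpy as np

    def rasch_p(beta, delta):
        """Rasch probability of success for person location beta and item difficulty delta."""
        return np.exp(beta - delta) / (1 + np.exp(beta - delta))

    item_difficulties = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])

    def expected_total(beta):
        # Expected total score = sum of success probabilities over items,
        # which traces out the sigmoid test characteristic curve.
        return rasch_p(beta, item_difficulties).sum()

    for beta in (-3, -1, 0, 1, 3):
        print(beta, round(expected_total(beta), 2))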
When responses of a person are listed according to item difficulty, from lowest to highest, the most likely pattern is a Guttman pattern or vector, i.e. {1,1,...,1,0,0,0,...,0}. However, while this pattern is the most probable given the structure of the Rasch model, the model requires only probabilistic Guttman response patterns; that is, patterns which tend toward the Guttman pattern. It is unusual for responses to conform strictly to the pattern because there are many possible patterns, and it is unnecessary for responses to conform strictly to the pattern in order for data to fit the Rasch model.

(Figure 3: ICCs for a number of items. ICCs are coloured to highlight the change in the probability of a successful response for a person with ability location at the vertical line. The person is likely to respond correctly to the easiest items (with locations to the left and higher curves) and unlikely to respond correctly to difficult items (locations to the right and lower curves).)

Each ability estimate has an associated standard error of measurement, which quantifies the degree of uncertainty associated with the ability estimate. Item estimates also have standard errors. Generally, the standard errors of item estimates are considerably smaller than the standard errors of person estimates because there are usually more response data for an item than for a person. That is, the number of people attempting a given item is usually greater than the number of items attempted by a given person. Standard errors of person estimates are smaller where the slope of the ICC is steeper, which is generally through the middle range of scores on a test. Thus, there is greater precision in this range since the steeper the slope, the greater the distinction between any two points on the line.

Statistical and graphical tests are used to evaluate the correspondence of data with the model. Certain tests are global, while others focus on specific items or people. Certain tests of fit provide information about which items can be used to increase the reliability of a test by omitting or correcting problems with poor items. In Rasch measurement the person separation index is used instead of reliability indices; however, the person separation index is analogous to a reliability index. The separation index is a summary of the genuine separation as a ratio to separation including measurement error. As mentioned earlier, the level of measurement error is not uniform across the range of a test, but is generally larger for more extreme scores (low and high).
The brief outline above highlights certain distinctive and interrelated features of Rasch's perspective on social measurement, which are as follows:
1. He was concerned principally with the measurement of individuals, rather than with distributions among populations.
2. He was concerned with establishing a basis for meeting a priori requirements for measurement deduced from physics and, consequently, did not invoke any assumptions about the distribution of levels of a trait in a population.
3. Rasch's approach explicitly recognizes that it is a scientific hypothesis that a given trait is both quantitative and measurable, as operationalized in a particular experimental context.

Thus, congruent with the perspective articulated by Thomas Kuhn in his 1961 paper The function of measurement in modern physical science, measurement was regarded both as being founded in theory, and as being instrumental to detecting quantitative anomalies incongruent with hypotheses related to a broader theoretical framework. This perspective is in contrast to that generally prevailing in the social sciences, in which data such as test scores are directly treated as measurements without requiring a theoretical foundation for measurement. Although this contrast exists, Rasch's perspective is actually complementary to the use of statistical analysis or modelling that requires interval-level measurements, because the purpose of applying a Rasch model is to obtain such measurements. Applications of Rasch models are described in a wide variety of sources, including Sivakumar, Curtis & Hungi (2005), Bezruzcko (2005), Bond & Fox (2007), Fisher & Wright (1994), Masters & Keeves (1999), and the Journal of Applied Measurement.
Rasch pointed out that the principle of invariant comparison is characteristic of measurement in physics, using, by way of example, a two-way experimental frame of reference in which each instrument exerts a mechanical force upon solid bodies to produce acceleration. Rasch (1960/1980, pp. 112-123) stated of this context: "Generally: If for any two objects we find a certain ratio of their accelerations produced by one instrument, then the same ratio will be found for any other of the instruments". It is readily shown that Newton's second law entails that such ratios are inversely proportional to the ratios of the masses of the bodies.
The Rasch model for dichotomous data takes the form:

    Pr{X_ni = 1} = exp(β_n − δ_i) / (1 + exp(β_n − δ_i)),

where β_n is the ability of the relevant person, δ_i is the difficulty of the attainment item, and Pr{X_ni = 1} is the probability of success upon interaction between the relevant person and assessment item. It is readily shown that the log odds, or logit, of a correct response by a person to an item, based on the model, is equal to β_n − δ_i. It can be shown that the log odds of a correct response by a person to one item, conditional on a correct response to one of two items, is equal to the difference between the item locations. For example,

    Pr{X_n1 = 1 | r_n = 1} = exp(δ_2 − δ_1) / (1 + exp(δ_2 − δ_1)),

where r_n is the total score of person n over the two items, which implies a correct response to one or other of the items (Andersen, 1977; Rasch, 1960; Andrich, 2010). Hence, the conditional log odds does not involve the person parameter β_n, which can therefore be eliminated by conditioning on the total score r_n. That is, by partitioning the responses according to raw scores and calculating the log odds of a correct response, an estimate of the difference between item locations is obtained without involvement of β_n. More generally, a number of item parameters can be estimated iteratively through application of a process such as Conditional Maximum Likelihood estimation (see Rasch model estimation). While more involved, the same fundamental principle applies in such estimations.

The ICC of the Rasch model for dichotomous data is shown in Figure 4. The grey line maps a person with a location of approximately 0.2 on the latent continuum to the probability of the discrete outcome for items with different locations on the latent continuum. The location of an item is, by definition, the location at which the probability that X_ni = 1 is equal to 0.5.

(Figure 4: ICC for the Rasch model showing the comparison between observed and expected proportions correct for five Class Intervals of persons.)

In Figure 4, the black circles represent the actual or observed proportions of persons within Class Intervals for which the outcome was observed. For example, in the case of an assessment item used in the context of educational psychology, these could represent the proportions of persons who answered the item correctly. Persons are ordered by the estimates of their locations on the latent continuum and classified into Class Intervals on this basis in order to graphically inspect the accordance of observations with the
model. There is a close conformity of the data with the model. In addition to graphical inspection of data, a range of statistical tests of fit are used to evaluate whether departures of observations from the model can be attributed to random effects alone, as required, or whether there are systematic departures from the model.
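Here is a sketch of the kind of graphical check just described, with data simulated for illustration: persons are grouped into class intervals by location, and the observed proportion correct on an item in each interval is compared with the model's expected proportion:

    import numpy as np

    rng = np.random.default_rng(1)
    delta = 0.0                           # difficulty of the item inspected
    betas = rng.normal(0, 1.5, 2000)      # person locations (treated as known)
    p = np.exp(betas - delta) / (1 + np.exp(betas - delta))
    responses = rng.random(2000) < p      # simulated right/wrong responses

    # Five class intervals of persons, ordered by location.
    edges = np.quantile(betas, [0, .2, .4, .6, .8, 1.0])
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (betas >= lo) & (betas <= hi)
        mid = betas[mask].mean()
        expected = np.exp(mid - delta) / (1 + np.exp(mid - delta))
        print(f"interval mean {mid:+.2f}: "
              f"observed {responses[mask].mean():.2f}, expected {expected:.2f}")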
Other considerations
A criticism of the Rasch model is that it is overly restrictive or prescriptive because it does not permit each item to have a different discrimination. A criticism specific to the use of multiple choice items in educational assessment is that there is no provision in the model for guessing, because the left asymptote always approaches a zero probability in the Rasch model. These variations are available in models such as the two- and three-parameter logistic models (Birnbaum, 1968). However, the specification of uniform discrimination and zero left asymptote are necessary properties of the model in order to sustain sufficiency of the simple, unweighted raw score.

Verhelst & Glas (1995) derive Conditional Maximum Likelihood (CML) equations for a model they refer to as the One Parameter Logistic Model (OPLM). In algebraic form it appears to be identical with the 2PL model, but OPLM contains preset discrimination indexes rather than 2PL's estimated discrimination parameters. As noted by these authors, though, the problem one faces in estimation with estimated discrimination parameters is that the discriminations are unknown, meaning that the weighted raw score "is not a mere statistic, and hence it is impossible to use CML as an estimation method" (Verhelst & Glas, 1995, p. 217). That is, sufficiency of the weighted "score" in the 2PL cannot be used according to the way in which a sufficient statistic is defined. If the weights are imputed instead of being estimated, as in OPLM, conditional estimation is possible and some of the properties of the Rasch model are retained (Verhelst, Glas & Verstralen, 1995; Verhelst & Glas, 1995). In OPLM, the values of the discrimination index are restricted to between 1 and 15. A limitation of this approach is that in practice, values of discrimination indexes must be preset as a starting point, which means that some type of estimation of discrimination is involved when the purpose is to avoid doing so.

The Rasch model for dichotomous data inherently entails a single discrimination parameter which, as noted by Rasch (1960/1980, p. 121), constitutes an arbitrary choice of the unit in terms of which magnitudes of the latent trait are expressed or estimated. However, the Rasch model requires that the discrimination is uniform across interactions between persons and items within a specified frame of reference (i.e. the assessment context given conditions for assessment).
Notes
[1] Linacre J.M. (2005). Rasch dichotomous model vs. One-parameter Logistic Model. Rasch Measurement Transactions, 19:3, 1032.
[2] Rasch, G. (1977). On Specific Objectivity: An attempt at formalizing the request for generality and validity of scientific statements. The Danish Yearbook of Philosophy, 14, 58-93.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology, pp. 321-334 in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, IV. Berkeley, California: University of California Press. Available free from Project Euclid (http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.bsmsp/1200512895&page=record)
Verhelst, N.D. and Glas, C.A.W. (1995). The one parameter logistic model. In G.H. Fischer and I.W. Molenaar (Eds.), Rasch Models: Foundations, recent developments, and applications (pp. 215-238). New York: Springer Verlag.
Verhelst, N.D., Glas, C.A.W. and Verstralen, H.H.F.M. (1995). One parameter logistic model (OPLM). Arnhem: CITO.
von Davier, M., & Carstensen, C. H. (2007). Multivariate and Mixture Distribution Rasch Models: Extensions and Applications. New York: Springer.
Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288 (http://www.rasch.org/memo41.htm).
Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104). Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Wright, B.D., & Stone, M.H. (1979). Best Test Design. Chicago, IL: MESA Press.
Wu, M. & Adams, R. (2007). Applying the Rasch model to psycho-social measurement: A practical approach. Melbourne, Australia: Educational Measurement Solutions. Available free from Educational Measurement Solutions (http://www.edmeasurement.com.au/Learning.html)
External links
Institute for Objective Measurement Online Rasch Resources (http://www.rasch.org/memos.htm)
Pearson Psychometrics Laboratory, with information about Rasch models (http://www.education.uwa.edu.au/ppl)
Journal of Applied Measurement (http://www.jampress.org)
Journal of Outcome Measurement (all issues available for free downloading) (http://www.jampress.org/JOM.htm)
Berkeley Evaluation & Assessment Research Center (ConstructMap software) (http://bearcenter.berkeley.edu)
Directory of Rasch Software, freeware and paid (http://www.rasch.org/software.htm)
IRT Modeling Lab at U. Illinois Urbana Champ. (http://work.psych.uiuc.edu/irt/)
National Council on Measurement in Education (NCME) (http://www.ncme.org)
Rasch analysis (http://www.rasch-analysis.com/)
Rasch Measurement Transactions (http://www.rasch.org/rmt/contents.htm)
The Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html)
Rasch model estimation
The Rasch model for dichotomous data takes the form:

    Pr{X_ni = 1} = exp(β_n − δ_i) / (1 + exp(β_n − δ_i)),

where β_n is the ability of person n and δ_i is the difficulty of item i.

In Joint Maximum Likelihood (JML) estimation, person and item parameters are estimated simultaneously from the log-likelihood of the complete response matrix, where N is the total number of persons and I is the total number of items. Solution equations are obtained by taking the partial derivatives of the log-likelihood with respect to β_n and δ_i and setting them equal to 0. The JML solution equations are:

    r_i = Σ_n exp(β_n − δ_i) / (1 + exp(β_n − δ_i))   and   s_n = Σ_i exp(β_n − δ_i) / (1 + exp(β_n − δ_i)),

where r_i = Σ_n x_ni is the raw score for item i and s_n = Σ_i x_ni is the raw score for person n. A less biased estimate of each item parameter is obtained by multiplying the JML estimates by (I − 1)/I.

In Conditional Maximum Likelihood (CML) estimation, the person parameters are eliminated by conditioning on total scores. The conditional probability of a response pattern, given a total score of r, is

    Pr{(x_ni) | r_n = r} = Π_i exp(−x_ni δ_i) / γ_r,

in which γ_r is the elementary symmetric function of order r, which represents the sum over all combinations of r items. For example, in the case of three items,

    γ_2 = exp(−δ_1 − δ_2) + exp(−δ_1 − δ_3) + exp(−δ_2 − δ_3).
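The elementary symmetric functions can be computed efficiently with a standard recursive scheme rather than by enumerating combinations; a small Python sketch, with difficulties invented for the example:

    import numpy as np

    def elementary_symmetric(deltas):
        """gamma[r] = sum over all r-item combinations of exp(-(sum of their difficulties))."""
        eps = np.exp(-np.asarray(deltas))
        gamma = np.zeros(len(eps) + 1)
        gamma[0] = 1.0
        for e in eps:                       # incorporate items one at a time
            for r in range(len(gamma) - 1, 0, -1):
                gamma[r] += e * gamma[r - 1]
        return gamma

    print(elementary_symmetric([-0.5, 0.0, 0.5]))  # gamma_0 .. gamma_3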
Estimation algorithms
Some kind of expectation-maximization algorithm is used in the estimation of the parameters of Rasch models. Algorithms for implementing Maximum Likelihood estimation commonly employ Newton-Raphson iterations to solve for solution equations obtained from setting the partial derivatives of the log-likelihood functions equal to 0. Convergence criteria are used to determine when the iterations cease. For example, the criterion might be that the mean item estimate changes by less than a certain value, such as 0.001, between one iteration and another for all items.
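As a compressed illustration (a sketch, not any particular program's algorithm), the following Python code alternates Newton-Raphson updates for the person and item parameters of the dichotomous Rasch model until the largest parameter change falls below a convergence criterion of 0.001. The data are simulated, and persons with extreme raw scores, which have no finite estimates, are dropped:

    import numpy as np

    rng = np.random.default_rng(2)
    true_beta = rng.normal(0, 1, 500)            # simulated person abilities
    true_delta = np.linspace(-1.5, 1.5, 10)      # simulated item difficulties
    P_true = 1 / (1 + np.exp(-(true_beta[:, None] - true_delta[None, :])))
    X = (rng.random(P_true.shape) < P_true).astype(float)

    # Persons scoring 0 or the maximum have no finite ML estimate.
    keep = (X.sum(axis=1) > 0) & (X.sum(axis=1) < X.shape[1])
    X = X[keep]

    beta = np.zeros(X.shape[0])
    delta = np.zeros(X.shape[1])
    for _ in range(100):
        P = 1 / (1 + np.exp(-(beta[:, None] - delta[None, :])))
        W = P * (1 - P)                          # second-derivative weights
        step_beta = (X - P).sum(axis=1) / W.sum(axis=1)   # Newton-Raphson steps
        step_delta = -(X - P).sum(axis=0) / W.sum(axis=0)
        beta += step_beta
        delta += step_delta
        delta -= delta.mean()                    # fix the origin of the scale
        if max(abs(step_beta).max(), abs(step_delta).max()) < 0.001:
            break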
References
Linacre, J.M. (2004). Estimation methods for Rasch measures. Chapter 2 in E.V. Smith & R. M. Smith (Eds.) Introduction to Rasch Measurement. Maple Grove MN: JAM Press. Linacre, J.M. (2004). Rasch model estimation: further topics. Chapter 24 in E.V. Smith & R. M. Smith (Eds.) Introduction to Rasch Measurement. Maple Grove MN: JAM Press.
Rating scale
Concerning rating scales as systems of educational marks, see articles about education in different countries (named "Education in ..."), for example, Education in Ukraine. Concerning rating scales used in the practice of medicine, see articles about diagnoses, for example, Major depressive disorder.
An example of a common type of rating scale, the "rate this with 1 to 5 stars" model. This example is from Wikipedia's user-survey efforts.
A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative attribute. In the social sciences, common examples are the Likert scale and 1-10 rating scales in which a person selects the number which is considered to reflect the perceived quality of a product.
Background
A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the rated object, as a measure of some rated attribute.
Validity
With each user rating a product only once, for example in a category from 1 to 10, there is no means for evaluating internal reliability using an index such as Cronbach's alpha. It is therefore impossible to evaluate the validity of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent). The degree of validity of an instrument is determined through the application of logic and/or statistical procedures: a measurement procedure is valid to the degree that it measures what it proposes to measure. Another fundamental issue is that online ratings usually involve convenience sampling, much like television polls, i.e. they represent only the opinions of those inclined to submit ratings.

Validity is concerned with different aspects of the measurement process. Each of the main types of validity (content validity, predictive validity, and construct validity) uses logic, statistical verification, or both to determine the degree of validity, and each has special value under certain conditions.
Sampling
Sampling errors can lead to results which have a specific bias, or are only relevant to a specific subgroup. Consider this example: suppose that a film appeals only to a specialist audience, of whom 90% are devotees of its genre and only 10% are people with a general interest in movies. Assume the film is very popular among the audience that views it, and that only those who feel most strongly about the film are inclined to rate it online; hence the raters are all drawn from the devotees. This combination may lead to very high ratings of the film, which do not generalize beyond the people who actually see the film (or possibly even beyond those who actually rate it).
Qualitative description
Qualitative description of categories improves the usefulness of a rating scale. For example, if only the points 1-10 are given without description, some people may select 10 rarely, whereas others may select the category often. If, instead, "10" is described as "near flawless", the category is more likely to mean the same thing to different people. This applies to all categories, not just the extreme points.

The above issues are compounded when aggregated statistics such as averages are used for lists and rankings of products. User ratings are at best ordinal categorizations. While it is not uncommon to calculate averages or means for such data, doing so cannot be justified, because in calculating averages, equal intervals are required to represent the same difference between levels of perceived quality. The key issues with aggregate data based on the kinds of rating scales commonly used online are as follows:
Averages should not be calculated for data of the kind collected.
It is usually impossible to evaluate the reliability or validity of user ratings.
Products are not compared with respect to explicit, let alone common, criteria.
Only users inclined to submit a rating for a product do so.
Data are not usually published in a form that permits evaluation of the product ratings.

More developed methodologies include Choice Modelling or Maximum Difference methods, the latter being related to the Rasch model due to the connection between Thurstone's law of comparative judgement and the Rasch model.
External links
How to apply Rasch analysis (http://www.rasch-analysis.com/)
Rating scales for depression

The Geriatric Depression Scale (GDS) is another self-administered scale, but in this case it is used for older patients, and for patients with mild to moderate dementia. Instead of presenting a five-category response set, the GDS questions are answered with a simple "yes" or "no".[4]

The Zung Self-Rating Depression Scale is similar to the Geriatric Depression Scale in that the answers are preformatted. In the Zung Self-Rating Depression Scale, there are 20 items: ten positively worded and ten negatively worded. Each question is rated on a scale of 1 through 4 based on four possible answers: "a little of the time", "some of the time", "good part of the time", and "most of the time".

The Patient Health Questionnaire (PHQ) sets are self-reported depression rating scales. For example, the Patient Health Questionnaire-9 (PHQ-9) is a self-reported, 9-question version of the Primary Care Evaluation of Mental Disorders. The Patient Health Questionnaire-2 (PHQ-2) is a shorter version of the PHQ-9 with two screening questions to assess the presence of a depressed mood and a loss of interest or pleasure in routine activities; a positive response to either question indicates that further testing is required.
Usefulness
Screening programs using rating scales to search for candidates for a more in-depth evaluation have been advocated to improve detection of depression, but there is evidence that they do not improve detection rates, treatment, or outcome.[5] There is also evidence that a consensus on the interpretation of rating scales, in particular the Hamilton Rating Scale for Depression, is largely missing, leading to misdiagnosis of the severity of a patient's depression.[6] However, there is evidence that portions of rating scales, such as the somatic section of the PHQ-9, can be useful in predicting outcomes for subgroups of patients like coronary heart disease patients.[7]
References
[8] Zimmerman M. Using scales to monitor symptoms and treatment of depression (measurement based care). In UpToDate, Rose, BD (Ed), UpToDate, Waltham, MA, 2011.
[11] OutcomeTracker (http://www.outcometracker.org/), Clinically Useful Depression Outcome Scale (CUDOS) official website
[13] Inventory of Depressive Symptomatology (IDS) and Quick Inventory of Depressive Symptomatology (QIDS) (http://www.ids-qids.org/), official website
Reliability (psychometrics)
In psychometrics, reliability is used to describe the overall consistency of a measure. A measure is said to have high reliability if it produces similar results under consistent conditions. For example, measurements of people's height and weight are often extremely reliable.[1][2]
Types
There are several general classes of reliability estimates:
Inter-rater reliability assesses the degree to which test scores are consistent when measurements are taken by different people using the same methods.
Test-retest reliability assesses the degree to which test scores are consistent from one test administration to the next. Measurements are gathered from a single rater who uses the same methods or instruments and the same testing conditions.[2] This includes intra-rater reliability.
Inter-method reliability assesses the degree to which test scores are consistent when there is a variation in the methods or instruments used. This allows inter-rater reliability to be ruled out. When dealing with forms, it may be termed parallel-forms reliability.[3]
Internal consistency reliability assesses the consistency of results across items within a test.[3]
General model
In practice, testing measures are never perfectly consistent. Theories of test reliability have been developed to estimate the effects of inconsistency on the accuracy of measurement. The basic starting point for almost all theories of test reliability is the idea that test scores reflect the influence of two sorts of factors:
1. Factors that contribute to consistency: stable characteristics of the individual or the attribute that one is trying to measure
2. Factors that contribute to inconsistency: features of the individual or the situation that can affect test scores but have nothing to do with the attribute being measured
Some of these inconsistencies include:
Temporary but general characteristics of the individual: health, fatigue, motivation, emotional strain
Temporary and specific characteristics of the individual: comprehension of the specific test task, specific tricks or techniques of dealing with the particular test materials, fluctuations of memory, attention or accuracy
Aspects of the testing situation: freedom from distractions, clarity of instructions, interaction of personality, sex, or race of examiner
Chance factors: luck in selection of answers by sheer guessing, momentary distractions
The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores. A true score is the replicable feature of the concept being measured. It is the part of the observed score that would recur across different measurement occasions in the absence of error. Errors of measurement are composed of both random error and systematic error; they represent the discrepancies between scores obtained on tests and the corresponding true scores. This conceptual breakdown is typically represented by the simple equation:

$X = T + E$

where $X$ is the observed test score, $T$ the true score, and $E$ the error of measurement.
This equation suggests that test scores vary as the result of two factors:
1. Variability in true scores
2. Variability due to errors of measurement.
The reliability coefficient provides an index of the relative influence of true and error scores on attained test scores. In its general form, the reliability coefficient is defined as the ratio of true score variance to the total variance of test scores, or, equivalently, as one minus the ratio of error score variance to observed score variance:

$\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}$
Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test. Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-forms reliability. Each method approaches the problem of identifying the source of error in the test somewhat differently.
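To make the true-score model concrete, here is a minimal simulation sketch (Python with NumPy; the normal distributions and variance values are illustrative assumptions, not anything specified in the article). It shows that the correlation between two error-laden administrations recovers the ratio of true-score variance to observed-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Classical model: observed score X = true score T + error E
T = rng.normal(50, 10, n)   # true scores, variance 100
E1 = rng.normal(0, 5, n)    # error on the first administration, variance 25
E2 = rng.normal(0, 5, n)    # independent error on a retest
X1, X2 = T + E1, T + E2

# Theoretical reliability: var(T) / var(X) = 100 / 125 = 0.8
print("variance-ratio reliability:", T.var() / X1.var())

# Test-retest estimate: Pearson correlation between the two administrations
print("test-retest estimate:", np.corrcoef(X1, X2)[0, 1])
```

Under these assumptions both printed values should be close to 0.8.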
Estimation
The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores. Four practical strategies have been developed that provide workable methods of estimating test reliability.
1. Test-retest reliability method: directly assesses the degree to which test scores are consistent from one test administration to the next. It involves:
Administering a test to a group of individuals
Re-administering the same test to the same group at some later time
Correlating the first set of scores with the second
The correlation between scores on the first test and the scores on the retest is used to estimate the reliability of the test, using the Pearson product-moment correlation coefficient; see also item-total correlation.
2. Parallel-forms method: the key to this method is the development of alternate test forms that are equivalent in terms of content, response processes and statistical characteristics. For example, alternate forms exist for several tests of general intelligence, and these tests are generally seen as equivalent. With the parallel test model it is possible to develop two forms of a test that are equivalent in the sense that a person's true score on form A would be identical to their true score on form B. If both forms of the test were administered to a number of people, differences between scores on form A and form B may be due to errors in measurement only. It involves:
Administering one form of the test to a group of individuals
At some later time, administering an alternate form of the same test to the same group of people
Correlating scores on form A with scores on form B
The correlation between scores on the two alternate forms is used to estimate the reliability of the test.
This method provides a partial solution to many of the problems inherent in the test-retest reliability method. For example, since the two forms of the test are different, the carryover effect is less of a problem. Reactivity effects are also partially controlled: although taking the first test may change responses to the second test, it is reasonable to assume that the effect will not be as strong with alternate forms of the test as with two administrations of the same test. However, this technique has its disadvantages:
It may be very difficult to create several alternate forms of a test
It may also be difficult, if not impossible, to guarantee that two alternate forms of a test are parallel measures
3. Split-half method: this method treats the two halves of a measure as alternate forms. It provides a simple solution to the problem that the parallel-forms method faces: the difficulty in developing alternate forms. It involves:
Administering a test to a group of individuals
Splitting the test in half
Correlating scores on one half of the test with scores on the other half of the test
The correlation between these two split halves is used in estimating the reliability of the test. This half-test reliability estimate is then stepped up to the full test length using the Spearman–Brown prediction formula. There are several ways of splitting a test to estimate reliability. For example, a 40-item vocabulary test could be split into two subtests, the first one made up of items 1 through 20 and the second made up of items 21 through 40. However, the responses from the first half may be systematically different from responses in the second half due to an increase in item difficulty and fatigue. In splitting a test, the two halves need to be as similar as possible, both in terms of their content and in terms of the probable state of the respondent. The simplest method is to adopt an odd-even split, in which the odd-numbered items form one half of the test and the even-numbered items form the other. This arrangement guarantees that each half will contain an equal number of items from the beginning, middle, and end of the original test.
4. Internal consistency: assesses the consistency of results across items within a test. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients.[5] Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder–Richardson Formula 20.[5] Although it is the most commonly used measure, there are some misconceptions regarding Cronbach's alpha.[6][7] (A code sketch of the split-half and internal-consistency computations follows after this list.)
These measures of reliability differ in their sensitivity to different sources of error and so need not be equal. Also, reliability is a property of the scores of a measure rather than of the measure itself, and is thus said to be sample-dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population, because the true variability is different in this second population. (This is true of measures of all types: yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)
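The split-half and internal-consistency estimates above lend themselves to a compact implementation. The following is a minimal sketch (Python with NumPy; the function names and the odd-even split are illustrative choices, while the Spearman–Brown step-up and Cronbach's alpha formulas are the standard ones described in this section):

```python
import numpy as np

def spearman_brown(r_half):
    """Step a split-half correlation up to the full test length."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(items):
    """Odd-even split-half estimate; `items` is a (people, items) score matrix."""
    odd = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]
    return spearman_brown(r_half)

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

Both functions operate on the same person-by-item score matrix, so their estimates can be compared directly on one data set.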
Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,[5] and other informal means. However, formal psychometric analysis, called item analysis, is considered the most effective way to increase reliability. This analysis consists of computation of item difficulties and item discrimination indices, the latter index involving computation of correlations between the items and the sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.
References
[2] The Marketing Accountability Standards Board (MASB) endorses this definition as part of its ongoing Common Language: Marketing Activities and Metrics Project (http://www.themasb.org/common-language-project/).
[3] Types of Reliability (http://www.socialresearchmethods.net/kb/reltypes.php), The Research Methods Knowledge Base. Last revised: 20 October 2006.
[5] Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98–104.
[6] Ritter, N. (2010). Understanding a widely misunderstood statistic: Cronbach's alpha. Paper presented at the Southwestern Educational Research Association (SERA) Conference 2010, New Orleans, LA (ED526237).
External links
Uncertainty models, uncertainty quantification, and uncertainty processing in engineering (http://www.uncertainty-in-engineering.net)
The relationships between correlational and internal consistency concepts of test reliability (http://www.visualstatistics.net/Statistics/Principal Components of Reliability/PCofReliability.asp)
The problem of negative reliabilities (http://www.visualstatistics.net/Statistics/Reliability Negative/Negative Reliability.asp)
Repeatability
Repeatability or test-retest reliability[1] is the variation in measurements taken by a single person or instrument on the same item and under the same conditions. A less-than-perfect test-retest reliability causes test-retest variability. Such variability can be caused by, for example, intra-individual variability and intra-observer variability. A measurement may be said to be repeatable when this variation is smaller than some agreed limit. Test-retest variability is used in practice, for example, in the medical monitoring of conditions. In these situations, there is often a predetermined "critical difference"; for differences in monitored values that are smaller than this critical difference, the possibility of test-retest variability as the sole cause of the difference may be considered, in addition to, for example, changes in diseases or treatments.
Establishment
According to the Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results, the following conditions need to be fulfilled in the establishment of repeatability:
the same measurement procedure
the same observer
the same measuring instrument, used under the same conditions
the same location
repetition over a short period of time.
Repeatability methods were developed by Bland and Altman (1986).[2] If the correlation between separate administrations of the test is high (e.g. 0.7 or higher, as in this Cronbach's alpha internal-consistency table[3]), then the test has good test-retest reliability.
The repeatability coefficient is a precision measure which represents the value below which the absolute difference between two repeated test results may be expected to lie with a probability of 95%. The standard deviation under repeatability conditions is part of precision and accuracy.
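A common explicit form of this coefficient, following Bland and Altman's approach, can be written as follows (the within-subject standard deviation is denoted $s_w$ here; the notation is introduced for illustration, not taken from the article):

$RC = 1.96\sqrt{2}\,s_w \approx 2.77\,s_w$

since the difference between two measurements on the same subject has standard deviation $\sqrt{2}\,s_w$, and about 95% of normally distributed differences fall within 1.96 standard deviations of zero.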
Desirability of repeatability
Test-retest reliability is desirable in measures of constructs that are not expected to change over time. For example, if you use a certain method to measure an adult's height, and then do the same again two years later, you would expect a very high correlation; if the results differed by a great deal, you would suspect that the measure was inaccurate. The same is true for personality traits such as extraversion, which are believed to change only very slowly. In contrast, if you were trying to measure mood, you would expect only moderate test-retest reliability, since people's moods are expected to change from day to day. Very high test-retest reliability would be bad, since it would suggest that you were not picking up on these changes.
Psychological testing
Since the same test is administered twice and every test is parallel with itself, differences between scores on the test and scores on the retest should be due solely to measurement error. This sort of argument is quite probably true for many physical measurements. However, this argument is often inappropriate for psychological measurement, since it is often impossible to consider the second administration of a test a parallel measure to the first. The second administration of a psychological test might yield systematically different scores than the first administration for the following reasons:
1. The attribute that is being measured may change between the first test and the retest. For example, a reading test that is administered in September to a third-grade class may yield different results when retaken in June. We would expect some change in children's reading ability over that span of time, so a low test-retest correlation might reflect real changes in the attribute itself.
2. The experience of taking the test itself can change a person's true score. For example, completing an anxiety inventory could serve to increase a person's level of anxiety.
3. Carryover effects, particularly if the interval between test and retest is short. When retested, people may remember their original answers, which could affect their answers on the second administration.
References
[1] Types of Reliability (http://www.socialresearchmethods.net/kb/reltypes.php), The Research Methods Knowledge Base. Last revised: 20 October 2006.
[2] http://www-users.york.ac.uk/~mb55/meas/ba.htm
[3] George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference. 11.0 update (4th ed.). Boston: Allyn & Bacon.
[4] http://www.isixsigma.com/tools-templates/measurement-systems-analysis-msa-gage-rr/attribute-agreement-analysis-defect-databases/
External links
Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results; appendix D (http://physics.nist.gov/Pubs/guidelines/appd.1.html)
Reproducibility
Reproducibility is the ability of an entire experiment or study to be reproduced, either by the same researcher or by someone else working independently. It is one of the main principles of the scientific method. The result values are said to be commensurate if they are obtained (in distinct experimental trials) according to the same reproducible experimental description and procedure. The basic idea can be seen in Aristotle's dictum that there is no scientific knowledge of the individual, where the word used for individual in Greek had the connotation of the idiosyncratic, or wholly isolated occurrence. Thus all knowledge, all science, necessarily involves the formation of general concepts and the invocation of their corresponding symbols in language (cf. Turner). Reproducibility also refers to the degree of agreement between measurements or observations conducted on replicate specimens in different locations by different people, as part of the precision of a test method.[1]
Reproducible data
Reproducibility is one component of the precision of a test method. The other component is repeatability which is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory. Both repeatability and reproducibility are usually reported as a standard deviation. A reproducibility limit is the value below which the difference between two test results obtained under reproducibility conditions may be expected to occur with a probability of approximately 0.95 (95%).[2] Reproducibility is determined from controlled interlaboratory test programs.[3][4]
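By direct analogy with the repeatability coefficient, the reproducibility limit is often computed from the reproducibility standard deviation (denoted $s_R$ here; the notation is introduced for illustration, not taken from the article):

$R = 1.96\sqrt{2}\,s_R \approx 2.77\,s_R$

which is the same 95% bound as the repeatability coefficient, but with the standard deviation taken under reproducibility conditions (different laboratories, observers, or equipment).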
Reproducible research
The term reproducible research refers to the idea that the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper, such as the code, data, etc. necessary for reproduction of the results and for building upon the research.[5][6][7] In 2012, a study found that 47 out of 53 medical research papers on the subject of cancer were irreproducible.[8] John P. A. Ioannidis wrote: "While currently there is unilateral emphasis on 'first' discoveries, there should be as much emphasis on replication of discoveries."[9] While repeatability of scientific experiments is desirable, it is not considered necessary to establish the scientific validity of a theory. For example, the cloning of animals is difficult to repeat, but has been reproduced by various teams working independently, and is a well-established research domain. One failed cloning does not mean that the theory is wrong or unscientific. Repeatability is often low in protosciences.
References
[1] ASTM E177
[2] ASTM E177
[3] ASTM E691 Standard Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method
[4] ASTM F1469 Standard Guide for Conducting a Repeatability and Reproducibility Study on Test Equipment for Nondestructive Testing
[5] Sergey Fomel and Jon Claerbout, "Guest Editors' Introduction: Reproducible Research" (http://www.rrplanet.com/reproducible-research-librum/viewtopic.php?f=30&t=372), Computing in Science and Engineering, vol. 11, no. 1, pp. 5–7, Jan./Feb. 2009.
[6] J. B. Buckheit and D. L. Donoho, "WaveLab and Reproducible Research" (http://www.rrplanet.com/reproducible-research-librum/viewtopic.php?f=30&t=53), Dept. of Statistics, Stanford University, Tech. Rep. 474, 1995.
[7] The Yale Law School Round Table on Data and Core Sharing: "Reproducible Research" (http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2010.113), Computing in Science and Engineering, vol. 12, no. 5, pp. 8–12, Sept/Oct 2010.
[8] http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
[9] Is the spirit of Piltdown man alive and well? (http://www.telegraph.co.uk/technology/3342867/Is-the-spirit-of-Piltdown-man-alive-and-well.html)
[10] Cheney, Margaret (1999), Tesla: Master of Lightning, New York: Barnes & Noble Books, ISBN 0-7607-1005-8, p. 107: "Unable to overcome his financial burdens, he was forced to close the laboratory in 1905."
Turner, William (1903), History of Philosophy, Ginn and Company, Boston, MA, Etext (http://www2.nd.edu/Departments//Maritain/etext/hop.htm). See especially: "Aristotle" (http://www2.nd.edu/Departments//Maritain/etext/hop11.htm).
Definition (PDF) (http://www.iupac.org/goldbook/R05305.pdf), by the International Union of Pure and Applied Chemistry
External links
Reproducible Research in Computational Science (http://www.csee.wvu.edu/~xinl/source.html)
Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results; appendix D (http://physics.nist.gov/Pubs/guidelines/appd.1.html)
Definition of reproducibility in the IUPAC Gold Book (http://goldbook.iupac.org/R05305.html)
Detailed article on Reproducibility (http://arstechnica.com/journals/science.ars/2006/10/25/5744)
Reproducible Research Planet (http://www.rrplanet.com/)
ReproducibleResearch.net (http://www.reproducibleresearch.net)
Riddle scale
The Riddle scale (also known as Riddle homophobia scale or Riddle scale of homophobia) is a psychometric scale that measures the degree to which a person is or is not homophobic. The scale is frequently used in tolerance education about anti-discriminatory attitudes regarding sexual orientation. It is named after its creator, psychologist Dorothy Riddle.
Overview
The Riddle homophobia scale was developed by Dorothy Riddle in 1973–74 while she was overseeing research for the American Psychological Association Task Force on Gays and Lesbians.[1] The scale was distributed at talks and workshops but was not formally published for a long time; it is cited in the literature either as an (unpublished) conference presentation from 1985[2] or as an article from 1994.[3] At the time it was developed, Riddle's analysis was one of the first modern classifications of attitudes towards homosexuality. In that respect, the scale has served the purpose that Riddle originally had in mind: she devised the scale to explicate the continuum of attitudes toward gays and lesbians and to assess the current and desired institutional culture of an organization or a workplace.[4]
Level of measurement
The Riddle scale is an eight-term uni-dimensional Likert-type interval scale with nominal labels and no explicit zero point. Each term is associated with a set of attributes and beliefs; individuals are assigned a position on the scale based on the attributes they exhibit and beliefs they hold. The scale is frequently divided into two parts, the 'homophobic levels of attitude' (first four terms) and the 'positive levels of attitude' (last four terms).[5]
The scale
Repulsion: Homosexuality is seen as a crime against nature. Gays/lesbians are considered sick, crazy, immoral, sinful, wicked, etc. Anything is justified to change them: incarceration, hospitalization, behavior therapy, electroshock therapy, etc.
Pity: Represents heterosexual chauvinism. Heterosexuality is considered more mature and certainly to be preferred. It is believed that any possibility of becoming straight should be reinforced, and those who seem to be born that way should be pitied as less fortunate ("the poor dears").
Tolerance: Homosexuality is viewed as a phase of adolescent development that many people go through and most people grow out of. Thus, lesbians/gays are less mature than straights and should be treated with the protectiveness and indulgence one uses with children who are still maturing. It is believed that lesbians/gays should not be given positions of authority because they are still working through their adolescent behavior.
Acceptance: Still implies that there is something to accept; the existing climate of discrimination is ignored. Characterized by such statements as "You're not lesbian to me, you're a person!" or "What you do in bed is your own business." or "That's fine with me as long as you don't flaunt it!"
Support: People at this level may be uncomfortable themselves, but they are aware of the homophobic climate and the irrational unfairness, and work to safeguard the rights of lesbians and gays.
Admiration: It is acknowledged that being lesbian/gay in our society takes strength. People at this level are willing to truly examine their homophobic attitudes, values, and behaviors.
Appreciation: The diversity of people is considered valuable and lesbians/gays are seen as a valid part of that diversity. People on this level are willing to combat homophobia in themselves and others.
Nurturance: Assumes that gay/lesbian people are indispensable in our society. People on this level view lesbians/gays with genuine affection and delight, and are willing to be their allies and advocates.
Discussion
Riddle's analysis has been credited for pointing out that although 'tolerance' and 'acceptance' can be seen as positive attitudes, they should actually be treated as negative because they can mask underlying fear or hatred (somebody can tolerate a baby crying on an airplane while at the same time wishing that it would stop) or indicate that there is indeed something that we need to accept, and that we are the ones with the power to reject or to accept.[6][7] This observation generalizes to attitude evaluations in other areas besides sexual orientation and is one of the strengths of Riddle's study. Although it deals mostly with adult attitudes towards difference, the model has been positioned in the cognitive developmental tradition of Piaget and Kohlberg's stages of moral development.[8] As a psychometric scale, the Riddle scale has been considered to have acceptable face validity but its exact psychometric properties are unknown.[9][10]
References
[1] Staten Island LGBT history (http://www.silgbtcenter.org/), Staten Island LGBT Community Center. Accessed Dec. 19, 2010.
[2] Riddle, D. I. (1985). Homophobia scale. In: Opening doors to understanding and acceptance: A facilitator's guide for presenting workshops on lesbian and gay issues, workshop organized by Kathy Obear and Amy Reynolds, Boston. Unpublished essay.
[3] Riddle, D. (1994). The Riddle scale. In: Alone no more: Developing a school support system for gay, lesbian and bisexual youth. St. Paul: Minnesota State Department.
[4] Peterkin, A., & Risdon, C. (2003). Caring for lesbian and gay people: A clinical guide. Toronto: University of Toronto Press.
[5] Clauss-Ehlers, C. S. (ed.) (2010). Encyclopedia of Cross-Cultural School Psychology. New York: Springer.
[6] Blumenfeld, W. J. (2000). How homophobia hurts everyone. In: Readings for diversity and social justice. New York: Routledge, 267–275.
[7] Ollis, D. (2004). "I'm just a home economics teacher": Does discipline background impact on teachers' ability to affirm and include gender and sexual diversity in secondary school health education programs? AARE Conference, Melbourne, 2004.
[8] Hirscheld, S. (2001). Moving beyond the safety zone: A staff development approach to anti-heterosexist education. Fordham Urban Law Journal, 29, 611–641.
[9] Finkel, M. J., Storaasli, R. D., Bandele, A., & Schaefer, V. (2003). Diversity training in graduate school: An exploratory evaluation of the safe zone project. Professional Psychology: Research and Practice, 34, 555–561.
[10] Tucker, E. W., & Potocky-Tripodi, M. (2006). Changing heterosexuals' attitudes toward homosexuals: A systematic review of the empirical literature. Research on Social Work Practice, 16(2), 176–190.
Risk Inclination Model
Confidence weighting
The Confidence Weighting (CW) construct is concerned with indices that connect an outside observer to the respondent's inner state of knowledge certainty toward specific content.[1][2][3][4] The CW construct of the Risk Inclination Model is underpinned by the individual's experience of coherence or rightness[5] and is used to calibrate the relationship between a respondent's objective and observable measures of risk taking (i.e., weighted indices toward answer selections) and his or her subjective inner feelings of knowledge certainty (i.e., feelings of rightness).
Restricted context
The restricted context (RC) construct is based on Piaget's theory of equilibration[6] and allows the outside observer to measure the way a respondent manages competing inner states of knowledge certainty during the application of confidence weights among items within the restricted Total Point Value (TPV) context of the test. RC sets the parameters within which risk taking toward knowledge certainty occurs. These parameters are important because they allow an observer to scale, and thereby measure, the respondent's inner state of equilibration among related levels of knowledge certainty. Equilibration is defined as a self-regulatory process that reflects the biological drive to produce an optimal state of balance between a person's cognitive structures (i.e., inner state) and their environment.[7]
Role-based assessment
Modern psychological testing can be traced back to 1908 with the introduction of the first successful intelligence test, the Binet-Simon Scale.[1] From the Binet-Simon came the revised version, the Stanford-Binet, which was used in the development of the Army Alpha and Army Beta tests used by the United States military.[2] During World War I, Robert S. Woodworth developed the Woodworth Personal Data Sheet (WPDS) to determine which soldiers were better prepared to handle the stresses of combat. The WPDS signaled a shift in the focus of psychological testing from intellect to personality.[3] By the 1940s, the quantitative measurement of personality traits had become a central theme in psychology, and it has remained so into the 2000s. During this time, numerous variations and versions of personality tests have been created, including the widely used Myers-Briggs, DISC, and Cattell's 16PF Questionnaire.[4] Role-Based Assessment (RBA) differs significantly from personality testing.[5] Instead of quantifying individual personality factors, RBA's methodology was developed, from its very beginnings, to make qualitative observations of human interaction.[6] In this sense, RBA is a form of behavioral simulation. Understanding the quality of a person's behavior on a team can be a valuable adjunct to other forms of evaluation (such as data on experience, knowledge, skills, and personality) because the ability to successfully cooperate and collaborate with others is fundamental to organizational performance.
Concepts
Coherence
In TGI Role-Based Assessment, Coherence describes a positive and constructive orientation to working with others to achieve common goals, overcome obstacles, and meet organizational needs.[7][8][9]
Role
A person's Role describes their strongest affinity for, or attraction to, serving a certain type of organizational need, e.g., planning for the future vs. executing current tasks vs. preserving and sharing knowledge.[10][11]
Teaming Characteristics
Each RBA report includes a detailed section on Teaming Characteristics, which are derived, in part, from the relationship between a person's level of Coherence and their unique Role (or Roles). As their name suggests, Teaming Characteristics can help managers and coaches to understand how well a person will fit within a team.
Historical Development
Dr. Janice Presser began collaborating with Dr. Jack Gerber in 1988 to develop tools and methods for measuring the fundamental elements of human teaming behavior, with a goal of improving individual and team performance. Their work combines decades of research, blending Dr. Presser's earlier work in family and social relationships with Dr. Gerber's Mosaic Figures test, which had been designed to produce qualitative information on how individuals view other people.[14] Three generations of assessments were developed, tested and used in the context of actual business performance. The initial Executive Behavior Assessment was focused on the behavior of persons with broad responsibility for organizational performance. The second iteration, called the Enhanced Executive Behavior Assessment, incorporated metrics on the behavior of executives working in teams. Drs. Presser and Gerber then successfully applied their testing methodology to team contributors outside of the executive ranks, and as development and testing efforts continued, Role-Based Assessment (RBA) emerged.[15] By 1999, RBA was established as a paper-based assessment and was being sold for use in pre-hire screening and organizational development.[16] Drs. Presser and Gerber formed The Gabriel Institute in 2001, with the goal of making RBA available to a greater audience via the Internet.[17] In mid-2009, TGI Role-Based Assessment™ became generally available as an online assessment instrument. Later in 2009, the Society for Human Resource Management (SHRM) published a two-part white paper by Dr. Presser, which introduced ground-breaking ideas on the measurement and valuation of human synergy in organizations, and an approach to the creation of a strong, positively-oriented human infrastructure.[18][19]
Applications
The most common use of TGI Role-Based Assessment is in pre-hire screening evaluations. RBA's focus on teaming behavior is claimed to offer a different way to predict how an individual will fit with company culture and a given team, and how they are likely to respond to specific job requirements.[20] While other pre-hire testing may run the "risk of violating the ADA" (Americans with Disabilities Act), this does not appear to be an issue with Role-Based Assessment.[21] RBA is also claimed to have unique potential for strengthening a human infrastructure. Results from RBA reports can be aggregated, providing quantitative data that is used for analysis and resolution of team performance problems, and to identify and select candidates for promotion.[22]
References
[1] Santrock, John W. (2008). A Topical Approach to Life-Span Development (4th ed.), Concept of Intelligence (pp. 283–284). New York: McGraw-Hill.
[2] Fancher, R. (1985). The Intelligence Men: Makers of the IQ Controversy. New York: W. W. Norton & Company.
[4] Personality Theories, Types and Tests (http://www.businessballs.com/personalitystylesmodels.htm). Businessballs.com. 2009.
[18] SHRM: The Measurement & Valuation of Human Infrastructure: An Introduction to CHI Indicators (http://www.shrm.org/Research/Articles/Articles/Pages/InfrastructureCHI.aspx)
[19] SHRM: The Measurement & Valuation of Human Infrastructure: An Intro. to the New Way to Know (http://www.shrm.org/Research/Articles/Articles/Pages/New Way to Know.aspx)
[20] Edmonds Wickman, Lindsay. Role-Based Assessment: Thinking Inside the Box (http://talentmgt.com/articles/view/rolebased_assessment_thinking_inside_the_box/3). Talent Management Magazine (October 2008). MediaTec Publishing Inc.
[22] Edmonds Wickman, Lindsay. Role-Based Assessment: Thinking Inside the Box (http://talentmgt.com/articles/view/rolebased_assessment_thinking_inside_the_box/3). Talent Management Magazine (October 2008). MediaTec Publishing Inc.
External links
University of Pennsylvania Journal of Labor and Employment Law (http://www.law.upenn.edu/journals/jbl/articles/volume9/issue1/Gonzales-Frisbie9U.Pa.J.Lab.&Emp.L.185(2006).pdf)
Innovation America: Put Your Money Where Your Team Is! (http://www.innovationamerica.us/index.php/innovation-daily/3780-put-your-money-where-your-team-is-)
National Association of Seed and Venture Funds (NASVF): Make Sure People Will Fit Before You Hire Them (http://www.nasvf.org/index.php?option=com_content&view=article&id=146:make-sure-people-will-fit-nbefore-you-hire-them&catid=5:features&Itemid=38)
Scale (social sciences)

Composite measures
Composite measures of variables are created by combining two or more separate empirical indicators into a single measure. Composite measures measure complex concepts more adequately than single indicators, extend the range of scores available and are more efficient at handling multiple items. In addition to scales, there are two other types of composite measures. Indexes are similar to scales except multiple indicators of a variable are combined into a single measure. The index of consumer confidence, for example, is a combination of several measures of consumer attitudes. A typology is similar to an index except the variable is measured at the nominal level. Indexes are constructed by accumulating scores assigned to individual attributes, while scales are constructed through the assignment of scores to patterns of attributes. While indexes and scales provide measures of a single dimension, typologies are often employed to examine the intersection of two or more dimensions. Typologies are very useful analytical tools and can be easily used as independent variables, although since they are not unidimensional it is difficult to use them as a dependent variable.
Data types
The type of information collected can influence scale construction. Different types of information are measured in different ways.
1. Some data are measured at the nominal level. That is, any numbers used are mere labels: they express no mathematical properties. Examples are SKU inventory codes and UPC bar codes.
2. Some data are measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. An example is a preference ranking.
3. Some data are measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. Examples are attitude scales and opinion scales.
4. Some data are measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include: age, income, price, costs, sales revenue, sales volume, and market share.
Constant sum scale: a respondent is given a constant sum of money, scrip, credits, or points and asked to allocate these to various items (example: if you had 100 yen to spend on food products, how much would you spend on product A, on product B, on product C, etc.). This is an ordinal-level technique.
Magnitude estimation scale: in a psychophysics procedure invented by S. S. Stevens, people simply assign numbers to the dimension of judgment. The geometric mean of those numbers usually produces a power law with a characteristic exponent. In cross-modality matching, instead of assigning numbers, people manipulate another dimension, such as loudness or brightness, to match the items. Typically the exponent of the psychometric function can be predicted from the magnitude estimation exponents of each dimension.
Scale evaluation
Scales should be tested for reliability, generalizability, and validity. Generalizability is the ability to make inferences from a sample to the population, given the scale you have selected. Reliability is the extent to which a scale will produce consistent results. Test-retest reliability checks how similar the results are if the research is repeated under similar circumstances. Alternative forms reliability checks how similar the results are if the research is repeated using different forms of the scale. Internal consistency reliability checks how well the individual measures included in the scale are converted into a composite measure. Scales and indexes have to be validated. Internal validation checks the relation between the individual measures included in the scale and the composite scale itself. External validation checks the relation between the composite scale and other indicators of the variable, indicators not included in the scale. Content validation (also called face validity) checks how well the scale measures what it is supposed to measure. Criterion validation checks how meaningful the scale criteria are relative to other possible criteria. Construct validation checks what underlying construct is being measured. There are three variants of construct validity: convergent validity, discriminant validity, and nomological validity (Campbell and Fiske, 1959; Krus and Ney, 1978). The coefficient of reproducibility indicates how well the data from the individual measures included in the scale can be reconstructed from the composite scale.
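For reference, the coefficient of reproducibility mentioned above is usually computed, in Guttman scaling, as follows (the formula reflects standard usage and is stated here because the article itself does not give it):

$C_R = 1 - \frac{\text{number of scaling errors}}{\text{total number of responses}}$

with values of about 0.90 or higher conventionally taken to indicate an acceptable scale.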
Further reading
DeVellis, Robert F. (2003), Scale Development: Theory and Applications [1] (2nd ed.), London: SAGE Publications, ISBN 0-7619-2604-6 (cloth); paperback ISBN 0-7619-2605-4. Retrieved 11 August 2010.
Lodge, Milton (1981), Magnitude Scaling: Quantitative Measurement of Opinions, Beverly Hills & London: SAGE Publications, ISBN 0-8039-1747-3
McIver, John P. & Carmines, Edward G. (1981), Unidimensional Scaling [2], Beverly Hills & London: SAGE Publications, ISBN 0-8039-1736-8. Retrieved 11 August 2010.
References
Bradley, R. A. & Terry, M. E. (1952): Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika, 39, 324–345.
Campbell, D. T. & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Hodge, D. R. & Gillespie, D. F. (2003). Phrase Completions: An alternative to Likert scales. Social Work Research, 27(1), 45–55.
Hodge, D. R. & Gillespie, D. F. (2005). Phrase Completion Scales. In K. Kempf-Leonard (Ed.), Encyclopedia of Social Measurement (Vol. 3, pp. 53–62). San Diego: Academic Press.
Krus, D. J. & Kennedy, P. H. (1977). Normal scaling of dominance matrices: The domain-referenced model. Educational and Psychological Measurement, 37, 189–193 (Request reprint). [3]
Krus, D. J. & Ney, R. G. (1978). Convergent and discriminant validity in item analysis. Educational and Psychological Measurement, 38, 135–137 (Request reprint). [4]
Luce, R. D. (1959): Individual Choice Behaviours: A Theoretical Analysis. New York: J. Wiley.
External links
Handbook of Management Scales: multi-item metrics to be used in research (Wikibooks) [5]
References
[1] http://books.google.com/books?id=BYGxL6xLokUC&printsec=frontcover&dq=scale+development#v=onepage&q&f=false
[2] http://books.google.com/books?id=oL8xP7EX9XIC&printsec=frontcover&dq=unidimensional+scaling#v=onepage&q&f=false
[3] http://www.visualstatistics.net/Scaling/Domain%20Referenced%20Scaling/Domain-Referenced%20Scaling.htm
[4] http://www.visualstatistics.net/Statistics/Item%20Analysis%20CD%20Validity/Item%20Analysis%20CD%20Validity.htm
[5] http://en.wikibooks.org/wiki/Handbook_of_Management_Scales
Self-report inventory
A self-report inventory is a type of psychological test in which a person fills out a survey or questionnaire with or without the help of an investigator. Self-report inventories often ask direct questions about symptoms, behaviors, and personality traits associated with one or many mental disorders or personality types in order to easily gain insight into a patient's personality or illness. Most self-report inventories can be taken or administered within five to 15 minutes, although some, like the Minnesota Multiphasic Personality Inventory (MMPI), can take up to three hours to fully complete. There are three major approaches to developing self-report inventories: theory-guided, factor-analytic, and criterion-keyed. Theory-guided inventories are constructed around a theory of personality. Criterion-keyed inventories are based around questions that have been shown to statistically discriminate between a control group and a criterion group. Questionnaires typically use one of three formats: a Likert scale, true-false, or forced choice. True-false involves questions that the individual denotes as either being true or false about themselves. Forced-choice is a pair of statements that require the individual to choose one as being most representative of themselves. Self-report inventories can have validity problems. Patients may exaggerate symptoms in order to make their situation seem worse, or they may under-report the severity or frequency of symptoms in order to minimize their problems. Another issue is the social desirability bias.
Personality Inventory for Children-2
Revised NEO Personality Inventory
State-Trait Anxiety Inventory
Semantic differential
[Fig. 1. Modern Japanese version of the Semantic Differential. The kanji characters in the background stand for "God" and "Wind" respectively, with the compound reading "Kamikaze". (Adapted from Dimensions of Meaning, Visual Statistics Illustrated at VisualStatistics.net.)]
MeSH: D012659 [1]
Semantic differential is a type of a rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept.
Semantic differential
Osgood's semantic differential was designed to measure the connotative meaning of concepts. The respondent is asked to choose where his or her position lies, on a scale between two bipolar adjectives (for example: "Adequate-Inadequate", "Good-Evil" or "Valuable-Worthless"). Semantic differentials can be used to describe not only persons, but also the connotative meaning of abstract concepts, a capacity used extensively in affect control theory.
Theoretical background
Nominalists and realists
Theoretical underpinnings of Charles E. Osgood's semantic differential have roots in the medieval controversy between the nominalists and realists. Nominalists asserted that only real things are entities and that abstractions from these entities, called universals, are mere words. The realists held that universals have an independent objective existence either in a realm of their own or in the mind of God. Osgood's theoretical work also bears affinity to linguistics and general semantics and relates to Korzybski's structural differential.
Use of adjectives
The development of this instrument provides an interesting insight into the border area between linguistics and psychology. People have been describing each other since they developed the ability to speak. Most adjectives can also be used as personality descriptors. The occurrence of thousands of adjectives in English is an attestation of the subtleties in descriptions of persons and their behavior available to speakers of English. Roget's Thesaurus is an early attempt to classify most adjectives into categories and was used within this context to reduce the number of adjectives to manageable subsets, suitable for factor analysis.
Usage
The semantic differential is today one of the most widely used scales in the measurement of attitudes. One of the reasons is the versatility of the items: the bipolar adjective pairs can be used for a wide variety of subjects, and as such the scale has been nicknamed "the ever ready battery" of the attitude researcher.[3]
Statistical properties
Five items, or five bipolar pairs of adjectives, have been shown to yield reliable findings which correlate highly with alternative measures of the same attitude.[4] The biggest problem with this scale is that the properties of the level of measurement are unknown.[5] The most statistically sound approach is to treat it as an ordinal scale, but it can be argued that the neutral response (i.e. the middle alternative on the scale) serves as an arbitrary zero point, and that the intervals between the scale values can be treated as equal, making it an interval scale. A detailed presentation of the development of the semantic differential is provided in the monumental book, Cross-Cultural Universals of Affective Meaning.[6] David R. Heise's Surveying Cultures[7] provides a contemporary update, with special attention to measurement issues when using computerized graphic rating scales.
Notes
[1] http://www.nlm.nih.gov/cgi/mesh/2011/MB_cgi?field=uid&term=D012659
[2] Himmelfarb (1993), p. 56
[3] Himmelfarb (1993), p. 57
[4] Osgood, Suci and Tannenbaum (1957)
[5] Himmelfarb (1993), p. 57
[6] Osgood, May, and Miron (1975)
[7] Heise (2010)
References
Heise, David R. (2010). Surveying Cultures: Discovering Shared Conceptions and Sentiments. Hoboken, NJ: Wiley.
Himmelfarb, S. (1993). The measurement of attitudes. In A. H. Eagly & S. Chaiken (Eds.), Psychology of Attitudes, 23–88. Thomson/Wadsworth.
Krus, D. J., & Ishigaki, Y. (1992). Kamikaze pilots: The Japanese and the American perspectives. Psychological Reports, 70, 599–602. (Request reprint.) (http://www.visualstatistics.net/Readings/Kamikaze Pilots/Kamikaze Pilots.html)
Osgood, C. E., May, W. H., and Miron, M. S. (1975). Cross-Cultural Universals of Affective Meaning. Urbana, IL: University of Illinois Press.
Osgood, C. E., Suci, G., & Tannenbaum, P. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.
Snider, J. G., and Osgood, C. E. (1969). Semantic Differential Technique: A Sourcebook. Chicago: Aldine.
External links
Osgood's Semantic Space (http://www.writing.ws/reference/history.htm)
On-line Semantic Differential (http://www.indiana.edu/~socpsy/papers/AttMeasure/attitude..htm)
Sequential probability ratio test

Theory
As in classical hypothesis testing, SPRT starts with a pair of hypotheses, say $H_0$ and $H_1$ for the null hypothesis and alternative hypothesis respectively. They must be specified as follows:

$H_0: p = p_0$
$H_1: p = p_1$

The next step is to calculate the cumulative sum of the log-likelihood ratio, $\log \Lambda_i$, as new data arrive:

$S_i = S_{i-1} + \log \Lambda_i$

The stopping rule is a simple thresholding scheme:

$a < S_i < b$: continue monitoring (critical inequality)
$S_i \geq b$: accept $H_1$
$S_i \leq a$: accept $H_0$

where the thresholds $a$ and $b$ ($a < 0 < b$) depend on the desired type I and type II errors, $\alpha$ and $\beta$. They may be chosen as follows:

$a \approx \log \frac{\beta}{1-\alpha}, \qquad b \approx \log \frac{1-\beta}{\alpha}$
In other words, $\alpha$ and $\beta$ must be decided beforehand in order to set the thresholds appropriately. The numerical value will depend on the application. The reason for using approximation signs is that, in the discrete case, the signal may cross the threshold between samples. Thus, depending on the penalty of making an error and the sampling frequency, one might set the thresholds more aggressively. Of course, the exact bounds may be used in the continuous case.
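As a worked example (the error rates are chosen here purely for illustration), setting $\alpha = \beta = 0.05$ and using natural logarithms gives

$a \approx \log \frac{0.05}{0.95} \approx -2.94, \qquad b \approx \log \frac{0.95}{0.05} \approx 2.94$

so monitoring continues while the cumulative log-likelihood ratio stays between roughly −2.94 and 2.94.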
Example
A textbook example is parameter estimation of a probability distribution function. Let us consider the exponential distribution:

$f_\theta(x) = \theta^{-1} e^{-x/\theta}, \qquad x, \theta > 0$

The hypotheses are $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$, with $\theta_1 > \theta_0$. Then the log-likelihood ratio for a single observation $x$ is

$\log \Lambda(x) = \log \frac{\theta_0}{\theta_1} + \left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) x$

so the cumulative sum after $n$ samples is

$S_n = n \log \frac{\theta_0}{\theta_1} + \left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) \sum_{i=1}^{n} x_i$

and the continue-sampling region $a < S_n < b$ can be rewritten as

$a + n \log \frac{\theta_1}{\theta_0} < \left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) \sum_{i=1}^{n} x_i < b + n \log \frac{\theta_1}{\theta_0}$

The thresholds are simply two parallel lines with slope $\log(\theta_1/\theta_0)$; sampling stops when the cumulative sum of the samples makes an excursion outside the continue-sampling region.
Applications
Manufacturing
The test is done on the proportion metric, and tests that a variable p is equal to one of two desired points, p1 or p2. The region between these two points is known as the indifference region (IR). For example, suppose you are performing a quality control study on a factory lot of widgets. Management would like the lot to have 3% or less defective widgets, but 1% or less is the ideal lot that would pass with flying colors. In this example, p1 = 0.01 and p2 = 0.03 and the region between them is the IR because management considers these lots to be marginal and is OK with them being classified either way. Widgets would be sampled one at a time from the lot (sequential analysis) until the test determines, within an acceptable error level, that the lot is ideal or should be rejected.
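A minimal sketch of how such a sequential lot test could be implemented (Python; Wald's thresholds and the Bernoulli log-likelihood increments are standard, while the function name, the defect rates taken from the example above, and the simulated lot are illustrative):

```python
import math
import random

def sprt_lot_test(samples, p1=0.01, p2=0.03, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p = p1 (ideal lot) vs H1: p = p2 (marginal lot).

    `samples` yields 1 for a defective widget, 0 for a good one.
    Returns "accept H0", "accept H1", or "undecided" if samples run out.
    """
    a = math.log(beta / (1 - alpha))   # lower threshold: accept H0
    b = math.log((1 - beta) / alpha)   # upper threshold: accept H1
    s = 0.0                            # cumulative log-likelihood ratio
    for x in samples:
        if x:   # defective widget pushes toward H1
            s += math.log(p2 / p1)
        else:   # good widget pushes toward H0
            s += math.log((1 - p2) / (1 - p1))
        if s >= b:
            return "accept H1"
        if s <= a:
            return "accept H0"
    return "undecided"

# Simulate sampling from a lot with a true 1% defect rate.
random.seed(0)
lot = (1 if random.random() < 0.01 else 0 for _ in range(100_000))
print(sprt_lot_test(lot))   # typically "accept H0" (the lot looks ideal)
```

Each defective widget moves the cumulative sum toward accepting H1 by log(p2/p1), each good widget toward accepting H0 by log((1 − p2)/(1 − p1)), and sampling stops as soon as either threshold is crossed.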
SESAMO
SESAMO (an acronym of Sexrelation Evaluation Schedule Assessment Monitoring) is an Italian standardised and validated psychometric questionnaire (see Tab. 1) that examines single and couple life, sexuality, and interpersonal and intimate relationships.[1]
Features
As with many other sexological tests, a female and a male version are available, and both are made up of three sections (see Tab. 2):
The first section contains items which investigate areas relating to previous aspects of sexuality; the subjects' social, environmental and personal features; health history; and their BMI (body mass index). After filling in this first section, each subject is sent to either the second or the third section depending on their affective-relational condition, defined as the single condition or the couple condition respectively.
The second section collects items whose research areas relate to present sexuality and motivational aspects. This section is intended for single people, i.e. people lacking a stable sexual-affective relationship with a partner.
The third section includes areas which investigate the subjects' present sexuality and relational aspects within the couple. This section is intended for the dyadic condition, i.e. a sexual-affective relationship that has been going on for at least six months.
Contents
The two versions (male/female) of the questionnaire and their subsections (single/couple) contain 135 items for single males and females, and 173 items for males and females with a partner, respectively. This method makes it possible to detect dysfunctional sexual and relational aspects in single people and in people with a partner, aiming at two main goals:
defining a psychosexual and social-affective profile as an "idiographic image" of the subject;[2]
putting forward hypotheses about the dysfunctional aspects in individual and couple sexuality and their causes.
Tab. 1. Cronbach's alpha for the SESAMO questionnaire:
          Single condition   Couple condition
Male      0.710              0.771
Female    0.696              0.700
Assessment
The assessment essentially aims at those areas concerning previous and present sexuality and, at the same time, it takes into consideration all those elements that, even indirectly, could have affected the development, expression and display of personality, affectivity and relationality (interpersonal and intimate relationships). The questionnaire takes into consideration the following areas (as shown in Tab. 2): social environmental data, psychosexual identity, sphere of pleasure (sex play, paraphilias), previous and present masturbation, previous sexual experiences, affective-relational condition, sexual intercourse, imaginative eroticism, contraception, and relational attitude; additional areas are intended only for subjects with a partner: couple interaction, communicativeness within the sexual sphere, roles within the couple, and extrarelational sexuality (i.e. sexuality outside the couple).
Tab. 2. Domains of the SESAMO questionnaire:
Section 1 (general part): social environmental data; body image; psychosexual identity; desire; sphere of pleasure (paraphilias); previous masturbation; previous sexual experiences; medical anamnesis; motivation and conflicts.
Section 2 (single condition): single situation; pleasure; sexual intercourse; present masturbation; imaginative eroticism; contraception; relational attitude.
Section 3 (couple condition): couple interaction; sexual intercourse; present masturbation; imaginative eroticism; communicativeness in the sexual sphere; roles within the couple; extrarelational sexuality; sexuality and pregnancy; contraception.
Methodology
The SESAMO_Win method includes software for administering the questionnaire and producing a multifactorial, multilevel evaluation report. The software analyses and decodes the answers, obtained through direct administration on the computer or entered into the computer from printed forms, and produces an anamnestic report on the subject's sexual and relational condition. Once the administration has been completed, the software does not allow the questionnaire or its report to be altered or manipulated. This is necessary for deontological reasons and, above all, to assure its validity in legal appraisals and screenings. The software produces one report for each questionnaire. Each report can be displayed on the computer monitor or printed out, either in full or in single parts.
Anamnestic report
The report is divided into nine parts:
1. Heading: contains the subject's identification data and some directions for using the information in the report properly (interpretations, inferences and indications provided by the report).
2. Personal data and household: displays a summary of personal data, BMI (body mass index), the starting and finishing times of the administration, the time required to fill in the questionnaire, the composition of the household, the present affective-relational condition, and off-the-cuff comments from the subject at the end of the administration.
3. Scoring diagram for each area: a diagram displays a comparative summary of the scores obtained by the subject in each area of analysis (a snapshot, as it were, of the subject's sexual-relational condition). The right side of the diagram (positive scores) indicates a hypothesis about the degree of discomfort/dysfunction in each area.
4. Critical traits: highlights the most relevant and significant features of the subject's condition and sexual-relational fields, offering hints to be pursued in prospective in-depth medical, psychological or psychiatric interviews.
5. Narrative report: recounts, in narrative detail, the subject's sexual-relational history through the explanations and comments he/she made while completing the questionnaire.
6. Further diagnostic and specialist examinations: gives brief indications about the focal points that need to be addressed and carefully considered, and suggests prospective specialist examinations and counselling.
7. Parameters for the items and subliminal indexes: displays, alongside the topic of each question, the indexes of subliminal factors measured on the subject and the significance of the answers chosen for each item: the go-back index (the subject returned to previous items through rethinking/rumination); the try-jump index (an attempt to skip or leave out the answer to an item); the significance index (or weight) of the answers chosen by the subject for each item; the latency time index for each item (measured for each answer); and the kinetic reaction index (the subject's emotional motility, measured for each item).
8. Scores for each area: displays a descriptive heading of the fields of investigation relative to the subject's affective-relational condition (single or couple); the number of omitted answers in each area (activated only when answers are entered into the computer from a paper questionnaire); the raw scores obtained by the subject in each area; and the Z scores (standard scores) for each area with their corresponding percentile ranks.
9. Completed questionnaire: displays all the answers chosen and entered into the computer by the subject while completing the questionnaire; besides serving as a documental record (official certificate), it can be used for personalised close examination and to retrieve the open answers the subject typed in.
Criticism
The disadvantages of this device are the time required to fill in the questionnaire (30-60 minutes) and the fact that the complete report can be produced only by the software. A reduced version of the questionnaire has fewer items and can be administered and scored by the paper-and-pencil method. A clinical study that used the brief version reports: "During follow-up each patient received the SESAMO test (Sexuality Evaluation Schedule Assessment Monitoring) in the standard clinical form, with the end point of tracking down the sexual, affective, and relationship profile of each Htx pts [3] [...]. The SESAMO questionnaire is based on topics relative to male and female sexuality in mates situation. Topics are grouped in two section: the first one collects data on former sexuality, health history, and social behavior; the second one looks at the mate's relationship to show any situation revealing sexual worries. The questionnaire gives values based on a survey of 648 people with characteristics quite similar to the Italian population. The clinical test for mates is based on 81 items for males and 85 items for females. The row score for each topic is modified in standard scores. The exceeding of scores over a specified threshold gives concise information for diagnostic purpose".[4]
Notes
[1] Note: the test is available only to professional psychologists and physicians.
[2] In psychology, an "idiographic image" (Italian: immagine idiografica) is the representation of a study or piece of research whose subjects are specific cases, thus avoiding generalizations. The idiographic method (also called the historical method) is a criterion that involves evaluating past experiences, selecting and comparing information about a specific individual or event.
[3] Note: Htx pts = cardiotransplanted patients.
[4] Basile A. et al., Sexual Disorders After Heart Transplantation. Elsevier Science Inc., New York, Vol. 33, Issue 1, 2001.
Bibliography
Basile Fasolo C., Veglia F., Disturbi sessuali, in Conti L. (1999), Repertorio delle scale di valutazione in psichiatria, S.E.E. Edizioni Medico Scientifiche, Firenze. (http://www.pol-it.org/ital/scale/cap13-3.htm)
Boccadoro L., Carulli S. (2009), Il posto dell'amore negato. Sessualità e psicopatologie segrete (The place of the denied love. Sexuality and secret psychopathologies; abstract: http://www.sexology.it/abstract_english.html). Tecnoprint Editrice, Ancona. ISBN 978-88-95554-03-7
Boccadoro L. (2002), Sesamo_win: Sexrelation Evaluation Schedule Assessment Monitoring, Giunti O.S., Florence (Italy).
Boccadoro L. (1996), SESAMO: Sexuality Evaluation Schedule Assessment Monitoring. Approccio differenziale al profilo idiografico psicosessuale e socioaffettivo, Organizzazioni Speciali, Firenze. IT\ICCU\CFI\0327719 (http://www.giuntios.it/scheda_sesamo_eng.jsp)
Brunetti M., Olivetti Belardinelli M. et al., Hypothalamus, sexual arousal and psychosexual identity in human males: a functional magnetic resonance imaging study. European Journal of Neuroscience, Vol. 27, 11, 2008.
Calabrò R.S., Bramanti P. et al., Topiramate-induced erectile dysfunction. Epilepsy & Behavior, 14, 3, 2009.
Capodieci S. et al. (1999), SESAMO: una nuova metodica per l'assessment sessuorelazionale. In: Cociglio G. et al. (a cura di), La coppia, Franco Angeli, Milano. ISBN 88-464-1491-8
Dessì A., Conte S., Men as well have problems with their body image and with sex. A study on men suffering from eating disorders. Sexologies, 17, 1, 2008.
Dèttore D. (2001), Psicologia e psicopatologia del comportamento sessuale, McGraw-Hill, Milano. ISBN 88-386-2747-9
Ferretti A., Caulo M., Del Gratta C. et al., Dynamics of Male Sexual Arousal: Distinct Components of Brain Activation Revealed by fMRI. Neuroimage, 26, 4, 2005.
Natale V., Albertazzi P., Zini M., Di Micco R., Exploration of cyclical changes in memory and mood in postmenopausal women taking sequential combined oestrogen and progestogen preparations. British Journal of Obstetrics and Gynaecology, Vol. 108, 286-290, 2001.
Ugolini V., Baldassarri F., Valutazione della vita sessuorelazionale in uomini affetti da sterilità attraverso il SESAMO. In Rivista di Sessuologia, vol. 25, n. 4, 2001.
Vignati R. et al., Un nuovo test per l'indagine sessuale. In Journal of Sexological Sciences - Rivista di Scienze Sessuologiche, Vol. 11, n. 3, 1998.
Vignati R., La valutazione del disagio nell'approccio ai disturbi sessuorelazionali. PSYCHOMEDIA, 2010. http://www.psychomedia.it/pm/grpind/family/vignati.htm
Situational judgement test
Situational judgment tests (SJTs) or inventories (SJIs) are a type of psychological test that presents the test-taker with realistic, hypothetical scenarios and asks the individual to identify the most appropriate response or to rank the responses in the order they feel is most effective. SJTs can be presented through a variety of modalities, such as booklets, films, or audio recordings.[1] They represent a distinct psychometric approach from the common knowledge-based multiple-choice item and are often used in industrial-organizational psychology applications such as personnel selection. SJTs may measure behavioral tendencies, assessing how an individual would behave in a certain situation, or knowledge, evaluating the effectiveness of possible responses. Situational judgment tests can also reinforce the status quo within an organization. Unlike most psychological tests, SJTs are usually not acquired off-the-shelf but are designed as bespoke tools, tailor-made to suit the requirements of an individual role; this is because SJTs are not a type of test with respect to their content but a method of designing tests.
Validity
The validity of a situational judgment test depends on the types of questions asked: knowledge-instruction questions correlate more highly with general mental ability, while behavioral-tendency questions correlate more highly with personality. Key results from one study show that knowledge about interpersonal behavior measured with situational judgment tests predicted internship performance seven years later and job performance nine years later; students' knowledge of interpersonal behavior also showed incremental validity over cognitive factors in predicting academic and post-academic success. That study was the first to show evidence of the long-term predictive power of interpersonal skill assessed through situational judgment tests.[3] Scoring SJTs raises several problems. "Attempts to address this issue include expert-novice differences, where an item is scored in the direction favoring the experts after the average ratings of experts and novices on each item are compared; expert judgment, where a team of experts decides the best answer to each question; target scoring, where the test author determines the correct answer; and consensual scoring, where a score is allocated to each option according to the percentage of people choosing that option."[4]
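As an illustration of the last of these approaches, consensual scoring, here is a minimal Python sketch; the response data and option labels are hypothetical:

```python
from collections import Counter

def consensual_scores(responses):
    """Score each option by the proportion of respondents choosing it."""
    counts = Counter(responses)
    n = len(responses)
    return {option: count / n for option, count in counts.items()}

# Hypothetical item: 20 respondents choosing among options A-D.
responses = list("AABAACADABAAACABAADA")
option_scores = consensual_scores(responses)

# A respondent who picked "A" earns the consensus weight of "A", and so on.
print(option_scores)
```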
History
Situational judgment tests have been around for over fifty years. The first two documented examples were the How Supervise test and the Cardall Practical Judgment Test. In 1958, Bruce and Learner developed the Supervisory Practice Test, intended to identify whether supervisors could handle certain situations on the job; the test is said to have effectively distinguished who could and could not be a supervisor. Situational judgment tests were used during World War II by psychologists in the US military, and "in the 1950s and 60s, their use was extended to predict, as well as assess, managerial success."[5] However, SJTs did not see wide use in the employment field until the early 1990s. Today, SJTs are used in many organizations, are promoted by various consulting firms, and are the subject of ongoing research.
Multiple-choice Examples
Multiple-choice SJTs may be taken on paper or online; the online version offers advantages such as faster results and better quality. Whereas a traditional multiple-choice question has only one correct answer, a situational judgment test item often has several defensible answers, even though one may be preferred by the hiring organization. An example item:

You are the leader of a manufacturing team that works with heavy machinery. One of your production operators tells you that one machine in the work area is suddenly malfunctioning and may endanger the welfare of your work team. Rank the following possible courses of action from most desirable to least desirable:
1. Call a meeting of your team members to discuss the problem.
2. Report the problem to the Director of Safety.
3. Shut off the machine immediately.
4. Individually ask other production operators about problems with their machines.
5. Evacuate your team from the production facility.
Video-based Examples
Video-based SJTs consist of videos showing scenarios that an employee may face; example scenarios can be found on YouTube. Scenarios come in many styles, such as animated people and situations, or a recording of the company's boss asking the question. The answering process can also differ between tests:
* The correct answer may be given.
* The individual may be asked to give the most reasonable answer.
* The individual may be asked to explain what they would do in that situation.
Company Use
Companies using SJTs report the following anecdotal evidence in support of their use (note that these reports are not supported by peer-reviewed research):
* SJTs can highlight employee developmental needs.
* They are relatively easy and cost-effective to develop, administer and score.
* Applicant reactions to SJTs have been more favorable than to general mental ability tests.
Criticisms
* The scenarios in many SJTs tend to be brief, so candidates do not become fully immersed in the scenario; this can remove some of the intended realism and may reduce the quality and depth of assessment.
* SJT responses can be transparent, providing more an index of best-practice knowledge in some cases and therefore failing to differentiate between candidates' work-related performance.
* The response formats in some SJTs do not present a full enough range of responses to the scenario; candidates can be forced to select actions or responses that do not necessarily fit their behavior, which they can find frustrating and which can affect the validity of such measures.[10][11][12]
* Because of the adaptability of SJTs, arguments persist about whether they are a valid measurement of a particular construct (job knowledge) or a measurement method that can be applied to a variety of constructs, such as cognitive ability, conscientiousness, agreeableness, or emotional stability.[13]
* SJTs are best suited to assessing multiple constructs, and as such it is difficult to separate the constructs assessed in a given test; if one construct is of particular interest, a different measure may be more practical.[14]
* Due to the multi-dimensional nature of SJTs, it is problematic to assess their reliability through standard measures.[15]
Sample tests
Europa.eu SJT [16] (four questions with answers and scoring example)
Assessmentday.com SJT [17] (four questions)
Abilitus.com SJT [18] (free demo of situational judgement tests: five questions in English and French, with many practice tests aimed at EPSO competitions)
Practice business situational judgement test [19] (takes 30 minutes, with feedback)
Blog on situational judgement [20] (practice SJ tests on iPhone and iPad, samples, hints)
Demo test on situational judgement [21] (methodology, tests and corrected tests)
Notes
[4] http://eprints.usq.edu.au/787/1/Strahan_Fogarty_Machin_APS_Conference_proceedings.pdf
[5] http://eprints.usq.edu.au/787/
[6] Hoare, S., Day, A., & Smith, M. (1998). The development and evaluation of situations inventories. Selection & Development Review, 14(6), 3-8.
[7] Motowildo, S.J., Hanson, M.A., & Crafts, J.L. (1997). Low fidelity simulations. In D.L. Whetzel & G.R. Wheaton (Eds.), Applied Measurement in Industrial Psychology. Palo Alto, CA: Davies-Black.
[8] McDaniel, Michael & Nguyen, Nhung, "Situational Judgement Tests: A Review of Practice and Constructs Assessed" (http://www.people.vcu.edu/~mamcdani/Publications/McDaniel & Nguyen 2001 IJSA.pdf), Blackwell Publishers, Oxford, March/June 2001. Retrieved 17 October 2012.
[10] Chan, D., & Schmitt, N. (2005). An agenda for future research on applicants' reactions to selection procedures: A construct-orientated approach. International Journal of Selection and Assessment, 12, 9-23.
[11] Ployhart, R.E., & Harold, C.M. (2004). The applicant attribution-reaction theory (AART): An integrative approach of applicant attributional processing. International Journal of Selection & Assessment, 12, 84-98.
[12] Schmit, M.J., & Ryan, A.M. (1992). Test-taking dispositions: A missing link? Journal of Applied Psychology, 77, 629-637.
[13] McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A., & Braverman, E.P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730-740.
[14] McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A., & Braverman, E.P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730-740.
[15] McDaniel, M.A. & Whetzel, D.L. (2007). Situational Judgement Tests. In D.L. Whetzel & G.R. Wheaton (Eds.), Applied measurement: Industrial psychology in human resources management. Erlbaum. 235-258.
[16] http://europa.eu/epso/discover/prepa_test/sample_test/index_en.htm#chapter2/
[17] http://www.assessmentday.co.uk/situational-judgement-test/
[18] http://www.abilitus.com/
[19] https://www.surveymonkey.com/s/BusinessSituations
[20] http://situationaljudgement.blogspot.be/
[21] http://www.orseu-concours.com/en/run_test.php?test=demo
Psychometric software
Psychometric software is software that is used for psychometric analysis of data from tests, questionnaires, or inventories reflecting latent psychoeducational variables. While some psychometric analyses can be performed with standard statistical software like SPSS, most analyses require specialized tools.[citation needed]
Sources
Because only a few commercial businesses (most notably Assessment Systems Corporation and Scientific Software International) develop specialized psychometric tools, many free tools have been developed by researchers and educators. Important websites for free psychometric software include:
CASMA at the University of Iowa, USA [1]
REMP at the University of Massachusetts, USA [2]
Software from Brad Hanson [3]
Software from John Uebersax [4]
Software from J. Patrick Meyer [5]
Software directory at the Institute for Objective Measurement [4]
CITAS
CITAS (Classical Item and Test Analysis Spreadsheet) is a free Excel workbook designed to provide scoring and statistical analysis of classroom tests. Item responses (ABCD) and keys are typed or pasted into the workbook, and the output automatically populates; unlike other programs, CITAS does not require any "running" or experience in psychometric analysis, making it accessible to school teachers and professors. It is available for free download here [6] .
jMetrik
jMetrik [7] is free and open source software for conducting a comprehensive psychometric analysis. It was developed by J. Patrick Meyer at the University of Virginia. Current methods include classical item analysis, differential item functioning (DIF) analysis, confirmatory factor analysis, item response theory, IRT equating, and nonparametric item response theory. The item analysis includes proportion, point-biserial, and biserial statistics for all response options. Reliability coefficients include Cronbach's alpha, Guttman's lambda, the Feldt-Gilmer coefficient, the Feldt-Brennan coefficient, decision consistency indices, the conditional standard error of measurement, and reliability if item deleted. The DIF analysis is based on nonparametric item characteristic curves and the Mantel-Haenszel procedure; DIF effect sizes and ETS DIF classifications are included in the output. Confirmatory factor analysis is limited to the common factor model for congeneric, tau-equivalent, and parallel measures; fit statistics are reported along with factor loadings and error variances. IRT methods include the Rasch, partial credit, and rating scale models. IRT equating methods include the mean/mean, mean/sigma, Haebara, and Stocking-Lord procedures. jMetrik also includes basic descriptive statistics and a graphics facility that produces bar charts, pie charts, histograms, kernel density estimates, and line plots. jMetrik is a pure Java application that runs on 32-bit and 64-bit versions of Windows, Mac, and Linux operating systems and requires Java 1.6 on the host computer. It is available as a free download from www.ItemAnalysis.com [7].
Iteman
Iteman is a commercial program specifically designed for classical test analysis, producing rich text (RTF) reports with graphics, narratives, and embedded tables. It calculates the proportion correct and point-biserial of each item, as well as high/low subgroup proportions, and detailed graphics of item performance. It also calculates typical descriptive statistics, including the mean, standard deviation, reliability, and standard error of measurement, for each domain and for the overall test. It is only available from Assessment Systems Corporation [8].
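The two classical item statistics named here, proportion correct (difficulty) and point-biserial (discrimination), are easy to compute directly; the following is a generic Python sketch with hypothetical data, not Iteman's actual implementation:

```python
import numpy as np

def item_statistics(responses: np.ndarray):
    """Proportion-correct and point-biserial for a 0/1 response matrix
    (rows = examinees, columns = items), correlating each item with the
    rest score (total minus the item) to avoid inflating the correlation."""
    stats = []
    totals = responses.sum(axis=1)
    for j in range(responses.shape[1]):
        item = responses[:, j]
        p = item.mean()                       # difficulty: proportion correct
        rest = totals - item                  # total score excluding this item
        r_pb = np.corrcoef(item, rest)[0, 1]  # point-biserial discrimination
        stats.append((p, r_pb))
    return stats

# Hypothetical responses: 6 examinees, 4 dichotomous items.
data = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
])
for j, (p, r) in enumerate(item_statistics(data), start=1):
    print(f"Item {j}: p = {p:.2f}, point-biserial = {r:.2f}")
```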
Lertap
Lertap (Laboratory of Educational Research Test Analysis Program) is a comprehensive software package for classical test analysis developed for use with Microsoft Excel. It includes test, item, and option statistics, classification consistency and mastery test analysis, procedures for cheating detection, and extensive graphics (e.g., trace lines for item options, conditional standard errors of measurement, scree plots, boxplots of group differences, histograms, scatterplots). DIF, differential item functioning, is supported in the Excel 2007, Excel 2010, Excel 2011 (Macintosh), and Excel 2013 versions of Lertap. Mantel-Haenszel methods are used; graphs of results are provided.
Lertap will produce ASCII data files ready for input to Xcalibre and Bilog MG. Several sample datasets for use with Lertap and/or other item and test analysis programs are available [9]; these involve both cognitive tests and affective (or rating) scales. Technical papers related to the application of Lertap are also available [10]. Lertap was developed by Larry Nelson at Curtin University; commercial versions are available from Assessment Systems Corporation [11].
TAP
TAP (the Test Analysis Program) is a free program for basic classical analysis developed by Gordon Brooks at Ohio University. It is available here [12].
ViSta-CITA
ViSta-CITA (Classical Item and Test Analysis) is a module included in the Visual Statistics System (ViSta) that focuses on graphical-oriented methods applied to psychometric analysis. It is freely available at [13]. It was developed by Ruben Ledesma, J. Gabriel Molina, Pedro M. Valero-Mora, and Forrest W. Young.
BILOG-MG
BILOG-MG is a software program for IRT analysis of dichotomous (correct/incorrect) data, including fit and differential item functioning. It is commercial, and only available from Scientific Software International [14] or Assessment Systems Corporation [15].
Facets
Facets is a software program for Rasch analysis of rater- or judge-intermediated data, such as essay grades, diving competitions, satisfaction surveys and quality-of-life data. Other applications include rank-order data, binomial trials and Poisson counts. For availability, see Software directory at the Institute for Objective Measurement [4].
flexMIRT
flexMIRT is a new multilevel and multiple group IRT software package for item analysis and test scoring. This IRT software package fits a variety of unidimensional and multidimensional item response theory models (also known as item factor analysis models) to single-level and multilevel data in any number of groups. It is available from Vector Psychometric Group, LLC [16].
ICL
ICL (IRT Command Language) performs IRT calibrations, including the 1, 2, and 3 parameter logistic models as well as the partial credit model and generalized partial credit model. It can also generate response data. As the name implies, it is completely command code driven, with no graphical user interface. It is available for free download here [17].
jMetrik
jMetrik [7], the free and open source package described above under classical test analysis, also provides IRT methods (the Rasch, partial credit, and rating scale models) and IRT equating (the mean/mean, mean/sigma, Haebara, and Stocking-Lord procedures).
MULTILOG
MULTILOG is an extension of BILOG to data with polytomous (multiple) responses. It is commercial, and only available from Scientific Software International [14] or Assessment Systems Corporation [18].
PARSCALE
PARSCALE is a program designed specifically for polytomous IRT analysis. It is commercial, and only available from Scientific Software International [14] or Assessment Systems Corporation [19].
PARAM-3PL
PARAM-3PL [20] is a free program for the calibration of the 3-parameter logistic IRT model. It was developed by Lawrence Rudner at the Education Resources Information Center (ERIC). The latest release was version 0.89 in June 2007. It is available from ERIC here [21].
TESTFact
Testfact features [22]:
- Marginal maximum likelihood (MML) exploratory factor analysis and classical item analysis of binary data
- Computes tetrachoric correlations, principal factor solution, classical item descriptive statistics, fractile tables and plots
- Handles up to 10 factors using numerical quadrature: up to 5 for non-adaptive and up to 10 for adaptive quadrature
- Handles up to 15 factors using Monte Carlo integration techniques
- Varimax (orthogonal) and PROMAX (oblique) rotation of factor loadings
- Handles an important form of confirmatory factor analysis known as "bifactor" analysis: the factor pattern consists of one main factor plus group factors
- Simulation of responses to items based on user-specified parameters
- Correction for guessing and not-reached items
- Allows imposition of constraints on item parameter estimates
- Handles omitted and not-presented items
- Detailed online HELP documentation includes syntax and annotated examples
WINMIRA 2001
WINMIRA 2001 is a program for analyses with the Rasch model for dichotomous and polytomous ordinal responses, with latent class analysis, and with the mixture-distribution Rasch model for dichotomous[23] and polytomous item responses.[24] The software provides conditional maximum likelihood (CML) estimation of item parameters, MLE and WLE estimates of person parameters, person- and item-fit statistics, and information criteria (AIC, BIC, CAIC) for model selection. It also performs a parametric bootstrap procedure for selecting the number of mixture components. A free student version is available from Matthias von Davier's webpage at http://www.von-davier.com/ [25]; a commercial version is available through Assess.com at [26].
Winsteps
Winsteps is a program designed for analysis with the Rasch model, a one-parameter item response theory model. It differs from the 1PL model in that each individual in the person sample is parameterized for item estimation, and it is prescriptive and criterion-referenced rather than descriptive and norm-referenced in nature.[27] It is commercially available from Winsteps, Inc. [28]. A previous DOS-based version, BIGSTEPS, is also available.
Xcalibre
Xcalibre is a commercial program that performs marginal maximum likelihood estimation of both dichotomous (1PL-Rasch, 2PL, 3PL) and all major polytomous IRT models. The interface is point-and-click; no command code is required. Its output includes both spreadsheets and a detailed, narrated report document with embedded tables and figures, which can be printed and delivered to subject matter experts for item review. It is only available from Assessment Systems Corporation [29].
IATA
IATA is a software package for analysing psychometric and educational assessment data. The interface is point-and-click, and all functionality is delivered through wizard-style interfaces that are based on different workflows or analysis goals, such as pilot testing or equating. IATA reads and writes csv, Excel and SPSS file formats, and produces exportable graphics for all statistical analyses. Each analysis also includes heuristics suggesting appropriate interpretations of the numerical results. IATA performs factor analysis, (1PL-Rasch, 2PL, 3PL) scaling and calibration, differential item functioning (DIF) analysis, (basic) computer aided test development, equating, IRT-based standard setting, score conditioning, and plausible value generation. It is available for free from Polymetrika International [30].
eqboot
eqboot is an open source, syntax-based Java application developed by J. Patrick Meyer for conducting IRT equating and computing the bootstrap standard error of equating. The program runs on any 32- or 64-bit operating system with Java Runtime Environment (JRE) version 1.6 or higher installed. At present it only supports equating with binary items. eqboot computes equating constants using the mean/mean, mean/sigma, Haebara,[31] and Stocking-Lord[32] procedures. It will also compute the standard error of equating if the user provides a comma-delimited file of bootstrapped item parameter estimates from both forms, a comma-delimited file of bootstrapped ability estimates for Form X examinees, and a comma-delimited file of bootstrapped ability estimates for Form Y examinees. Options allow the user to specify the criterion function for the Haebara and Stocking-Lord methods.[33] In addition, the examinee distribution over which the criterion function is minimized may be set to the observed theta estimates, a histogram of theta estimates, a kernel density estimate of theta estimates, or uniformly spaced values on the theta scale. The software is a free download from www.ItemAnalysis.com [34].
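Of the linking procedures named here, mean/sigma is the simplest: the slope is the ratio of the standard deviations of the common-item difficulty estimates on the two forms, and the intercept follows from their means. A minimal Python sketch with hypothetical difficulty estimates (an illustration of the method, not eqboot's actual code):

```python
import statistics

def mean_sigma(b_x, b_y):
    """Mean/sigma linking constants from anchor-item difficulty estimates
    on the Form X and Form Y scales, so that theta_Y = A * theta_X + B."""
    A = statistics.stdev(b_y) / statistics.stdev(b_x)
    B = statistics.mean(b_y) - A * statistics.mean(b_x)
    return A, B

# Hypothetical difficulty estimates for five common (anchor) items.
form_x_b = [-1.20, -0.40, 0.10, 0.75, 1.40]
form_y_b = [-1.05, -0.25, 0.30, 0.95, 1.65]
A, B = mean_sigma(form_x_b, form_y_b)
print(f"A = {A:.3f}, B = {B:.3f}")
```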
IRTEQ
IRTEQ [35] is a freeware Windows GUI application that implements IRT scaling and equating developed by Kyung (Chris) T. Han. It implements IRT scaling/equating methods that are widely used with the Non-Equivalent Groups Anchor Test design: Mean/Mean,[36] Mean/Sigma,[37] Robust Mean/Sigma,[38] and TCC methods.[39][40] For TCC methods, IRTEQ provides the user with the option to choose various score distributions for incorporation into the loss function. IRTEQ supports various popular unidimensional IRT models: Logistic models for dichotomous responses (with 1, 2, or 3 parameters) and the Generalized Partial Credit Model (GPCM) (including Partial Credit Model (PCM), which is a special case of GPCM) and Graded Response Model (GRM) for polytomous responses. IRTEQ can also equate test scores on the scale of a test to the scale of another test using IRT true score equating.[41]
ResidPlots-2
ResidPlots-2 [42] is a free program for IRT graphical residual analysis. It was developed by Tie Liang, Kyung (Chris) T. Han, and Ronald K. Hambleton at the University of Massachusetts Amherst.
WinGen
WinGen [43] is a free Windows-based program that generates IRT parameters and item responses. It was developed by Kyung (Chris) T. Han at the University of Massachusetts Amherst.[44]
ST
ST [1] conducts item response theory (IRT) scale transformations for dichotomously scored tests.
POLYST
POLYST [1] conducts IRT scale transformations for dichotomously and polytomously scored tests.
STUIRT
STUIRT [1] conducts IRT scale transformations for mixed-format tests (tests that include some multiple choice items and some polytomous items).
Decision consistency
Decision consistency methods are applicable to criterion-referenced tests such as licensure exams and academic mastery testing.
Iteman
Iteman [8] provides an index of decision consistency as well as a classical estimate of the conditional standard error of measurement at the cutscore, which is often requested for accreditation of a testing program.
jMetrik
jMetrik [7] is free and open source software for conducting a comprehensive psychometric analysis. Detailed information is listed above. jMetrik includes Huynh's decision consistency estimates if cut-scores are provided in the item analysis.
Lertap
Lertap [45] calculates several statistics related to decision and classification consistency, including the Brennan-Kane dependability index, kappa, and an estimate of p(0) derived by using the Peng-Subkoviac adaptation of Huynh's method. More detailed information concerning Lertap is provided above, under 'Classical test theory'.
R
R is a programming environment designed for statistical computing and production of graphics. It is freely available at [46]. Basic R functionality can be extended through installing contributed 'packages', and a list of psychometric related packages may be found at [47].
SPSS
SPSS, originally called the Statistical Package for the Social Sciences, is a commercial general statistical analysis program where the data is presented in a spreadsheet layout and common analyses are menu driven.
S-Plus
S-Plus is a commercial analysis package based on the programming language S.
SAS
SAS is a commercially available package for statistical analysis and manipulation of data. It is also command-based.
References
[1] http://www.education.uiowa.edu/casma/computer_programs.htm
[2] http://www.umass.edu/remp/main_software.html
[3] http://www.b-a-h.com/
[4] http://john-uebersax.com/stat/papers.htm
[5] http://www.jMetrik.com
[6] http://www.assess.com/xcart/product.php?productid=522&cat=19&page=1
[7] http://www.ItemAnalysis.com/
[8] http://www.assess.com/xcart/product.php?productid=541
[9] http://lertap.curtin.edu.au/HTMLHelp/Lrtp59HTML/index.html
[10] http://lertap.curtin.edu.au/Documentation/Techdocs.htm
[11] http://assess.com/xcart/product.php?productid=235&cat=0&page=1
[12] http://oak.cats.ohiou.edu/~brooksg/tap.htm
[13] http://www.uv.es/visualstats/Book/DownloadBook.htm
[14] http://www.ssicentral.com/irt/index.html
[15] http://www.assess.com/xcart/product.php?productid=217
[16] http://flexmirt.vpgcentral.com/
[17] http://www.b-a-h.com/software/irt/icl/index.html
[18] http://www.assess.com/xcart/product.php?productid=244&cat=0&page=1
[19] http://www.assess.com/xcart/product.php?productid=248&cat=0&page=1
[20] http://echo.edres.org:8080/irt/param/
[21] http://edres.org/irt/param
[22] http://www.scienceplus.nl/catalog/testfact?vmcchk=1
[23] Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
[24] von Davier, M., & Rost, J. (1995). Polytomous mixed Rasch models. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 371-382). New York: Springer.
[25] http://www.von-davier.com/
[26] http://www.assess.com/xcart/product.php?productid=269&cat=0&page=1
[27] Rasch dichotomous model vs. One-parameter Logistic Model (http://www.rasch.org/rmt/rmt193h.htm). Rasch Measurement Transactions (http://www.rasch.org/rmt/), 2005, 19:3, p. 1032.
[28] http://www.winsteps.com/
[29] http://www.assess.com/xcart/product.php?productid=569
[30] http://www.polymetrika.org/IATA
[31] Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.
[32] Stocking, M.L., & Lord, F.M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
[33] Kim, S., & Kolen, M. J. (2007). Effects on scale linking of different definitions of criterion functions for the IRT characteristic curve methods. Journal of Educational and Behavioral Statistics, 32, 371-397.
[34] http://www.ItemAnalysis.com
[35] http://www.umass.edu/remp/software/irteq/
[36] Loyd & Hoover, 1980
[37] Marco, 1977
[38] Linn, Levine, Hastings, & Wardrop, 1981
[39] Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.
[40] Stocking, M.L., & Lord, F.M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
[41] Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
[42] http://www.umass.edu/remp/software/residplots/
[43] http://www.umass.edu/remp/software/wingen/
[44] Han, K. T. (2007). WinGen: Windows software that generates IRT parameters and item responses. Applied Psychological Measurement, 31, 457-459.
[45] http://www.lertap.curtin.edu.au/
[46] http://www.r-project.org/
[47] http://cran.r-project.org/web/views/Psychometrics.html
Spearman-Brown prediction formula

Calculation
The predicted reliability, \(\rho^*_{xx'}\), is estimated as:

\[
\rho^*_{xx'} = \frac{N\,\rho_{xx'}}{1 + (N-1)\,\rho_{xx'}}
\]

where \(N\) is the factor by which the test is lengthened and \(\rho_{xx'}\) is the reliability of the current test. The formula predicts the reliability of a new test composed by replicating the current test \(N\) times (or, equivalently, creating a test with \(N\) parallel forms of the current exam). Thus \(N = 2\) implies doubling the exam length by adding items with the same properties as those in the current exam. Values of \(N\) less than one may be used to predict the effect of shortening a test.
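The prediction is easy to compute directly; this minimal Python sketch applies the formula above (the function name and example values are illustrative):

```python
def spearman_brown(reliability: float, n: float) -> float:
    """Predicted reliability when a test is lengthened by a factor of n."""
    return n * reliability / (1 + (n - 1) * reliability)

# Doubling a test with reliability 0.70:
print(spearman_brown(0.70, 2))    # ~0.82
# Halving the same test (n < 1 predicts the effect of shortening):
print(spearman_brown(0.70, 0.5))  # ~0.54
```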
Citations
[2] Stanley, J. (1971). Reliability. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education.
[3] Wainer, H., & Thissen, D. (2001). True score theory: The traditional method. In H. Wainer & D. Thissen (Eds.), Test Scoring. Mahwah, NJ: Lawrence Erlbaum.
[4] Stanley, J. (1971). Reliability. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education.
References
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Standard-setting study
A standard-setting study is an official research study conducted by an organization that sponsors tests, in order to determine a cutscore for a test. To be legally defensible in the USA and to meet the Standards for Educational and Psychological Testing, a cutscore cannot be arbitrarily determined; it must be empirically justified. For example, the organization cannot merely decide that the cutscore will be 70% correct; instead, a study is conducted to determine what score best differentiates classifications of examinees, such as competent versus incompetent. Standard-setting studies are often performed using focus groups of 5-15 subject matter experts who represent key stakeholders for the test. For example, in setting cutscores for educational testing, the experts might be instructors familiar with the capabilities of the student population.
Item-centered studies
The Angoff approach is very widely used.[1] It requires the assembly of a group of subject matter experts, who are asked to evaluate each item and estimate the proportion of minimally competent examinees that would answer the item correctly. The ratings are averaged across raters for each item and then summed to obtain a panel-recommended raw cutscore, which represents the score the panel estimates a minimally competent candidate would obtain. This is, of course, subject to decision biases such as the overconfidence bias; calibration against other, more objective sources of data is preferable. The Bookmark method is another widely used item-centered approach: items in a test (or a subset of them) are ordered by difficulty, and each expert places a "bookmark" in the sequence at the location of the cutscore.[2][3]
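The arithmetic of the Angoff procedure is simple enough to show directly; the following Python sketch uses hypothetical judge ratings and is an illustration, not any organization's actual standard-setting code:

```python
import numpy as np

def angoff_cutscore(ratings: np.ndarray) -> float:
    """Panel-recommended raw cutscore: average each item's ratings
    across judges, then sum the item means."""
    return ratings.mean(axis=0).sum()

# Hypothetical ratings: 3 judges x 5 items, each entry the estimated
# proportion of minimally competent examinees answering correctly.
ratings = np.array([
    [0.6, 0.8, 0.5, 0.9, 0.7],
    [0.5, 0.7, 0.6, 0.8, 0.6],
    [0.7, 0.9, 0.4, 0.9, 0.8],
])
print(angoff_cutscore(ratings))  # raw cutscore out of 5 items
```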
Person-centered studies
Rather than evaluating the items that distinguish competent candidates, person-centered studies evaluate the examinees themselves. While this might seem more appropriate, it is often more difficult because examinees are not a captive population, as a list of items is. For example, when a new test covers new content (as often happens in information technology testing), it can be given to an initial group called a beta sample, along with a survey of professional characteristics. The testing organization can then analyze the relationship between the test scores and important statistics such as skills, education, and experience, and set the cutscore at the score that best differentiates between examinees characterized as "passing" and those characterized as "failing".
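One simple way to operationalize "the score that best differentiates" is to pick the candidate cutscore that maximizes classification accuracy against the expert pass/fail judgments. A minimal Python sketch with hypothetical beta-sample data (real studies typically use more sophisticated criteria, such as logistic regression):

```python
def best_cutscore(scores, passing):
    """Pick the cutscore that best separates examinees judged 'passing'
    (True) from 'failing' (False), by maximizing classification accuracy."""
    candidates = sorted(set(scores))
    def accuracy(cut):
        return sum((s >= cut) == p for s, p in zip(scores, passing)) / len(scores)
    return max(candidates, key=accuracy)

# Hypothetical beta sample: test scores with expert pass/fail judgments.
scores  = [42, 55, 61, 48, 70, 66, 39, 58, 74, 51]
passing = [False, False, True, False, True, True, False, True, True, False]
print(best_cutscore(scores, passing))
```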
References
[1] Zieky, M.J. (2001). So much has changed: how the setting of cutscores has evolved since the 1980s. In Cizek, G.J. (Ed.), Setting Performance Standards, p. 19-52. Mahwah, NJ: Lawrence Erlbaum Associates. [2] Lewis, D. M., Mitzel, H. C., Green, D. R. (June, 1996). Standard Setting: A Bookmark Approach. In D. R. Green (Chair), IRT-Based Standard-Setting Procedures Utilizing Behavioral Anchoring. Paper presented at the 1996 Council of Chief State School Officers National Conference on Large Scale Assessment, Phoenix, AZ. [3] Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2000). The Bookmark Procedure: Cognitive Perspectives on Standard Setting. Chapter in Setting Performance Standards: Concepts, Methods, and Perspectives (G. J. Cizek, ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Related standards
In 1974, the Joint Committee on Standards for Educational Evaluation was charged with the responsibility of writing a companion volume to the 1974 revision of the Standards for Educational and Psychological Tests. [1] This companion volume was to deal with issues and standards for program and curriculum evaluation in education. In 1975, the Joint Committee began work and ultimately decided to establish three separate sets of standards. These standards include The Personnel Evaluation Standards, The Program Evaluation Standards, and The Student Evaluation Standards.
External links
The Standards for Educational and Psychological Testing [4] (apa.org) The Standards for Educational and Psychological Testing [5] (aera.net)
References
[1] http://en.wikipedia.org/wiki/Standards_for_Educational_and_Psychological_Testing#endnote_AERAnewsletter
[2] http://www.apa.org/science/standards.html#overview
[3] http://www.wmich.edu/evalctr/jc/AERADivisionHNewsletterSeptember1977.pdf
[4] http://www.apa.org/science/standards.html
[5] http://www.aera.net/publications/Default.aspx?menu_id=46&id=1407
Stanford-Binet Intelligence Scales
The development of the Stanford-Binet Intelligence Scales initiated the modern field of intelligence testing and was one of the first examples of an adaptive test. The test originated in France and was later revised in the United States. It began with the French psychologist Alfred Binet, whom the French government commissioned to develop a method of identifying intellectually challenged children for placement in special education programs. As Binet indicated, case studies might be more detailed and helpful, but the time required to test many people would be excessive. In 1916, at Stanford University, the psychologist Lewis Terman released a revised examination that became known as the "Stanford-Binet test".
Development
Alfred Binet and the physician Theodore Simon, a student of Binet,[2] collaborated in studying mental retardation in French school children. Between 1905 and 1908, their research at a boys' school in Grange-aux-Belles led to the Binet-Simon tests, which assessed attention, memory, and verbal skill. The test consisted of 30 items of varying difficulty, ranging from the ability to touch one's nose or ear when asked to the ability to draw designs from memory and to define abstract concepts.[2] Binet proposed that a child's intellectual ability increases with age. In June 1905, the test was published as the Binet-Simon Intelligence Test in L'Année Psychologique. In this essay, they described three methods that should be employed to study "inferior states of intelligence": the medical method (anatomical, physiological, and pathological signs of inferior intelligence), the pedagogical method (judging intelligence based on a sum of acquired knowledge), and the psychological method (making direct observations and measurements of intelligence). They claimed that the psychological method is the most direct because it measures intelligence as it is in the present moment, by assessing the capacity to judge, comprehend, reason, and invent.[3] The Binet-Simon test was considerably accurate at predicting a child's grades at school, and they found that intelligence influences how well a child performs at school.[4] The original tests in the 1905 form include:
1. "Le Regard"
2. Prehension Provoked by a Tactile Stimulus
3. Prehension Provoked by a Visual Perception
4. Recognition of Food
5. Quest of Food Complicated by a Slight Mechanical Difficulty
6. Execution of Simple Commands and Imitation of Simple Gestures
7. Verbal Knowledge of Objects
8. Verbal Knowledge of Pictures
9. Naming of Designated Objects
10. Immediate Comparison of Two Lines of Unequal Lengths
11. Repetition of Three Figures
12. Comparison of Two Weights
13. Suggestibility
14. Verbal Definition of Known Objects
15. Repetition of Sentences of Fifteen Words
16. Comparison of Known Objects from Memory
17. Exercise of Memory on Pictures
18. Drawing a Design from Memory
19. Immediate Repetition of Figures
20. Resemblances of Several Known Objects Given from Memory
21. Comparison of Lengths
22. Five Weights to be Placed in Order
23. Gap in Weights
24. Exercise upon Rhymes
25. Verbal Gaps to be Filled
26. Synthesis of Three Words in One Sentence
27. Reply to an Abstract Question
28. Reversal of the Hands of a Clock
29. Paper Cutting
30. Definitions of Abstract Terms

New forms of the test were published in 1908 and again in 1911, after extensive research using "normal" examinees in addition to examinees considered to have mental retardation. In 1912, William Stern created the concept of mental age (MA): an individual's level of mental development relative to others.[2] Binet placed a confidence interval around the scores returned from his tests, both because he thought intelligence was somewhat plastic and because of the inherent margin of error in psychometric tests.[5]

In 1916, the Stanford University psychologist Lewis Terman released the "Stanford Revision of the Binet-Simon Scale", the "Stanford-Binet" for short. He wrote The Measurement of Intelligence: An Explanation of and a Complete Guide for the Use of the Stanford Revision and Extension of the Binet-Simon Intelligence Scale, which provided English translations of the French items as well as new items. Despite other available translations, Terman is noted for his normative studies and methodological approach. The test soon became so popular that Robert Yerkes, the president of the American Psychological Association, decided to use it in developing the Army Alpha and the Army Beta tests to classify recruits:[citation needed] a high-scoring recruit might earn an A-grade (high officer material), whereas a low-scoring recruit with an E-grade would be rejected for military service.[5]

With one of his graduate students at Stanford University, Maud Merrill, Terman created two parallel forms of the Stanford-Binet: Form L (for Lewis) and Form M (for Maud). In the 1950s, Merrill revised the Stanford-Binet and created a new version that included what she considered the best test items from Forms L and M; this version was published in 1960 and renormed in 1973. In the 1960 revision, the ratio IQ was replaced with the deviation IQ, which compares a child's score with the scores obtained by other children of the same age; the deviation IQ was developed by David Wechsler.[6]

The fourth edition of the test, published in 1986, converted from Binet's age-scale format to a point-scale format. The age-scale format, originally designed to translate a child's performance into a mental age, was arguably inappropriate for more current generations of test-takers; the point scale arranged the tests into subtests, where all items of a type are administered together. The fifth edition reintroduces the age-scale format to provide a variety of items at each level and to keep examinees interested.

To test the validity of the Stanford-Binet Intelligence Scales, three methods were used:
1. Professional judgement by researchers and examiners of all test items
2. Professional judgement by experts in CHC theory
3. Empirical item analyses[7]

Construct validity was supported by analyses of age trends for each of the five factor scores (showing both growth and decline), by the intercorrelations of tests, factors, and IQs, and by evidence for general ability.[8]
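The contrast drawn above between the ratio IQ (mental age over chronological age) and the deviation IQ (standing relative to same-age peers) can be made concrete with a small Python sketch; the scale standard deviation of 15 used here is an assumption for illustration (some editions of the Stanford-Binet used 16):

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Classic ratio IQ: mental age over chronological age, times 100."""
    return 100 * mental_age / chronological_age

def deviation_iq(score: float, age_mean: float, age_sd: float,
                 scale_sd: float = 15) -> float:
    """Deviation IQ: the score's standing among same-age peers,
    rescaled to mean 100 (a scale SD of 15 is assumed here; some
    editions of the Stanford-Binet used 16)."""
    z = (score - age_mean) / age_sd
    return 100 + scale_sd * z

print(ratio_iq(10, 8))          # a mental age of 10 at age 8 -> 125.0
print(deviation_iq(34, 28, 4))  # 1.5 SD above the same-age mean -> 122.5
```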
Timeline
April 1905: Development of the Binet-Simon Test announced at a conference in Rome
June 1905: Binet-Simon Intelligence Test introduced
1908 and 1911: New versions of the Binet-Simon Intelligence Test
1916: Stanford-Binet First Edition by Terman
1937: Second Edition by Terman and Merrill
1973: Third Edition by Merrill
1986: Fourth Edition by Thorndike, Hagen, and Sattler
2003: Fifth Edition by Roid
Scale
Binet's intelligence scale was divided into categories based on IQ score. The original names, which included "moron," "imbecile," and "idiot," among others, are no longer used. These categories were later replaced with words that were more descriptive of a scale of intellectual deficiency, marked from mild to profound deficiency.[9]
Present use
Since the inception of the Stanford-Binet, it has been revised several times. Currently, the test is in its fifth edition, the Stanford-Binet Intelligence Scales, Fifth Edition (SB5). According to the publisher's website, "The SB5 was normed on a stratified random sample of 4,800 individuals that matches the 2000 U.S. Census." By administering the Stanford-Binet test to large numbers of individuals selected at random from different parts of the United States, it has been found that the scores approximate a normal distribution.

Successive revisions have substantially changed the way the test is presented, introducing parallel forms and more demonstrative standards. Notably, a non-verbal IQ component is now included, whereas earlier editions had only a verbal component; the present test balances verbal and non-verbal content equally. It is also more animated than earlier editions, providing test-takers with more colourful artwork, toys and manipulatives, which allows the test to cover a wider range of ages.

The test is useful in assessing the intellectual capabilities of people ranging from young children to young adults. However, it has been criticized for not permitting comparisons between people of different age categories, since each category receives a different set of tests; furthermore, very young children tend to do poorly on the test because they lack the concentration needed to finish it.[11]

Current uses of the test include clinical and neuropsychological assessment, educational placement, compensation evaluations, career assessment, adult neuropsychological treatment, forensics, and research on aptitude.[12] Various high-IQ societies also accept this test for admission into their ranks; for example, the Triple Nine Society accepts a minimum qualifying score of 151 for Form L or M, 149 for Form L-M if taken in 1986 or earlier, 149 for SB-IV, and 146 for SB-V; in all cases the applicant must have been at least 16 years old at the date of the test.[13]
References
[1] http://www.nlm.nih.gov/cgi/mesh/2011/MB_cgi?field=uid&term=D013195
[2] Santrock, John W. (2008). A Topical Approach to Life-Span Development (4th ed.), Concept of Intelligence (pp. 283-284). New York: McGraw-Hill.
[3] Binet, Alfred. (1905). L'Année Psychologique, 12, 191-244.
[4] Gilbert, D.T., Schacter, D.L., & Wegner, D.M. (2011). Psychology. New York, NY: Worth Publishers.
[5] Fancher, Raymond E. (1985). The Intelligence Men: Makers of the IQ Controversy. New York (NY): W. W. Norton.
[6] Carlson, N. R. (2010). Psychology, the Science of Behaviour (4th ed.). Upper Saddle River, New Jersey: Pearson Education, Inc. p. 336.
[7] Janzen, Henry, John Obrzut, and Christopher Marusiak. "Stanford-Binet Intelligence Scales." Canadian Journal of School Psychology, 19(1/2) (2004): 235-245.
[8] Janzen, Henry, John Obrzut, and Christopher Marusiak. "Stanford-Binet Intelligence Scales." Canadian Journal of School Psychology, 19(1/2) (2004): 235-245.
[10] "Stanford-Binet Intelligence Scales, Fifth Edition, Assessment Service Bulletin Number 1" (http://www.assess.nelson.com/pdf/sb5-asb1.pdf)
[11] http://www.minddisorders.com/Py-Z/Stanford-Binet-Intelligence-Scale.html
[12] Riverside Publishing. Stanford-Binet Intelligence Scales, SB5, Fifth Edition. Accessed on December 2, 2011. http://www.riversidepublishing.com/products/sb5/details.html
[13] http://www.triplenine.org/main/admission.asp
Further reading
Becker, K.A (2003). "History of the Stanford-Binet Intelligence scales: Content and psychometrics.". Stanford-Binet Intelligence Scales, Fifth Edition Assessment Service Bulletin No. 1. Binet, Alfred; Simon, Th. (1916). The development of intelligence in children: The BinetSimon Scale (http:// books.google.com/books?id=jEQSAAAAYAAJ&dq=The development of intelligence in children Binet& pg=PA1#v=onepage&q&f=false). Publications of the Training School at Vineland New Jersey Department of Research No. 11. E. S. Kite (Trans.). Baltimore: Williams & Wilkins. Retrieved 18 July 2010. Brown, A. L.; French, L. A. (1979). "The Zone of Potential Development: Implications for Intelligence Testing in the Year 2000". Intelligence 3 (3): 255273. Fancher, Raymond E. (1985). The Intelligence Men: Makers of the IQ Controversy. New York (NY): W. W. Norton. ISBN978-0-393-95525-5. Freides, D. (1972). "Review of StanfordBinet Intelligence Scale, Third Revision". In Oscar Buros. Seventh Mental Measurements Yearbook. Highland Park (NJ): Gryphon Press. pp.772773. Gould, Stephen Jay (1981). The Mismeasure of Man. New York (NY): W. W. Norton. ISBN978-0-393-31425-0. Lay summary (http://www.nytimes.com/books/97/11/09/home/gould-mismeasure.html) (10 July 2010). McNemar, Quinn (1942). The revision of the StanfordBinet Scale. Boston: Houghton Mifflin. Pinneau, Samuel R. (1961). Changes in Intelligence Quotient Infancy to Maturity: New Insights from the Berkeley Growth Study with Implications for the StanfordBinet Scales and Applications to Professional Practice. Boston: Houghton Mifflin.
Terman, Lewis Madison; Merrill, Maude A. (1937). Measuring Intelligence: A Guide to the Administration of the New Revised Stanford–Binet Tests of Intelligence. Riverside Textbooks in Education. Boston (MA): Houghton Mifflin.
Terman, Lewis Madison; Merrill, Maude A. (1960). Stanford–Binet Intelligence Scale: Manual for the Third Revision Form L-M with Revised IQ Tables by Samuel R. Pinneau. Boston (MA): Houghton Mifflin.
Richardson, Nancy (1992). "Stanford–Binet IV, of Course!: Time Marches On! (originally published as Which Stanford–Binet for the Brightest?)" (http://www.davidsongifted.org/db/Articles_id_10128.aspx). Roeper Review 15 (1): 32–34.
Waddell, Deborah D. (1980). "The Stanford–Binet: An Evaluation of the Technical Data Available since the 1972 Restandardization" (http://www.eric.ed.gov/ERICWebPortal/search/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=EJ233903&ERICExtSearch_SearchType_0=no&accno=EJ233903). Journal of School Psychology 18 (3): 203–209. doi:10.1016/0022-4405(80)90060-6 (http://dx.doi.org/10.1016/0022-4405(80)90060-6). Retrieved 29 June 2010.
Stanine
Stanine (STAndard NINE) is a method of scaling test scores on a nine-point standard scale with a mean of five and a standard deviation of two. The method was used in standardized school testing, but has now largely been replaced by other forms of measuring aptitude. Some web sources attribute stanines to the U.S. Army Air Forces during World War II, and the earliest known use of stanines was by the U.S. Army Air Forces in 1943.[citation needed] Psychometric legend has it that a nine-point scale was used because of the compactness of recording the score as a single digit, but Thorndike[1] claims that by reducing scores to just nine values, stanines "reduce the tendency to try to interpret small score differences" (p. 131).
Test scores are scaled to stanine scores using the following algorithm:
1. Rank results from lowest to highest.
2. Give the lowest 4% a stanine of 1, the next 7% a stanine of 2, etc., according to the following table:
Calculating Stanines
Stanine   Result ranking   Standard score
1         lowest 4%        below -1.75
2         next 7%          -1.75 to -1.25
3         next 12%         -1.25 to -0.75
4         next 17%         -0.75 to -0.25
5         middle 20%       -0.25 to +0.25
6         next 17%         +0.25 to +0.75
7         next 12%         +0.75 to +1.25
8         next 7%          +1.25 to +1.75
9         top 4%           above +1.75
The underlying basis for obtaining stanines is that a normal distribution is divided into nine intervals, each of which has a width of 0.5 standard deviations, excluding the first and last, which are just the remainder (the tails of the distribution). The mean lies at the centre of the fifth interval. Stanines can be used to convert any test score into a single-digit number. This was valuable when punched cards were the standard method of storing such information. However, because all stanines are integers, two scores in a single stanine are sometimes further apart than two scores in adjacent stanines, which reduces their value. Today stanines are mostly used in educational assessment.[citation needed] The University of Alberta in Edmonton, Canada used the stanine system until 2003, when it switched to a 4-point scale.[2] In the United States, the Educational Records Bureau (which administers the "ERBs") reports test scores as stanines and percentiles.
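The scaling algorithm above is easy to express in code. The following is a minimal sketch (the function name and sample scores are mine, for illustration): it ranks the scores and assigns stanines according to the cumulative percentages in the table.

from bisect import bisect_left

def stanines(scores):
    """Map each raw score to a stanine (1-9) by percentile rank."""
    # Cumulative upper bounds of the nine bands: 4%, 11%, 23%, 40%, 60%, ...
    bands = [0.04, 0.11, 0.23, 0.40, 0.60, 0.77, 0.89, 0.96, 1.00]
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    result = [0] * n
    for rank, i in enumerate(order):
        frac = (rank + 1) / n          # fraction of scores at or below this one
        result[i] = bisect_left(bands, frac) + 1
    return result

print(stanines([55, 62, 71, 71, 80, 93]))  # [3, 4, 5, 6, 7, 9]

Note that tied scores falling on a band boundary are split between adjacent stanines here; a production scaler would need an explicit tie-breaking rule.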
References
Ballew, Pat. "Origins of some arithmetic terms-2".[3] Retrieved Dec. 26, 2004.
Boydsten, Robert E. (February 27, 2000). "Winning My Wings".[4]
[1] Thorndike, R. L. (1982). Applied Psychometrics. Boston, MA: Houghton Mifflin.
[2] http://www.registrar.ualberta.ca/ro.cfm?id=184
[3] http://www.pballew.net/arithme3.html#stanine
[4] http://www.avca-sj.org/WINGS32.html
argument from ignorance. Unless a test with particularly high power is used, the idea of "accepting" the null hypothesis may be dangerous. Nonetheless the terminology is prevalent throughout statistics, where its meaning is well understood. Alternatively, if the testing procedure forces us to reject the null hypothesis (H0), we can accept the alternative hypothesis (H1) and conclude that the research hypothesis is supported by the data. This reflects the probabilistic basis of the procedure: we accept that another set of data could lead us to a different conclusion. The processes described here are perfectly adequate for computation, but they seriously neglect design-of-experiments considerations.[8][9] It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.
Interpretation
If the p-value is less than the required significance level (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the given level of significance. Rejection of the null hypothesis is a conclusion. This is like a "guilty" verdict in a criminal trial: the evidence is sufficient to reject innocence, thus proving guilt. We might accept the alternative hypothesis (and the research hypothesis). If the p-value is not less than the required significance level (equivalently, if the observed test statistic is outside the critical region), then the test has no result: the evidence is insufficient to support a conclusion. (This is like a jury that fails to reach a verdict.) The researcher typically gives extra consideration to those cases where the p-value is close to the significance level. In the Lady tasting tea example (above), Fisher required the Lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. He defined the critical region as that case alone. The region was defined by a probability (that the null hypothesis was correct) of less than 5%. Whether rejection of the null hypothesis truly justifies acceptance of the research hypothesis depends on the structure of the hypotheses. Rejecting the hypothesis that a large paw print originated from a bear does not immediately prove the existence of Bigfoot. Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance, which requires extra steps of logic.
Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992) in a review of the fundamental paper by Neyman and Pearson (1933) says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected
to do so in the foreseeable future". Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s). Other fields have favored the estimation of parameters (e.g., effect size).
Cautions
"If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed."[] This caution applies to hypothesis tests and alternatives to them. The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong. The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed including: The Clever Hans effect. A horse appeared to be capable of doing simple arithmetic. The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse. The Placebo effect. Pills with no medically active ingredients were remarkably effective. A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy. The book How to Lie with Statistics[10][11] is the most popular book on statistics ever published.[12] It does not much consider hypothesis testing, but its cautions are applicable, including: Many claims are made on the basis of samples too small to convince. If a report does not mention sample size, be doubtful. Hypothesis testing acts as a filter of statistical conclusions; only those results meeting a probability threshold are publishable. Economics also acts as a publication filter; only those results favorable to the author and funding source may be submitted for publication. The impact of filtering on publication is termed publication bias. A related problem is that of multiple testing (sometimes linked to data mining), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported. These are often dealt with by using multiplicity correction procedures that control the family wise error rate (FWER) or the false discovery rate (FDR). Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).
Examples
Analogy: Courtroom trial
A statistical test procedure is comparable to a criminal trial; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough incriminating evidence is the defendant convicted. At the start of the procedure, there are two hypotheses: H0, "the defendant is not guilty", and H1, "the defendant is guilty". The first is called the null hypothesis, and is for the time being accepted. The second is called the alternative (hypothesis); it is the hypothesis one hopes to support. The hypothesis of innocence is only rejected when an error is very unlikely, because one doesn't want to convict an innocent defendant. Such an error is called an error of the first kind (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, the error of the
second kind (acquitting a person who committed the crime) is often rather large.
                                      H0 is true (truly not guilty)    H1 is true (truly guilty)
Accept null hypothesis (acquittal)    Right decision                   Wrong decision (Type II error)
Reject null hypothesis (conviction)   Wrong decision (Type I error)    Right decision
A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.
When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider him so. But what about 12 hits, or 17 hits? What is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? It is obvious that with the choice c = 25 (i.e., we only accept clairvoyance when all cards are predicted correctly) we're more critical than with c = 10. In the first case almost no test subjects will be recognized to be clairvoyant; in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind: a false positive, or Type I error. With c = 25 the probability of such an error is

$$P(\text{reject } H_0 \mid H_0 \text{ valid}) = P(X = 25 \mid p = \tfrac{1}{4}) = \left(\tfrac{1}{4}\right)^{25} \approx 10^{-15},$$

and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times. Being less critical, with c = 10, gives

$$P(X \ge 10 \mid p = \tfrac{1}{4}) = \sum_{k=10}^{25} \binom{25}{k} \left(\tfrac{1}{4}\right)^{k} \left(\tfrac{3}{4}\right)^{25-k} \approx 0.07.$$

Thus, c = 10 yields a much greater probability of a false positive. Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus:

$$P(\text{reject } H_0 \mid H_0 \text{ valid}) = P(X \ge c \mid p = \tfrac{1}{4}) \le 0.01.$$

From all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select c = 13.
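These tail probabilities are plain binomial sums and can be checked directly. A minimal sketch, assuming the setup of this example (25 guesses, each correct with probability 1/4 under the null hypothesis):

from math import comb

def binom_tail(c, n=25, p=0.25):
    """P(X >= c) for X ~ Binomial(n, p): the Type I error rate of cutoff c."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

print(binom_tail(25))   # ~8.9e-16: essentially never by chance alone
print(binom_tail(10))   # ~0.071:  about a 7% false-positive rate

# Smallest cutoff whose Type I error rate is below 1%:
c = next(c for c in range(26) if binom_tail(c) < 0.01)
print(c, binom_tail(c))  # 13, ~0.0034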
The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.
Definition of terms
The following definitions are mainly based on the exposition in the book by Lehmann and Romano:
Statistical hypothesis: A statement about the parameters describing a population (not a sample).
Statistic: A value calculated from a sample, often to summarize the sample for comparison purposes.
Simple hypothesis: Any hypothesis which specifies the population distribution completely.
Composite hypothesis: Any hypothesis which does not specify the population distribution completely.
Null hypothesis (H0): A simple hypothesis associated with a contradiction to a theory one would like to prove.
Alternative hypothesis (H1): A hypothesis (often composite) associated with a theory one would like to prove.
Statistical test: A procedure whose inputs are samples and whose result is a hypothesis.
Region of acceptance: The set of values of the test statistic for which we fail to reject the null hypothesis.
Region of rejection / Critical region: The set of values of the test statistic for which the null hypothesis is rejected.
Critical value: The threshold value delimiting the regions of acceptance and rejection for the test statistic.
Power of a test (1 − β): The test's probability of correctly rejecting the null hypothesis; the complement of the false negative rate, β. Power is termed sensitivity in biostatistics. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
Size / Significance level of a test (α): For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis: the false positive rate. For composite hypotheses it is the upper bound of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate, 1 − α, is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
p-value:
The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.
Statistical significance test: A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence, or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used to describe the modern version, which is now part of statistical hypothesis testing.
Conservative test: A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Exact test: A test in which the significance level or critical value can be computed exactly, i.e., without any approximation. In some contexts this term is restricted to tests applied to categorical data and to permutation tests, in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.
A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:
Most powerful test: For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
Uniformly most powerful test (UMP): A test with the greatest power for all values of the parameter(s) being tested, contained in the alternative hypothesis.
Chi-squared tests for variance are used to determine whether a normal population has a specified variance; the null hypothesis is that it does. Chi-squared tests of independence are used for deciding whether two variables are associated or are independent. The variables are categorical rather than numeric. Such a test can be used to decide whether left-handedness is correlated with libertarian politics (or not); the null hypothesis is that the variables are independent. The numbers used in the calculation are the observed and expected frequencies of occurrence (from contingency tables). Chi-squared goodness-of-fit tests are used to determine the adequacy of curves fit to data; the null hypothesis is that the curve fit is adequate. It is common to determine curve shapes to minimize the mean square error, so it is appropriate that the goodness-of-fit calculation sums the squared errors. F-tests (analysis of variance, ANOVA) are commonly used when deciding whether groupings of data by category are meaningful. If the variance of test scores of the left-handed in a class is much smaller than the variance of the whole class, then it may be useful to study lefties as a group. The null hypothesis is that two variances are the same, so the proposed grouping is not meaningful. In the table below, the symbols used are defined at the bottom of the table. Many other tests can be found in other articles.
One-sample z-test
Formula: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
Assumptions: (Normal population or n > 30) and σ known. (z is the distance from the mean in relation to the standard deviation of the mean.) For non-normal distributions it is possible to calculate a minimum proportion of a population that falls within k standard deviations for any k (see: Chebyshev's inequality).

Two-sample z-test
Formula: $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
Assumptions: Normal populations and independent observations, and σ1 and σ2 are known.

One-sample t-test
Formula: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$, df = n − 1
Assumptions: (Normal population or n > 30) and σ unknown.

Paired t-test
Formula: $t = \dfrac{\bar{d} - d_0}{s_d/\sqrt{n}}$, df = n − 1
Assumptions: (Normal population of differences or n > 30) and σ unknown or small sample size.

Two-sample unpooled t-test, unequal variances[14]
Formula: $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$, with the Welch–Satterthwaite approximation for the degrees of freedom.
Assumptions: (Normal populations or n1 + n2 > 40) and independent observations, and σ1 ≠ σ2, both unknown.

One-proportion z-test
Formula: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
Assumptions: n·p0 > 10 and n(1 − p0) > 10, and it is an SRS (simple random sample); see notes.

Two-proportion z-test, pooled
Formula: $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ with $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$
Assumptions: n1·p1 > 5 and n1(1 − p1) > 5 and n2·p2 > 5 and n2(1 − p2) > 5 and independent observations; see notes.

Two-proportion z-test, unpooled
Formula: $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$
Assumptions: n1·p1 > 5 and n1(1 − p1) > 5 and n2·p2 > 5 and n2(1 − p2) > 5 and independent observations; see notes.

Chi-squared test for variance
Formula: $\chi^2 = (n - 1)\dfrac{s^2}{\sigma_0^2}$
Assumptions: Normal population.

Chi-squared test for goodness of fit
Formula: $\chi^2 = \sum_k \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$
Assumptions: df = k − 1 − (number of parameters estimated), and one of these must hold: all expected counts are at least 5;[15] or all expected counts are > 1 and no more than 20% of expected counts are less than 5.[16]

In general, the subscript 0 indicates a value taken from the null hypothesis, H0, which should be used as much as possible in constructing its test statistic. Definitions of other symbols:
α = the probability of Type I error (rejecting a null hypothesis when it is in fact true)
n = sample size; n1 = sample 1 size; n2 = sample 2 size
x̄ = sample mean; μ0 = hypothesized population mean; μ1 = population 1 mean; μ2 = population 2 mean
σ = population standard deviation; σ² = population variance
s = sample standard deviation; s² = sample variance; s1, s2 = sample 1 and sample 2 standard deviations
d̄ = sample mean of differences; d0 = hypothesized population mean difference; s_d = standard deviation of differences
t = t statistic; df = degrees of freedom; Σ = sum (of k numbers)
p̂ = x/n = sample proportion, unless specified otherwise; p0 = hypothesized population proportion; p1 = proportion 1; p2 = proportion 2; d0 = hypothesized difference in proportion; min(n1, n2) = minimum of n1 and n2
χ² = chi-squared statistic; F = F statistic
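As an illustration of how one row of the table is used, here is a minimal sketch (data invented) of the two-sample unpooled t-test: the statistic above with the Welch–Satterthwaite degrees of freedom.

from math import sqrt
from statistics import mean, variance

def welch_t(x1, x2, d0=0.0):
    """Unpooled two-sample t statistic and approximate degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1), variance(x2)        # sample variances s1^2, s2^2
    se2 = v1 / n1 + v2 / n2                    # squared standard error
    t = (mean(x1) - mean(x2) - d0) / sqrt(se2)
    df = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
    return t, df

t, df = welch_t([20.1, 22.4, 19.8, 21.5], [18.2, 17.9, 19.1, 18.5, 18.0])
print(t, df)   # refer t to Student's t-distribution with df degrees of freedom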
role of models in statistical inference. Events intervened: Neyman accepted a position in the western hemisphere, breaking his partnership with Pearson and separating the disputants (who had occupied the same building) by much of the planetary diameter. World War II provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy.[22] Some of Neyman's later publications reported p-values and significance levels.[23] The modern version of hypothesis testing is a hybrid of the two approaches that resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s. (But signal detection, for example, still uses the Neyman–Pearson formulation.) Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs. This history explains the inconsistent terminology (for example: the null hypothesis is never accepted, but there is a region of acceptance). Sometime around 1940, in an apparent effort to provide researchers with a "non-controversial" way to have their cake and eat it too, the authors of statistical textbooks began anonymously combining these two strategies by using the p-value in place of the test statistic (or data) to test against the Neyman–Pearson "significance level". Thus, researchers were encouraged to infer the strength of their data against some null hypothesis using p-values, while also thinking they were retaining the post-data-collection objectivity provided by hypothesis testing. It then became customary for the null hypothesis, which was originally some realistic research hypothesis, to be used almost solely as a strawman "nil" hypothesis (one where a treatment has no effect, regardless of the context).[24] A comparison between the Fisherian and frequentist (Neyman–Pearson) approaches:
Fisher's null hypothesis testing
1. Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
2. Report the exact level of significance (e.g., p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses.
3. Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.

Neyman–Pearson decision theory
1. Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
2. If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Note that accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.
3. The usefulness of the procedure is limited, among others, to situations where you have a disjunction of hypotheses (e.g., either μ1 = 8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta.
Criticism
Criticism of statistical hypothesis testing fills volumes,[26] citing 300–400 primary references. Much of the criticism can be summarized by the following issues:
Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson, which are conceptually distinct.[27]
Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.[28]
Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias.[29]
Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused. "[I]t does not tell us what we want to know".[30] Lists of dozens of complaints are available. Critics and supporters are largely in factual agreement regarding the characteristics of NHST: while it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis. The continuing controversy concerns the selection of the best statistical practices for the near-term future given the (often poor) existing practices. Critics would prefer to ban NHST completely, forcing a complete departure from those practices, while supporters suggest a less absolute change. Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review,[31] medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias,[32] and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively.[33] Textbooks have added some cautions[34] and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Major organizations have not abandoned use of significance tests although some have discussed doing so.[31]
have never seen the publication of a literally replicated experiment in psychology. An indirect approach to replication is meta-analysis. Bayesian inference is one alternative to significance testing.[citation needed] For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available. Psychologist John K. Kruschke has suggested Bayesian estimation as an alternative to the t-test.[35] Alternatively, two competing models or hypotheses can be compared using Bayes factors.[36] Bayesian methods could be criticized for requiring information that is seldom available in the cases where significance testing is most heavily used.[citation needed] Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to objectively assess the probability that a hypothesis is true based on the data they have collected.[citation needed] Neither Fisher's significance testing nor Neyman–Pearson hypothesis testing can provide this information, and neither claims to. The probability that a hypothesis is true can only be derived from use of Bayes' theorem, which was unsatisfactory to both the Fisher and Neyman–Pearson camps due to the explicit use of subjectivity in the form of the prior probability.[37] Fisher's strategy is to sidestep this with the p-value (an objective index based on the data alone) followed by inductive inference, while Neyman and Pearson devised their approach of inductive behaviour.
Education
Statistics is increasingly being taught in schools, with hypothesis testing being one of the elements taught.[38][39] Many conclusions reported in the popular press (from political opinion polls to medical studies) are based on statistics. An informed public should understand the limitations of statistical conclusions,[40][41][citation needed] and many college fields of study require a course in statistics for the same reason.[40][41][citation needed] An introductory college statistics class places much emphasis on hypothesis testing, perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics, but a limited amount of development continues. The cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received, unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors.[42] While the problem was addressed more than a decade ago,[43] and calls for educational reform continue,[44] students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing.[45] Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics, and emphasizing the controversy in a generally dry subject.
References
[1] R. A. Fisher (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, p. 43.
[2] Schervish, M. (1996). Theory of Statistics, p. 218. Springer. ISBN 0-387-94546-6.
[4] Originally from Fisher's book Design of Experiments.
[6] Adèr, J. H. (2008). Chapter 12: Modelling. In H. J. Adèr & G. J. Mellenbergh (Eds.) (with contributions by D. J. Hand), Advising on Research Methods: A Consultant's Companion (pp. 183–209). Huizen, The Netherlands: Johannes van Kessel Publishing.
[12] "Over the last fifty years, How to Lie with Statistics has sold more copies than any other statistical text." J. M. Steele, "Darrell Huff and Fifty Years of How to Lie with Statistics" (http://www-stat.wharton.upenn.edu/~steele/Publications/PDF/TN148.pdf). Statistical Science, 20 (3), 2005, 205–209.
[14] NIST handbook: Two-Sample t-Test for Equal Means (http://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm)
[15] Steel, R. G. D., and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences. McGraw Hill, 1960, p. 350.
[17] NIST handbook: F-Test for Equality of Two Standard Deviations (http://www.itl.nist.gov/div898/handbook/eda/section3/eda359.htm) (testing standard deviations the same as testing variances)
Further reading
Lehmann, E. L. (1992). "Introduction to Neyman and Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses". In: Breakthroughs in Statistics, Volume 1 (Eds. Kotz, S., Johnson, N. L.), Springer-Verlag. ISBN 0-387-94037-5 (followed by reprinting of the paper).
Neyman, J.; Pearson, E. S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Phil. Trans. R. Soc., Series A 231 (694–706): 289–337. doi:10.1098/rsta.1933.0009 (http://dx.doi.org/10.1098/rsta.1933.0009).
External links
Hazewinkel, Michiel, ed. (2001), "Statistical hypotheses, verification of" (http://www.encyclopediaofmath.org/index.php?title=p/s087400), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4.
Wilson González, Georgina; Karpagam Sankaran (September 10, 1997). "Hypothesis Testing" (http://www.cee.vt.edu/ewr/environmental/teach/smprimer/hypotest/ht.html). Environmental Sampling & Monitoring Primer. Virginia Tech.
Bayesian critique of classical hypothesis testing (http://www.cs.ucsd.edu/users/goguen/courses/275f00/stat.html)
Critique of classical hypothesis testing highlighting long-standing qualms of statisticians (http://www.npwrc.usgs.gov/resource/methods/statsig/stathyp.htm)
Dallal, G. E. (2007). The Little Handbook of Statistical Practice (http://www.tufts.edu/~gdallal/LHSP.HTM) (a good tutorial)
References for arguments for and against hypothesis testing (http://core.ecu.edu/psyc/wuenschk/StatHelp/NHST-SHIT.htm)
Statistical Tests Overview (http://www.wiwi.uni-muenster.de/ioeb/en/organisation/pfaff/stat_overview_table.html): How to choose the correct statistical test
An Interactive Online Tool to Encourage Understanding Hypothesis Testing (http://wasser.heliohost.org/?l=en)
A non-mathematical way to understand Hypothesis Testing (http://simplifyingstats.com/data/HypothesisTesting.pdf)
Statistical inference
In statistics, statistical inference is the process of drawing conclusions from data that is subject to random variation, for example, observational errors or sampling variation.[1] More substantially, the terms statistical inference, statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation,[2] such as observational errors, random sampling, or random experimentation.[1] Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.
Introduction
Scope
For the most part, statistical inference makes propositions about populations, using data drawn from the population of interest via some form of random sampling. More generally, data about a random process is obtained from its observed behavior during a finite period of time. Given a parameter or hypothesis about which one wishes to make inference, statistical inference most often uses:
a statistical model of the random process that is supposed to generate the data, which is known when randomization has been used, and
a particular realization of the random process; i.e., a set of data.
The conclusion of a statistical inference is a statistical proposition.[citation needed] Some common forms of statistical proposition are:
an estimate; i.e., a particular value that best approximates some parameter of interest;
a confidence interval (or set estimate); i.e., an interval constructed using a dataset drawn from a population so that, under repeated sampling of such datasets, such intervals would contain the true parameter value with the probability at the stated confidence level;
a credible interval; i.e., a set of values containing, for example, 95% of posterior belief;
rejection of a hypothesis;[3]
clustering or classification of data points into groups.
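The confidence-interval proposition can be checked by simulation. In the sketch below (all numbers mine, for illustration), a large-sample 95% interval for a mean is built repeatedly, and the fraction of intervals covering the true mean is counted; this repeated-sampling coverage is the property the definition asserts.

import random
from statistics import mean, stdev

def ci95(sample):
    """Large-sample 95% normal-theory interval for the population mean."""
    m, se = mean(sample), stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se, m + 1.96 * se

random.seed(1)
true_mu, covered, trials = 10.0, 0, 2000
for _ in range(trials):
    sample = [random.gauss(true_mu, 3.0) for _ in range(50)]
    lo, hi = ci95(sample)
    covered += lo <= true_mu <= hi
print(covered / trials)   # close to 0.95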
Models/Assumptions
Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw inference.[4] Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.[5]
Degree of models/assumptions
Statisticians distinguish between three levels of modeling assumptions:
Fully parametric: The probability distributions describing the data-generation process are assumed to be fully described by a family of probability distributions involving only a finite number of unknown parameters.[4] For example, one may assume that the distribution of population values is truly Normal, with unknown mean and variance, and that datasets are generated by 'simple' random sampling. The family of generalized linear models is a widely used and flexible class of parametric models.
Non-parametric: The assumptions made about the process generating the data are much less than in parametric statistics and may be minimal.[6] For example, every continuous probability distribution has a median, which may be estimated using the sample median or the Hodges–Lehmann–Sen estimator, which has good properties when the data arise from simple random sampling (a sketch of this estimator follows below).
Semi-parametric: This term typically implies assumptions 'in between' fully and non-parametric approaches. For example, one may assume that a population distribution has a finite mean. Furthermore, one may assume that the mean response level in the population depends in a truly linear manner on some covariate (a parametric assumption) but not make any parametric assumption describing the variance around that mean (i.e., about the presence or possible form of any heteroscedasticity). More generally, semi-parametric models can often be separated into 'structural' and 'random variation' components. One component is treated parametrically and the other non-parametrically. The well-known Cox model is a set of semi-parametric assumptions.
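The Hodges–Lehmann estimator mentioned under the non-parametric heading is short enough to state in code. A minimal sketch (example data mine): it is the median of all pairwise averages (Walsh averages) of the observations.

from statistics import median

def hodges_lehmann(xs):
    """Median of the Walsh averages (x_i + x_j)/2 over all pairs i <= j."""
    walsh = [(xs[i] + xs[j]) / 2
             for i in range(len(xs))
             for j in range(i, len(xs))]
    return median(walsh)

print(hodges_lehmann([1.1, 2.3, 2.9, 3.2, 100.0]))  # 2.9, barely moved by the outlier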
Approximate distributions
Given the difficulty in specifying exact distributions of sample statistics, many methods have been developed for approximating these. With finite samples, approximation results measure how close a limiting distribution approaches the statistic's sample distribution: for example, with 10,000 independent samples the normal distribution approximates (to two digits of accuracy) the distribution of the sample mean for many population distributions, by the Berry–Esseen theorem.[10] Yet for many practical purposes, the normal approximation provides a good approximation to the sample mean's distribution when there are 10 (or more) independent samples, according to simulation studies and statisticians' experience.[10] Following Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify the error of approximation. In this approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback–Leibler distance, Bregman divergence, and the Hellinger distance.[11][12][13] With indefinitely large samples, limiting results like the central limit theorem describe the sample statistic's limiting distribution, if one exists. Limiting results are not statements about finite samples, and indeed are irrelevant to finite samples.[14][15][16] However, the asymptotic theory of limiting distributions is often invoked for work with finite samples. For example, limiting results are often invoked to justify the generalized method of moments and the use of generalized estimating equations, which are popular in econometrics and biostatistics. The magnitude of the difference between the limiting distribution and the true distribution (formally, the 'error' of the approximation) can be assessed using simulation.[17] The heuristic application of limiting results to finite samples is common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-parameter exponential families).
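The simulation check mentioned above is easy to reproduce. A minimal sketch (population and parameters mine): draw samples of size n = 10 from a skewed exponential population and ask how often the sample mean lands within one standard error of the population mean; for an exactly normal sampling distribution the answer would be about 0.683.

import random

random.seed(0)
n, trials, mu = 10, 100_000, 1.0       # Exponential(1): mean 1, sd 1
se = 1.0 / n ** 0.5                     # standard error of the sample mean
hits = 0
for _ in range(trials):
    m = sum(random.expovariate(1.0) for _ in range(n)) / n
    hits += abs(m - mu) <= se
print(hits / trials)   # roughly 0.69: close to normal despite the skewness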
Randomization-based models
For a given dataset that was produced by a randomization design, the randomization distribution of a statistic (under the null hypothesis) is defined by evaluating the test statistic for all of the plans that could have been generated by the randomization design (a Monte Carlo version of this idea is sketched below). In frequentist inference, randomization allows inferences to be based on the randomization distribution rather than a subjective model, and this is important especially in survey sampling and design of experiments.[18][19] Statistical inference from randomized studies is also more straightforward than many other situations.[20][21][22] In Bayesian inference, randomization is also of importance: in survey sampling, use of sampling without replacement ensures the exchangeability of the sample with the population; in randomized experiments, randomization warrants a missing at random assumption for covariate information.[23] Objective randomization allows properly inductive procedures.[24][25][26][27] Many statisticians prefer randomization-based analysis of data that was generated by well-defined randomization procedures.[28] (However, it is true that in fields of science with developed theoretical knowledge and experimental control, randomized experiments may increase the costs of experimentation without improving the quality of inferences.[29][30]) Similarly, results from randomized experiments are recommended by leading statistical authorities as allowing inferences with greater reliability than do observational studies of the same phenomena.[31] However, a good observational study may be better than a bad randomized experiment. The statistical analysis of a randomized experiment may be based on the randomization scheme stated in the experimental protocol and does not need a subjective model.[32][33] However, at any time, some hypotheses cannot be tested using objective statistical models, which accurately describe randomized experiments or random samples. In some cases, such randomized studies are uneconomical or unethical.
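A Monte Carlo version of the randomization distribution can be sketched as follows (treatment data invented for the example). Rather than enumerating every assignment the design could have produced, it samples random re-assignments and locates the observed difference in that distribution.

import random

def randomization_pvalue(treated, control, reps=10_000, seed=0):
    """Two-sided p-value from the randomization distribution of the
    difference in group means under random re-assignment of units."""
    rng = random.Random(seed)
    pooled = treated + control
    n_t = len(treated)
    observed = sum(treated) / n_t - sum(control) / len(control)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        t, c = pooled[:n_t], pooled[n_t:]
        extreme += abs(sum(t) / n_t - sum(c) / len(c)) >= abs(observed)
    return extreme / reps

print(randomization_pvalue([31, 30, 28, 33], [25, 24, 27, 26]))  # ~0.03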
Model-based analysis of randomized experiments
It is standard practice to refer to a statistical model, often a linear model, when analyzing data from randomized experiments. However, the randomization scheme guides the choice of a statistical model. It is not possible to choose an appropriate model without knowing the randomization scheme.[19] Seriously misleading results can be obtained by analyzing data from randomized experiments while ignoring the experimental protocol; common mistakes include forgetting the blocking used in an experiment and confusing repeated measurements on the same experimental unit with independent replicates of the treatment applied to different experimental units.[34]
Modes of inference
Different schools of statistical inference have become established. These schools (or 'paradigms') are not mutually exclusive, and methods which work well under one paradigm often have attractive interpretations under other paradigms. The two main paradigms in use are frequentist and Bayesian inference, which are both summarized below.
Frequentist inference
This paradigm calibrates the production of propositions by considering (notional) repeated sampling of datasets similar to the one at hand. By considering the dataset's characteristics under repeated sampling, the frequentist properties of any statistical inference procedure can be described, although in practice this quantification may be challenging.

Examples of frequentist inference: p-value; confidence interval.

Frequentist inference, objectivity, and decision theory
One interpretation of frequentist inference (or classical inference) is that it is applicable only in terms of frequency probability; that is, in terms of repeated sampling from a population. However, the approach of Neyman[35] develops these procedures in terms of pre-experiment probabilities. That is, before undertaking an experiment, one decides on a rule for coming to a conclusion such that the probability of being correct is controlled in a suitable way: such a probability need not have a frequentist or repeated sampling interpretation. In contrast, Bayesian inference works in terms of conditional probabilities (i.e., probabilities conditional on the observed data), compared to the marginal (but conditioned on unknown parameters) probabilities used in the frequentist approach. The frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions. However, some elements of frequentist statistics, such as statistical decision theory, do incorporate utility functions.[citation needed] In particular, frequentist developments of optimal inference (such as minimum-variance unbiased estimators, or uniformly most powerful testing) make use of loss functions, which play the role of (negative) utility functions. Loss functions need not be explicitly stated for statistical theorists to prove that a statistical procedure has an optimality property.[36] However, loss functions are often useful for stating optimality properties: for example, median-unbiased estimators are optimal under absolute-value loss functions, in that they minimize expected loss, and least-squares estimators are optimal under squared-error loss functions, in that they minimize expected loss. While statisticians using frequentist inference must choose for themselves the parameters of interest and the estimators/test statistic to be used, the absence of obviously explicit utilities and prior distributions has helped frequentist procedures to become widely viewed as 'objective'.[citation needed]
Bayesian inference
The Bayesian calculus describes degrees of belief using the 'language' of probability; beliefs are positive, integrate to one, and obey probability axioms. Bayesian inference uses the available posterior beliefs as the basis for making statistical propositions. There are several different justifications for using the Bayesian approach.

Examples of Bayesian inference: credible intervals for interval estimation; Bayes factors for model comparison. (A conjugate-model sketch follows below.)

Bayesian inference, subjectivity and decision theory
Many informal Bayesian inferences are based on "intuitively reasonable" summaries of the posterior. For example, the posterior mean, median and mode, highest posterior density intervals, and Bayes factors can all be motivated in this way. While a user's utility function need not be stated for this sort of inference, these summaries do all depend (to some extent) on stated prior beliefs, and are generally viewed as subjective conclusions. (Methods of prior construction which do not require external input have been proposed but not yet fully developed.) Formally, Bayesian inference is calibrated with reference to an explicitly stated utility, or loss function; the 'Bayes rule' is the one which maximizes expected utility, averaged over the posterior uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision-theoretic sense. Given assumptions, data and utility, Bayesian inference can be made for essentially any problem, although not every statistical inference need have a Bayesian interpretation. Analyses which are not formally Bayesian can be (logically) incoherent; a feature of Bayesian procedures which use proper priors (i.e., those integrable to one) is that they are guaranteed to be coherent. Some advocates of Bayesian inference assert that inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization of posterior beliefs.
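A minimal conjugate sketch of these ideas (prior, data and numbers mine, for illustration): a Beta prior on a proportion is updated by binomial data, and a central 95% credible interval is read off the posterior by simulation.

import random

random.seed(0)
a, b = 1, 1                         # Beta(1, 1): a uniform prior belief
successes, failures = 7, 3
a_post, b_post = a + successes, b + failures    # posterior: Beta(8, 4)

draws = sorted(random.betavariate(a_post, b_post) for _ in range(100_000))
lo, hi = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print(a_post / (a_post + b_post))   # posterior mean, 8/12 ~ 0.667
print((lo, hi))                     # roughly (0.39, 0.89): 95% credible interval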
Fiducial inference
Fiducial inference was an approach to statistical inference based on fiducial probability, also known as a "fiducial distribution". In subsequent work, this approach has been called ill-defined, extremely limited in applicability, and even fallacious.[43][44] However this argument is the same as that which shows[45] that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated the application of confidence intervals, it does not necessarily invalidate conclusions drawn from fiducial arguments.

Structural inference
Developing ideas of Fisher and of Pitman from 1938 to 1939,[46] George A. Barnard developed "structural inference" or "pivotal inference",[47] an approach using invariant probabilities on group families. Barnard reformulated the arguments behind fiducial inference on a restricted class of models on which "fiducial" procedures would be well-defined and useful.
Inference topics
The topics below are usually included in the area of statistical inference.
1. Statistical assumptions
2. Statistical decision theory
3. Estimation theory
4. Statistical hypothesis testing
5. Revising opinions in statistics
6. Design of experiments, the analysis of variance, and regression
7. Survey sampling
8. Summarizing statistical data
Notes
[1] Upton, G., Cook, I. (2008). Oxford Dictionary of Statistics. OUP. ISBN 978-0-19-954145-4.
[2] Dodge, Y. (2003). The Oxford Dictionary of Statistical Terms. OUP. ISBN 0-19-920613-9 (entry for "inferential statistics").
[3] According to Peirce, acceptance means that inquiry on this question ceases for the time being. In science, all scientific theories are revisable.
[4] Cox (2006), page 2.
[6] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. ISBN 0-521-78450-6 (page 341).
[8] Freedman, D. A. (2008). "Survival analysis: An epidemiological hazard?". The American Statistician 62: 110–119. (Reprinted as Chapter 11 (pages 169–192) of: Freedman, D. A. (2010). Statistical Models and Causal Inferences: A Dialogue with the Social Sciences (edited by David Collier, Jasjeet S. Sekhon, and Philip B. Stark). Cambridge University Press. ISBN 978-0-521-12390-7.)
[9] Berk, R. (2003). Regression Analysis: A Constructive Critique (Advanced Quantitative Techniques in the Social Sciences) (v. 11). Sage Publications. ISBN 0-7619-2904-5.
[10] Jørgen Hoffmann-Jørgensen's Probability With a View Towards Statistics, Volume I, page 399.
[11] Le Cam (1986).
[12] Erik Torgersen (1991). Comparison of Statistical Experiments, volume 36 of Encyclopedia of Mathematics. Cambridge University Press.
[14] Kolmogorov (1963a) (page 369): "The frequency concept, based on the notion of limiting frequency as the number of trials increases to infinity, does not contribute anything to substantiate the applicability of the results of probability theory to real practical problems where we have always to deal with a finite number of trials."
[15] "Indeed, limit theorems 'as $n$ tends to infinity' are logically devoid of content about what happens at any particular $n$. All they can do is suggest certain approaches whose performance must then be checked on the case at hand." Le Cam (1986) (page xiv).
[16] Pfanzagl (1994): "The crucial drawback of asymptotic theory: What we expect from asymptotic theory are results which hold approximately. What asymptotic theory has to offer are limit theorems." (page ix) "What counts for applications are approximations, not limits." (page 188)
[17] Pfanzagl (1994): "By taking a limit theorem as being approximately true for large sample sizes, we commit an error the size of which is unknown. [...] Realistic information about the remaining errors may be obtained by simulations." (page ix)
[18] Neyman, J. (1934). "On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection". Journal of the Royal Statistical Society 97 (4): 557–625.
[19] Hinkelmann and Kempthorne (2008).
[20] ASA Guidelines for a first course in statistics for non-statisticians (available at the ASA website).
[21] David A. Freedman et alia's Statistics.
[22] David S. Moore and George McCabe. Introduction to the Practice of Statistics.
[23] Gelman, Rubin. Bayesian Data Analysis.
[24] Peirce (1877–1878).
[25] Peirce (1883).
[26] David Freedman et alia Statistics and David A. Freedman Statistical Models.
[27] Rao, C. R. (1997). Statistics and Truth: Putting Chance to Work. World Scientific. ISBN 981-02-3111-3.
[28] Peirce, Freedman, Moore and McCabe.
[29] Box, G. E. P. and Friends (2006). Improving Almost Anything: Ideas and Essays, Revised Edition. Wiley. ISBN 978-0-471-72755-2.
[30] Cox (2006), page 196.
[31] ASA Guidelines for a first course in statistics for non-statisticians (available at the ASA website). David A. Freedman et alia's Statistics. David S. Moore and George McCabe. Introduction to the Practice of Statistics.
[32] Neyman, Jerzy. 1923 [1990]. "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9." Statistical Science 5 (4): 465–472. Trans. Dorota M. Dabrowska and Terence P. Speed.
[33] Hinkelmann & Kempthorne (2008).
[34] Hinkelmann and Kempthorne (2008), Chapter 6.
[35] Neyman, J. (1937). "Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability" (http://links.jstor.org/sici?sici=0080-4614(19370830)236:767<333:OOATOS>2.0.CO;2-6). Philosophical Transactions of the Royal Society of London A 236: 333–380.
[36] Preface to Pfanzagl.
[37] Soofi (2000).
[38] Hansen & Yu (2001).
[39] Hansen and Yu (2001), page 747.
[40] Rissanen (1989), page 84.
[41] Joseph F. Traub, G. W. Wasilkowski, and H. Wozniakowski (1988).
[42] Judin and Nemirovski.
[43] Neyman (1956).
[44] Zabell (1992).
[45] Cox (2006), page 66.
[46] Davison, page 12.
[47] Barnard, G. A. (1995). "Pivotal Models and the Fiducial Argument". International Statistical Review 63 (3): 309–323.
References
Bickel, Peter J.; Doksum, Kjell A. (2001). Mathematical Statistics: Basic and Selected Topics 1 (Second (updated printing 2007) ed.). Pearson Prentice-Hall. ISBN 0-13-850363-X. MR 443141 (http://www.ams.org/mathscinet-getitem?mr=443141).
Cox, D. R. (2006). Principles of Statistical Inference. CUP. ISBN 0-521-68567-2.
Fisher, Ronald (1955). "Statistical methods and scientific induction". Journal of the Royal Statistical Society, Series B 17: 69–78. (criticism of statistical theories of Jerzy Neyman and Abraham Wald)
Freedman, David A. (2009). Statistical Models: Theory and Practice (http://www.cambridge.org/catalogue/catalogue.asp?isbn=9780521743853) (revised ed.). Cambridge University Press. pp. xiv+442. ISBN 978-0-521-74385-3. MR 2489600 (http://www.ams.org/mathscinet-getitem?mr=2489600).
Hansen, Mark H.; Yu, Bin (June 2001). "Model Selection and the Principle of Minimum Description Length: Review paper". Journal of the American Statistical Association 96 (454): 746–774. doi:10.1198/016214501753168398 (http://dx.doi.org/10.1198/016214501753168398). JSTOR 2670311 (http://www.jstor.org/stable/2670311). MR 1939352 (http://www.ams.org/mathscinet-getitem?mr=1939352).
Hinkelmann, Klaus; Kempthorne, Oscar (2008). Introduction to Experimental Design (http://books.google.com/?id=T3wWj2kVYZgC&printsec=frontcover) (Second ed.). Wiley. ISBN 978-0-471-72756-9.
Kolmogorov, Andrei N. (1963a). "On Tables of Random Numbers". Sankhyā Ser. A 25: 369–375. MR 178484 (http://www.ams.org/mathscinet-getitem?mr=178484).
Kolmogorov, Andrei N. (1963b). "On Tables of Random Numbers". Theoretical Computer Science 207 (2): 387–395. doi:10.1016/S0304-3975(98)00075-9 (http://dx.doi.org/10.1016/S0304-3975(98)00075-9). MR 1643414 (http://www.ams.org/mathscinet-getitem?mr=1643414).
Le Cam, Lucien (1986). Asymptotic Methods of Statistical Decision Theory. Springer. ISBN 0-387-96307-3.
Neyman, Jerzy (1956). "Note on an Article by Sir Ronald Fisher". Journal of the Royal Statistical Society, Series B (Methodological) 18 (2): 288–294. JSTOR 2983716 (http://www.jstor.org/stable/2983716). (reply to Fisher 1955)
Peirce, C. S. (1877–1878), "Illustrations of the Logic of Science" (series), Popular Science Monthly, vols. 12–13. Relevant individual papers:
(1878 March), "The Doctrine of Chances", Popular Science Monthly, v. 12, March issue, pp. 604–615. Internet Archive Eprint (http://www.archive.org/stream/popscimonthly12yoummiss#page/612/mode/1up).
(1878 April), "The Probability of Induction", Popular Science Monthly, v. 12, pp. 705–718. Internet Archive Eprint (http://www.archive.org/stream/popscimonthly12yoummiss#page/715/mode/1up).
(1878 June), "The Order of Nature", Popular Science Monthly, v. 13, pp. 203–217. Internet Archive Eprint (http://www.archive.org/stream/popularsciencemo13newy#page/203/mode/1up).
(1878 August), "Deduction, Induction, and Hypothesis", Popular Science Monthly, v. 13, pp. 470–482. Internet Archive Eprint (http://www.archive.org/stream/popularsciencemo13newy#page/470/mode/1up).
Peirce, C. S. (1883), "A Theory of Probable Inference", Studies in Logic, pp. 126–181 (http://books.google.com/books?id=V7oIAAAAQAAJ&pg=PA126), Little, Brown, and Company. (Reprinted 1983, John Benjamins Publishing Company, ISBN 90-272-3271-7)
Pfanzagl, Johann; with the assistance of R. Hamböker (1994). Parametric Statistical Theory. Berlin: Walter de Gruyter. ISBN 3-11-013863-8. MR 1291393 (http://www.ams.org/mathscinet-getitem?mr=1291393).
Rissanen, Jorma (1989). Stochastic Complexity in Statistical Inquiry. Series in Computer Science 15. Singapore: World Scientific. ISBN 9971-5-0859-1. MR 1082556 (http://www.ams.org/mathscinet-getitem?mr=1082556).
Soofi, Ehsan S. (December 2000). "Principal Information-Theoretic Approaches (Vignettes for the Year 2000: Theory and Methods, ed. by George Casella)". Journal of the American Statistical Association 95 (452): 1349–1353. JSTOR 2669786 (http://www.jstor.org/stable/2669786). MR 1825292 (http://www.ams.org/mathscinet-getitem?mr=1825292).
Traub, Joseph F.; Wasilkowski, G. W.; Wozniakowski, H. (1988). Information-Based Complexity. Academic Press. ISBN 0-12-697545-0.
Zabell, S. L. (Aug. 1992). "R. A. Fisher and Fiducial Argument". Statistical Science 7 (3): 369–387. doi:10.1214/ss/1177011233 (http://dx.doi.org/10.1214/ss/1177011233). JSTOR 2246073 (http://www.jstor.org/stable/2246073).
Further reading
Casella, G., Berger, R.L. (2001). Statistical Inference. Duxbury Press. ISBN 0-534-24312-6
Freedman, David A. (1991). "Statistical Models and Shoe Leather". Sociological Methodology, vol. 21, pp. 291–313.
Freedman, David A. (2010). Statistical Models and Causal Inferences: A Dialogue with the Social Sciences. Edited by David Collier, Jasjeet S. Sekhon, and Philip B. Stark. Cambridge University Press.
Kruskal, William (December 1988). "Miracles and Statistics: The Casual Assumption of Independence (ASA Presidential address)". Journal of the American Statistical Association 83 (404): 929–940. JSTOR 2290117 (http://www.jstor.org/stable/2290117).
Lenhard, Johannes (2006). "Models and Statistical Inference: The Controversy between Fisher and Neyman–Pearson". British Journal for the Philosophy of Science, Vol. 57, Issue 1, pp. 69–91.
Lindley, D. (1958). "Fiducial distribution and Bayes' theorem". Journal of the Royal Statistical Society, Series B, 20, 10–27.
Sudderth, William D. (1994). "Coherent Inference and Prediction in Statistics", in Dag Prawitz, Bryan Skyrms, and Westerståhl (eds.), Logic, Methodology and Philosophy of Science IX: Proceedings of the Ninth International Congress of Logic, Methodology and Philosophy of Science, Uppsala, Sweden, August 7–14, 1991. Amsterdam: Elsevier.
Trusted, Jennifer (1979). The Logic of Scientific Inference: An Introduction. London: The Macmillan Press, Ltd.
Young, G.A., Smith, R.L. (2005). Essentials of Statistical Inference. CUP. ISBN 0-521-83971-8
External links
MIT OpenCourseWare (http://dspace.mit.edu/handle/1721.1/45587): Statistical Inference
Survey methodology
A field of applied statistics, survey methodology studies the sampling of individual units from a population and the associated survey data collection techniques, such as questionnaire construction and methods for improving the number and accuracy of responses to surveys. Statistical surveys are undertaken with a view towards making statistical inferences about the population being studied, and this depends strongly on the survey questions used. Polls about public opinion, public health surveys, market research surveys, government surveys and censuses are all examples of quantitative research that use contemporary survey methodology to answer questions about a population. Although censuses do not include a "sample", they do include other aspects of survey methodology, like questionnaires, interviewers, and nonresponse follow-up techniques. Surveys provide important information for all kinds of public information and research fields, e.g., marketing research, psychology, health, and sociology.[1]

A single survey is made of at least a sample (or the full population in the case of a census), a method of data collection (e.g., a questionnaire), and individual questions or items that become data that can be analyzed statistically. A single survey may focus on different types of topics such as preferences (e.g., for a presidential candidate), opinions (e.g., should abortion be legal?), behavior (smoking and alcohol use), or factual information (e.g., income), depending on its purpose. Since survey research is almost always based on a sample of the population, the success of the research depends on the representativeness of the sample with respect to a target population of interest to the researcher. That target population can range from the general population of a given country to specific groups of people within that country, to a membership list of a professional organization, or a list of students enrolled in a school system (see also sampling (statistics) and survey sampling).
Survey methodology as a scientific field seeks to identify principles about the sample design, data collection instruments, statistical adjustment of data, data processing, and final data analysis that can create systematic and random survey errors. Survey errors are sometimes analyzed in connection with survey cost. Cost constraints are sometimes framed as improving quality within a cost budget, or alternatively, reducing costs for a fixed level of quality. Survey methodology is both a scientific field and a profession, meaning that some professionals in the field focus on survey errors empirically while others design surveys to reduce them. For survey designers, the task involves making a large set of decisions about thousands of individual features of a survey in order to improve it.

The most important methodological challenges of a survey methodologist include making decisions on how to:
Identify and select potential sample members.
Contact sampled individuals and collect data from those who are hard to reach (or reluctant to respond).
Evaluate and test questions.
Select the mode for posing questions and collecting responses.
Train and supervise interviewers (if they are involved).
Check data files for accuracy and internal consistency.
Adjust survey estimates to correct for identified errors.
Selecting samples
Survey samples can be broadly divided into two types: probability samples and non-probability samples. Stratified sampling is a method of probability sampling in which sub-populations (strata) within an overall population are identified and sampled so that each is represented in the overall sample in a balanced way.
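As an illustration of the idea, the following is a minimal sketch of proportional stratified sampling in Python; the function name, the strata labels, and all numbers are illustrative rather than drawn from the text.

import random
from collections import defaultdict

def stratified_sample(units, strata, n, seed=0):
    # Draw roughly n units, allocating to each stratum in
    # proportion to its share of the population.
    rng = random.Random(seed)
    groups = defaultdict(list)
    for unit, stratum in zip(units, strata):
        groups[stratum].append(unit)
    total = len(units)
    sample = []
    for members in groups.values():
        k = max(1, round(n * len(members) / total))  # proportional allocation
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# 1,000 units in three strata; a sample of 100 keeps the proportions.
units = list(range(1000))
strata = ['north'] * 500 + ['south'] * 300 + ['east'] * 200
print(len(stratified_sample(units, strata, 100)))  # about 100

Allanced allocation of this kind guards against a small stratum being missed entirely by chance.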
Response formats
Usually, a survey consists of a number of questions that the respondent has to answer in a set format. A distinction is made between open-ended and closed-ended questions. An open-ended question asks the respondent to formulate his or her own answer, whereas a closed-ended question has the respondent pick an answer from a given number of options. The response options for a closed-ended question should be exhaustive and mutually exclusive. Four types of response scales for closed-ended questions are distinguished:
Dichotomous, where the respondent has two options.
Nominal-polytomous, where the respondent has more than two unordered options.
Ordinal-polytomous, where the respondent has more than two ordered options.
(Bounded) continuous, where the respondent is presented with a continuous scale.
A respondent's answer to an open-ended question can be coded into a response scale afterwards,[2] or analysed using more qualitative methods.
Advantages
They are relatively easy and inexpensive to administer for the simplest of designs. Simply administering a survey does not require a lot of technical expertise, if quality of the data is not a major concern.
If conducted remotely, they can reduce or obviate geographical dependence.
Useful in describing the characteristics of a large population, assuming the sampling is valid.
Can be administered remotely via the Web, mobile devices, mail, e-mail, telephone, etc.
Efficient at collecting information from a large number of respondents for a fixed cost compared to other methods.
Statistical techniques can be applied to the survey data to determine validity, reliability, and statistical significance, even when analyzing multiple variables.
Many questions can be asked about a given topic, giving considerable flexibility to the analysis.
Support both between- and within-subjects study designs.
A wide range of information can be collected (e.g., attitudes, values, beliefs, and behaviour).
Compared to qualitative interviewing, standardized survey questions provide all the participants with a standardized stimulus.
Disadvantages
The validity and reliability (i.e., variance and bias) of survey data may depend on the following:
Respondents' motivation, honesty, memory, and ability to respond. Respondents may not be fully aware of their reasons for any given action, making surveys weak methods for things that respondents cannot report consciously and accurately. Structured surveys, particularly those with closed-ended questions, may have low validity when researching affective variables.
Self-selection bias. Although the individuals chosen to participate in surveys are usually randomly sampled, errors due to nonresponse may exist (see also chapter 13 of Adèr et al. (2008) for more information on how to deal with nonresponse bias in survey estimates). That is, people who choose to respond to the survey may be different from those who do not respond, thus biasing the estimates. For example, people who are not at home regularly will be more difficult to contact than those who are at home a lot, and thus hard to reach with a face-to-face or telephone survey that uses only landline numbers.
The sampling frame. The overall inference is limited by the sampling frame chosen. For example, polls or surveys that are conducted by calling a random sample of publicly available telephone numbers will not include the responses of people with unlisted telephone numbers or only mobile (cell) phone numbers. Even random digit dial sampling frames of landlines have been shown to under-represent certain individuals (and their behaviors), specifically those who only have a cell phone.
Question and questionnaire design. Survey question answer-choices can lead to vague data sets because at times they are relative only to a personal abstract notion concerning "strength of choice". For instance, the choice "moderately agree" may mean different things to different subjects, and to anyone interpreting the data for correlation. Even 'yes' or 'no' answers are problematic because subjects may, for instance, put "no" if the choice "only once" is not available.
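The nonresponse mechanism described above can be made concrete with a toy simulation; the group sizes, response propensities, and opinion scores below are all invented for illustration.

import random

rng = random.Random(1)
# Two sub-groups: people who are home a lot (respond 90% of the time,
# mean opinion 6) and people who are rarely home (respond 30% of the
# time, mean opinion 4).
population = ([(rng.gauss(6, 1), 0.9) for _ in range(5000)] +
              [(rng.gauss(4, 1), 0.3) for _ in range(5000)])

true_mean = sum(v for v, _ in population) / len(population)
respondents = [v for v, p in population if rng.random() < p]
observed_mean = sum(respondents) / len(respondents)

print(round(true_mean, 2))      # about 5.0
print(round(observed_mean, 2))  # about 5.5: biased toward the easy-to-reach group

Because the propensity to respond is correlated with the answer itself, the respondent mean systematically overstates the population mean, and no increase in sample size corrects this.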
Nonresponse reduction
The following ways have been recommended for reducing nonresponse[4] in telephone and face-to-face surveys:[5]
Advance letter. A short letter is sent in advance to inform the sampled respondents about the upcoming survey. The style of the letter should be personalized but not overdone. First, it announces that a phone call will be made, or that an interviewer wants to make an appointment to do the survey face-to-face. Second, the research topic is described. Last, it allows both an expression of the surveyor's appreciation of cooperation and an opening to ask questions about the survey.
Training. The interviewers are thoroughly trained in how to ask respondents questions, how to work with computers, and how to make schedules for callbacks to respondents who were not reached.
Short introduction. The interviewer should always start with a short introduction about himself or herself, giving his or her name, the institute he or she is working for, the length of the interview, and the goal of the interview. It can also be useful to make clear that the interviewer is not selling anything: this has been shown to lead to a slightly higher response rate.[6]
Respondent-friendly survey questionnaire. The questions asked must be clear, non-offensive and easy to respond to for the subjects under study.
Brevity is also often cited as increasing response rate. A 1996 literature review found mixed evidence to support this claim for both written and verbal surveys, concluding that other factors may often be more important.[7] A 2010 study by SurveyMonkey looking at 100,000 of the online surveys they host found that the response rate dropped by about 3% at 10 questions and about 6% at 20 questions, with the dropoff slowing (for example, only a 10% reduction at 40 questions).[8] Other studies showed that the quality of responses degraded toward the end of long surveys.[9]
Interviewer effects
Survey methodologists have devoted much effort to determining the extent to which interviewee responses are affected by physical characteristics of the interviewer. The main interviewer traits that have been demonstrated to influence survey responses are race,[10] gender,[11] and relative body weight (BMI).[12] These interviewer effects are particularly operant when questions are related to the interviewer trait. Hence, race of interviewer has been shown to affect responses to measures regarding racial attitudes,[13] interviewer sex affects responses to questions involving gender issues,[14] and interviewer BMI affects answers to eating- and dieting-related questions.[15] While interviewer effects have been investigated mainly for face-to-face surveys, they have also been shown to exist for interview modes with no visual contact, such as telephone surveys and video-enhanced web surveys. The explanation typically provided for interviewer effects is social desirability: survey participants may attempt to project a positive self-image in an effort to conform to the norms they attribute to the interviewer asking the questions.
Notes
[1] http://whatisasurvey.info/
[2] Mellenbergh, G.J. (2008). Chapter 9: Surveys. In H.J. Adèr & G.J. Mellenbergh (Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A consultant's companion (pp. 183–209). Huizen, The Netherlands: Johannes van Kessel Publishing.
[3] Lynn, P. (2009) (Ed.) Methodology of Longitudinal Surveys. Wiley. ISBN 0-470-01871-2
[4] Lynn, P. (2008) "The problem of non-response", chapter 3, pp. 35–55, in International Handbook of Survey Methodology (eds. E. de Leeuw, J. Hox & D. Dillman). Erlbaum. ISBN 0-8058-5753-2
[5] Dillman, D.A. (1978) Mail and telephone surveys: The total design method. Wiley. ISBN 0-471-21555-4
[6] De Leeuw, E.D. (2001). "I am not selling anything: Experiments in telephone introductions". Kwantitatieve Methoden, 22, 41–48.
[9] http://www.research-live.com/news/news-headlines/respondent-engagement-and-survey-length-the-long-and-the-short-of-it/4002430.article
References
Abramson, J.J. and Abramson, Z.H. (1999). Survey Methods in Community Medicine: Epidemiological Research, Programme Evaluation, Clinical Trials (5th edition). London: Churchill Livingstone/Elsevier Health Sciences. ISBN 0-443-06163-7
Groves, R.M. (1989). Survey Errors and Survey Costs. Wiley. ISBN 0-471-61171-9
Ornstein, M.D. (1998). "Survey Research." Current Sociology 46(4): iii–136.
Shaughnessy, J. J., Zechmeister, E. B., & Zechmeister, J. S. (2006). Research Methods in Psychology (7th ed.). McGraw-Hill Higher Education. ISBN 0-07-111655-9 (pp. 143–192)
Adèr, H. J., Mellenbergh, G. J., & Hand, D. J. (2008). Advising on research methods: A consultant's companion. Huizen, The Netherlands: Johannes van Kessel Publishing.
Dillman, D.A. (1978). Mail and telephone surveys: The total design method. New York: Wiley. ISBN 0-471-21555-4
Further reading
Andres, Lesley (2012). Designing and Doing Survey Research (http://www.uk.sagepub.com/books/Book234957?siteId=sage-uk&prodTypes=any&q=andres). London: Sage.
Leung, Wai-Ching (2001). "Conducting a Survey" (http://archive.student.bmj.com/back_issues/0601/education/187.html), in Student BMJ (British Medical Journal, Student Edition), May 2001.
External links
Surveys (http://www.dmoz.org/Science/Social_Sciences/Methodology/Survey/) at the Open Directory Project
Nonprofit Research Collection on the Use of Surveys in Nonprofit Research (http://www.issuelab.org/closeup/Jan_2009/), published on IssueLab
Sten scores
The results for some scales of some psychometric instruments are returned as sten scores, sten being an abbreviation for 'Standard Ten' and thus closely related to stanine scores.
Definition
A sten score indicates an individual's approximate position (as a range of values) with respect to the population of values and, therefore, to other people in that population. The individual sten scores are defined by reference to a standard normal distribution. Unlike stanine scores, which have a midpoint of five, sten scores have no single middle score: the midpoint of the scale falls between stens 5 and 6, at 5.5. Like stanines, individual sten scores are demarcated by half standard deviations. Thus, a sten score of 5 includes all standard scores from −0.5 to zero and is centered at −0.25, and a sten score of 4 includes all standard scores from −1.0 to −0.5 and is centered at −0.75. A sten score of 1 includes all standard scores below −2.0. Sten scores of 6–10 "mirror" scores 5–1. The table below shows the standard scores that define each sten and the percentage of individuals drawn from a normal distribution that would receive it.

Sten  Standard scores    Percentage
1     below −2.0         2.3
2     −2.0 to −1.5       4.4
3     −1.5 to −1.0       9.2
4     −1.0 to −0.5       15.0
5     −0.5 to 0.0        19.1
6     0.0 to +0.5        19.1
7     +0.5 to +1.0       15.0
8     +1.0 to +1.5       9.2
9     +1.5 to +2.0       4.4
10    above +2.0         2.3
Sten scores (for the entire population of results) have a mean of 5.5 and a standard deviation of 2.[1]
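A minimal sketch of the resulting conversion, assuming the band definitions above (the function name is illustrative, and exact band-edge conventions vary between test publishers):

import math

def sten(z):
    # Half-standard-deviation bands centred on a mean of 5.5 with SD 2,
    # clipped so everything below -2.0 is a 1 and above +2.0 a 10.
    return min(10, max(1, math.floor(2 * z) + 6))

print(sten(-0.25))  # 5
print(sten(-0.75))  # 4
print(sten(-2.50))  # 1
print(sten(2.50))   # 10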
References
[1] McNab, D. et al. (2005). Career Values Scale: Manual & Users' Guide. Psychometrics Publishing.
[2] Russell, M.T., & Karol, D. (2002). The 16PF Fifth Edition administrator's manual. Champaign, IL: Institute for Personality and Ability Testing.
Structural equation modeling
Equivalent models
In SEM, many models are equivalent in the sense that they predict the same mean vector and covariance matrix. A "cleaner" model representation would be to model the mean and covariance matrix directly: a "Clean Normal Model" (CNM) is a model with a function for every entry of the covariance matrix and the mean vector. In terms of path diagrams, CNMs are the subset of SEMs that have only squares connected by double-headed edges. CNMs are not popular for at least two reasons:
CNMs are very difficult for human readers to interpret. Humans typically like to think of covariances as common sources or causations. Thinking in those terms helps people to build models, even though it also entails the danger of over-interpreting regressions as causations. Note that this mere re-representation has had notable successes: the IQ index, for example, although by definition not an entity that exists in the real world, is arguably among the most predictively useful constructs in psychology.
Latent variables help us to integrate variables that we propose exist but have not (yet) been measured. We could, for example, build a model in which ion flow in certain brain cells is a latent variable, and in that way make a prediction about the covariance of the ion flow with observable variables; if someone invents an instrument that allows the measurement of this flow in the specific region, the prediction can be falsified or confirmed.
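Model equivalence can be seen in miniature with two standardized variables: the models "X causes Y" and "Y causes X" imply the same covariance matrix, so they cannot be told apart by fit alone. The following is a small numerical sketch of that point; all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, b = 100_000, 0.6
x = rng.standard_normal(n)
y = b * x + np.sqrt(1 - b**2) * rng.standard_normal(n)  # both variances ~1

# Model A (X -> Y) and Model B (Y -> X), each fitted by least squares:
b_a = np.cov(x, y)[0, 1] / np.var(x)
b_b = np.cov(x, y)[0, 1] / np.var(y)
print(round(b_a, 2), round(b_b, 2))  # both about 0.6

# Both models reproduce the same implied covariance matrix [[1, b], [b, 1]],
# so no covariance-based fit statistic can distinguish between them.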
Model modification
The model may need to be modified in order to improve the fit, thereby estimating the most likely relationships between variables. Many programs provide modification indices which report the improvement in fit that results from adding an additional path to the model. Modifications that improve model fit are then flagged as potential changes that can be made to the model. In addition to improvements in model fit, it is important that the modifications also make theoretical sense.
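The logic behind a modification index can be sketched as a chi-square difference test between nested models; the fit values below are hypothetical, and the one-degree-of-freedom drop corresponds to freeing a single path.

from scipy.stats import chi2

chisq_base, df_base = 112.4, 35  # hypothetical model without the extra path
chisq_mod, df_mod = 101.1, 34    # hypothetical model with one path freed

delta, ddf = chisq_base - chisq_mod, df_base - df_mod
p_value = chi2.sf(delta, ddf)
print(round(delta, 1), ddf, round(p_value, 4))  # 11.3 1 0.0008

A large, significant drop suggests the freed path improves fit, but as noted above, a modification should only be adopted when it also makes theoretical sense.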
Advanced uses
Invariance
Multiple group modelling: a technique allowing joint estimation of multiple models, each with different sub-groups. Applications include behavior genetics and analysis of differences between groups (e.g., gender, cultures, test forms written in different languages, etc.).
Latent growth modeling
Hierarchical/multilevel models; item response theory models
Mixture model (latent class) SEM
Alternative estimation and testing techniques
Robust inference
Survey sampling analyses
Multi-method multi-trait models
Structural Equation Model Trees
SEM-specific software
Open source software: R has several contributed packages dealing with SEM.
Commercial packages: AMOS (in SPSS), Stata, SAS (software) procedures, Mplus.[5]
References
[3] Bollen, K. A., and Long, S. J. (1993). Testing Structural Equation Models. SAGE Focus Edition, vol. 154. ISBN 0-8039-4507-8
[5] http://www.statmodel.com/
Further reading
Bagozzi, R.; Yi, Y. (2012). "Specification, evaluation, and interpretation of structural equation models". Journal of the Academy of Marketing Science, 40 (1), 8–34. doi:10.1007/s11747-011-0278-x (http://dx.doi.org/10.1007/s11747-011-0278-x)
Bartholomew, D. J., and Knott, M. (1999). Latent Variable Models and Factor Analysis. Kendall's Library of Statistics, vol. 7. Arnold Publishers. ISBN 0-340-69243-X
Bentler, P.M. & Bonett, D.G. (1980). "Significance tests and goodness of fit in the analysis of covariance structures". Psychological Bulletin, 88, 588–606.
Bollen, K. A. (1989). Structural Equations with Latent Variables. Wiley. ISBN 0-471-01171-1
Byrne, B. M. (2001). Structural Equation Modeling with AMOS - Basic Concepts, Applications, and Programming. LEA. ISBN 0-8058-4104-0
Goldberger, A. S. (1972). "Structural equation models in the social sciences". Econometrica 40, 979–1001.
Haavelmo, T. (1943). "The statistical implications of a system of simultaneous equations". Econometrica 11, 1–12. Reprinted in D.F. Hendry and M.S. Morgan (Eds.), The Foundations of Econometric Analysis, Cambridge University Press, 477–490, 1995.
Hair, Joe F., G. Tomas M. Hult, Christian M. Ringle, and Marko Sarstedt (2013). A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM). Thousand Oaks: Sage. http://www.sagepub.com/books/Book237345
Hoyle, R. H. (ed.) (1995). Structural Equation Modeling: Concepts, Issues, and Applications. SAGE. ISBN 0-8039-5318-6
Kaplan, D. (2000). Structural Equation Modeling: Foundations and Extensions. SAGE, Advanced Quantitative Techniques in the Social Sciences series, vol. 10. ISBN 0-7619-1407-2
Kline, R. B. (2010). Principles and Practice of Structural Equation Modeling (3rd edition). The Guilford Press. ISBN 978-1-60623-877-6
Jöreskog, K.; F. Yang (1996). "Non-linear structural equation models: The Kenny-Judd model with interaction effects". In G. Marcoulides and R. Schumacker (eds.), Advanced structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage Publications.
External links
Ed Rigdon's Structural Equation Modeling Page (http://www2.gsu.edu/~mkteer/): people, software and sites
Structural equation modeling page under David Garson's StatNotes, NCSU (http://www2.chass.ncsu.edu/garson/pa765/structur.htm)
Issues and Opinion on Structural Equation Modeling (http://disc-nt.cba.uh.edu/chin/ais/), SEM in IS Research
The causal interpretation of structural equations (or SEM survival kit) by Judea Pearl 2000 (http://bayes.cs.ucla.edu/BOOK-2K/jw.html)
Structural Equation Modeling Reference List by Jason Newsom (http://www.upa.pdx.edu/IOA/newsom/semrefs.htm): journal articles and book chapters on structural equation models
PLS-SEM book (http://www.pls-sem.com/): online resources and additional information
Path Analysis in AFNI (http://afni.nimh.nih.gov/sscc/gangc/PathAna.html): the open source (GPL) AFNI (http://afni.nimh.nih.gov) package contains SEM code
Handbook of Management Scales (http://en.wikibooks.org/wiki/Handbook_of_Management_Scales), a collection of previously used multi-item scales to measure constructs for SEM
Lewis Terman
Born: January 15, 1877, Johnson County, Indiana
Died: December 21, 1956 (aged 79), Palo Alto, California
Nationality: American
Fields: Psychology
Institutions: Stanford University; Los Angeles Normal School
Alma mater: Clark University; Indiana University Bloomington; Central Normal College
Lewis Madison Terman (January 15, 1877 – December 21, 1956) was an American psychologist, noted as a pioneer in educational psychology in the early 20th century at the Stanford University School of Education. He is best known as the inventor of the Stanford-Binet IQ test and the initiator of the longitudinal study of children with high IQs called the Genetic Studies of Genius.[1] He was a prominent eugenicist and was a member of the Human Betterment Foundation. He also served as president of the American Psychological Association.
Biography
Terman received a B.S., B.Pd. (Bachelor of Pedagogy), and B.A. from Central Normal College in 1894 and 1898, and a B.A. and M.A. from Indiana University Bloomington in 1903. He received his Ph.D. from Clark University in 1905. He worked as a school principal in San Bernardino, California in 1905, and as a professor at Los Angeles Normal School in 1907.

In 1910 he joined the faculty of Stanford University as a professor of educational psychology at the invitation of Ellwood Patterson Cubberley and remained associated with the university until his death. He served as chairman of the psychology department from 1922 to 1945.

Terman published the Stanford Revision of the Binet-Simon Scale in 1916, and revisions were released in 1937 and 1960.[2] Original work on the test had been completed by Alfred Binet and Théodore Simon of France. Terman promoted his test, known colloquially as the "Stanford-Binet" test, as an aid for the classification of developmentally disabled children. Revisions of the Stanford-Binet are still used today as a general intelligence test for adults and children; the fifth revision of the test is currently in use.
The first mass administration of IQ testing was done with 1.7 million soldiers during World War I, when Terman served in a psychological testing role with the United States military. Terman worked with other applied psychologists to categorize army recruits. The recruits were given group intelligence tests which took about an hour to administer. Testing options included Army Alpha, a text-based test, and Army Beta, a picture-based test for nonreaders; 25% could not complete the Alpha test.[3] The examiners scored the tests on a scale ranging from "A" through "E". Recruits who earned scores of "A" would be trained as officers while those who earned scores of "D" and "E" would never receive officer training. The work of psychologists during the war proved to Americans that intelligence tests could have broader utility. After the war, Terman and his colleagues pressed for intelligence tests to be used in schools to improve the efficiency of growing American schools.

He also administered English-language tests to Spanish-speakers and unschooled African-Americans, concluding:

High-grade or border-line deficiency... is very, very common among Spanish-Indian and Mexican families of the Southwest and also among negroes. Their dullness seems to be racial, or at least inherent in the family stocks from which they come... Children of this group should be segregated into separate classes... They cannot master abstractions but they can often be made into efficient workers... from a eugenic point of view they constitute a grave problem because of their unusually prolific breeding (The Measurement of Intelligence, 1916, pp. 91-92).

Unlike Binet and Simon, whose goal was to identify less able school children in order to aid them with the care they required, Terman proposed using IQ tests to classify children and put them on the appropriate job-track. He believed IQ was inherited and was the strongest predictor of one's ultimate success in life. Terman adopted William Stern's suggestion that mental age divided by chronological age, times 100, be made the intelligence quotient or IQ; a worked sketch of this ratio appears at the end of this section. (NB: Most modern IQ tests calculate the intelligence quotient differently.)

In 1921, Terman initiated the Genetic Studies of Genius, a long-term study of gifted children. He found that gifted children did not fit the existing stereotypes often associated with them: they were not weak and sickly social misfits, but in fact were generally taller, in better health, better developed physically, and better adapted socially than other children. The children included in his studies were colloquially referred to as "Termites".[4]

Terman later joined the Human Betterment Foundation, a Pasadena-based eugenics group founded by E.S. Gosney in 1928 which had as part of its agenda the promotion and enforcement of compulsory sterilization laws in California.

Terman Middle School in Palo Alto, California is named after him and his son. His son Frederick Terman, as provost of Stanford University, greatly expanded the science, statistics and engineering departments, which helped catapult Stanford into the ranks of the world's first-class educational institutions and spurred the growth of Silicon Valley.
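A worked sketch of Stern's ratio IQ as described above; the function name and the ages are illustrative, and modern tests use deviation scores instead.

def ratio_iq(mental_age, chronological_age):
    # Stern's ratio: (mental age / chronological age) x 100.
    return 100.0 * mental_age / chronological_age

print(ratio_iq(12, 10))  # 120.0: a 10-year-old performing like a typical 12-year-old
print(ratio_iq(8, 10))   # 80.0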
Terman believed that heredity plays a large role in determining intelligence, but that nurture (the environment) is also important in fostering innate intellectual ability. By his own admission, there was nothing in his own ancestry that would have led anyone to predict him to have an intellectual career.[10]

With Binet's development of IQ tests, it became possible to quickly identify gifted children and study them from their early childhood into adulthood.[6] In his 1922 paper "A New Approach to the Study of Genius", Terman noted that this advancement in testing marked a change in research on geniuses and giftedness.[11] Previously, research had looked at genius adults and tried to look in retrospect at their early years of childhood. Through these studies on gifted children, Terman hoped to find how to properly educate a gifted child as well as to dispel the negative stereotypes that gifted children were "conceited, freakish, socially eccentric, and [insane]".[12]

Terman found his answers in his longitudinal study on gifted children, Genetic Studies of Genius, which had five volumes.[13] The children in this study were called "Termites".[7] The volumes reviewed the follow-ups that Terman conducted throughout their lives. The fifth volume was a 35-year follow-up, and looked at the gifted group during mid-life.[14]

The results from this study showed that gifted and genius children were in fact in good health and had normal personalities. Few of them demonstrated the previously held negative stereotypes of gifted children. Most of those in the study did well socially and academically and had lower divorce rates later in life.[7] Additionally, those in the gifted group were generally successful in their careers and had received awards recognizing their achievements. Though many of the Termites reached their potential in adulthood, some did not, perhaps because of personal obstacles, insufficient education, or lack of opportunity.[6]

Terman died before he completed the fifth volume of Genetic Studies of Genius, but Melita Oden, a colleague, completed the volume and published it.[14] Terman wished for the study to continue after his death, so he selected Robert Richardson Sears, one of the many successful participants in the study as well as a colleague of his, to continue the work.[7] The study is still supported by Stanford University and will continue until the last of the Termites withdraws from the study or dies.
Publications
The Measurement of Intelligence (1916)
The Use of Intelligence Tests (1916)
The Stanford Achievement Test (1923)
Genetic Studies of Genius (1925, 1947, 1959)
Autobiography of Lewis Terman (1930)
Recognition
Stanford University has an endowed professorship in his honor.
References
[1] Sears, R. R. (1957). L. M. Terman, pioneer in mental measurement. Science, 125, 978–979. doi:10.1126/science.125.3255.978
[2] http://www.infoplease.com/ce6/people/A0848220.html
[3] Teigen, En psykologihistorie, page 235
[5] (Vialle, 1994)
[6] Bernreuter, R. G., Miles, C.C., Tinker, M.A., & Young, K. (1942). Studies in personality. New York, NY: McGraw-Hill Book Company.
[7] Seagoe, M.V. (1975). Terman and the gifted. Los Altos, CA: William Kaufmann.
[8] Terman, L.M. (1906). Genius and stupidity: a study of some of the intellectual processes of seven 'bright' and seven 'stupid' boys. Pedagogical Seminary, 13, 307–373.
[9] (Terman, 1915)
[10] Terman, L.M. (1932). Autobiography. In C. Murchison (Ed.), A history of psychology, Vol. II (pp. 297–332). Worcester, MA: Clark University Press.
[11] (Terman, 1922)
[12] Bernreuter, R. G., Miles, C.C., Tinker, M.A., & Young, K. (1942). Studies in personality. New York, NY: McGraw-Hill Book Company. p. 11
[13] Minton, 1988
[14] (Terman, 1959)
Bibliography
Minton, H.L. (1988). Lewis M. Terman: pioneer in psychology testing. New York, NY: New York University Press.
Terman, L.M. (1915). The mental hygiene of exceptional children. Pedagogical Seminary, 22, 529–537.
Terman, L.M. (1922). A new approach to the study of genius. Psychological Review, 29(4), 310–318.
Terman, L.M. (Ed.). (1959). The gifted group at mid-life. Stanford, CA: Stanford University Press.
Vialle, W. (1994). 'Termanal' science? The work of Lewis Terman revisited. Roeper Review, 17(1), 32–38.
Human Intelligence: Lewis Madison Terman (http://www.indiana.edu/~intell/terman.shtml)
Autobiography of Lewis M. Terman (http://psychclassics.asu.edu/Terman/murchison.htm). First published in Murchison, Carl (Ed.) (1930). History of Psychology in Autobiography (Vol. 2, pp. 297–331). Republished by the permission of Clark University Press, Worcester, MA.
Memorial Resolution: Lewis Madison Terman (http://histsoc.stanford.edu/pdfmem/TermanL.pdf) via Stanford University
Shurkin, Joel (1992). Terman's Kids: The Groundbreaking Study of How the Gifted Grow Up. Boston (MA): Little, Brown. ISBN 978-0-316-78890-8. Lay summary (http://articles.latimes.com/1992-05-31/books/bk-1247_1_lewis-terman/2) (28 June 2010).
External links
Works by Lewis Madison Terman (http://www.gutenberg.org/author/Lewis_Madison_Terman) at Project Gutenberg
Lewis M. Terman, "The Great Conspiracy or the Impulse Imperious of Intelligence Testers, Psychoanalyzed and Exposed by Mr. Lippmann", New Republic 33 (December 27, 1922): 116–120 (http://historymatters.gmu.edu/d/4960)
"Psychological Predictors of Long Life: An 80-year study discovers traits that help people to live longer" (http://www.psychologytoday.com/blog/looking-in-the-cultural-mirror/201206/psychological-predictors-long-life). Psychology Today. June 5, 2012.
Educational offices
32nd President of the American Psychological Association, 1923–1924
Preceded by: Knight Dunlap. Succeeded by: Granville Stanley Hall.
Test (assessment)
A test or examination is an assessment intended to measure a test-taker's knowledge, skill, aptitude, physical fitness, or classification in many other topics (e.g., beliefs). A test may be administered orally, on paper, on a computer, or in a confined area that requires a test taker to physically perform a set of skills. Tests vary in style, rigor and requirements. For example, in a closed book test, a test taker is often required to rely upon memory to respond to specific items whereas in an open book test, a test taker may use one or more supplementary tools such as a reference book or calculator when responding to an item. A test may be administered formally or informally. An example of an informal test would be a reading test administered by a parent to a child. An example of a formal test would be a final examination administered by a teacher in a classroom or an I.Q. test administered by a psychologist in a clinic. Formal testing often results in a grade or a test score.[1] A test score may be interpreted with regard to a norm or criterion, or occasionally both. The norm may be established independently, or by statistical analysis of a large number of participants.
A standardized test is any test that is administered and scored in a consistent manner to ensure legal defensibility.[2] Standardized tests are often used in education, professional certification, psychology (e.g., MMPI), the military, and many other fields.

[Image: Cambodian students taking an exam in order to apply for the Don Bosco Technical School of Sihanoukville in 2008.]

A non-standardized test is usually flexible in scope and format, variable in difficulty and significance. Since these tests are usually developed by individual instructors, the format and difficulty of these tests may not be widely adopted or used by other instructors or institutions. A non-standardized test may be used to determine the proficiency level of students, to motivate students to study, and to provide feedback to students. In some instances, a
teacher may develop non-standardized tests that resemble standardized tests in scope, format, and difficulty for the purpose of preparing their students for an upcoming standardized test. Finally, the frequency and setting in which non-standardized tests are administered are highly variable and are usually constrained by the duration of the class period. A class instructor may, for example, administer a test on a weekly basis or just twice a semester. Depending on the policy of the instructor or institution, the duration of each test may last from only five minutes to an entire class period.
In contrast to non-standardized tests, standardized tests are widely used, fixed in terms of scope, difficulty and format, and are usually significant in consequences. Standardized tests are usually held on fixed dates as determined by the test developer, educational institution, or governing body, and may or may not be administered by the instructor, held within the classroom, or constrained by the classroom period. Although there is little variability between different copies of the same type of standardized test (e.g., SAT or GRE), there is variability between different types of standardized tests. Any test with important consequences for the individual test taker is referred to as a high-stakes test.

A test may be developed and administered by an instructor, a clinician, a governing body, or a test provider. In some instances, the developer of the test may not be directly responsible for its administration. For example, Educational Testing Service (ETS), a nonprofit educational testing and assessment organization, develops standardized tests such as the SAT but may not directly be involved in the administration or proctoring of these tests. As with the development and administration of educational tests, the format and level of difficulty of the tests themselves are highly variable and there is no general consensus or invariable standard for test formats and difficulty. Often, the format and difficulty of the test depend upon the educational philosophy of the instructor, the subject matter, class size, the policy of the educational institution, and the requirements of accreditation or governing bodies. In general, tests developed and administered by individual instructors are non-standardized whereas tests developed by testing organizations are standardized.
History
Ancient China was the first country in the world to implement a nationwide standardized test, called the imperial examination. The main purpose of this examination was to select able candidates for specific governmental positions.[3] The imperial examination was established by the Sui Dynasty in 605 AD and was abolished by the Qing Dynasty in 1905, 1,300 years later. England adopted this examination system in 1806 to select specific candidates for positions in Her Majesty's Civil Service. This examination system was later applied to education, and it started to
influence other parts of the world as it became a prominent standard (e.g., regulations to prevent the markers from knowing the identity of candidates) for delivering standardized tests.

Influence of World Wars on Testing

Both World War I and World War II made many people realize the necessity of standardized testing and the benefits associated with these tests. One main reason people saw the benefits was the Army Alpha and Army Beta tests, which were used during WWI to determine human abilities. Alongside the Army Alpha, the Stanford-Binet Intelligence Scale "added momentum to the testing movement."[4] Soon after, colleges and industry began using tests to help in accepting and hiring people based on performance on the test. Another reason more tests began to come forth was that people were realizing that the distance between secondary education and higher education was widening after WWII. In 1952, the first Advanced Placement (AP) test was administered to begin closing the gap between high schools and colleges.[5]
Competitions
Tests are sometimes used as a tool to select participants who have the potential to succeed in a competition such as a sporting event. For example, serious skaters who wish to participate in figure skating competitions in the United States must pass official U.S. Figure Skating tests just to qualify.
Group memberships
Tests are sometimes used by a group to select certain types of individuals to join the group. For example, Mensa International is a high-I.Q. society that requires individuals to score at the 98th percentile or higher on a standardized, supervised IQ test.
Types of tests
Written tests
Written tests are tests that are administered on paper or on a computer. A test taker who takes a written test could respond to specific items by writing or typing within a given space of the test or on a separate form or document. In some tests, where knowledge of many constants or technical terms is required to effectively answer questions (e.g., in chemistry or biology), the test developer may allow every test taker to bring a cheat sheet.

[Image: Indonesian students taking a written test.]

A test developer's choice of which style or format to use when developing a written test is usually arbitrary, given that there is no single invariant standard for testing. Be that as it may, certain test styles and formats have become more widely used than others. Below is a list of those formats of test items that are widely used by educators and test developers to construct paper or computer-based tests. As a result, these tests may consist of only one type of test item format (e.g., multiple choice test, essay test) or may have a combination of different test item formats (e.g., a test that has multiple choice and essay items).

Multiple choice

In a test that has items formatted as multiple choice questions, a candidate is given a number of set answers for each question, and the candidate must choose which answer or group of answers is correct. There are two families of multiple choice questions. The first family is known as the True/False question and it requires a test taker to choose all answers that are appropriate. The second family is known as the One-Best-Answer question and it requires a test taker to answer only one from a list of answers.
There are several reasons for using multiple choice questions in tests. In terms of administration, multiple choice questions usually require less time for test takers to answer, are easy to score and grade, provide greater coverage of material, allow for a wide range of difficulty, and can easily diagnose a test taker's difficulty with certain concepts. As an educational tool, multiple choice items test many levels of learning as well as a test taker's ability to integrate information, and they provide feedback to the test taker about why distractors were wrong and why correct answers were right. Nevertheless, there are difficulties associated with the use of multiple choice questions. In administrative terms, multiple choice items that are effective usually take a great deal of time to construct. As an educational tool, multiple choice items do not allow test takers to demonstrate knowledge beyond the choices provided and may even encourage guessing or approximation due to the presence of at least one correct answer. For instance, a test taker might not work out explicitly that …, but knowing that …, they would choose an answer close to 48. Moreover, test takers may misinterpret these items and, in the process, perceive them to be tricky or picky. Finally, multiple choice items do not test a test taker's attitudes towards learning because correct responses can be easily faked.

Alternative response

True/False questions present candidates with a binary choice: a statement is either true or false. This method presents problems, as, depending on the number of questions, a significant number of candidates could get 100% just by guesswork, and should on average get 50%.

Matching type

A matching item is an item that provides a defined term and requires a test taker to match identifying characteristics to the correct term.

Completion type

A fill-in-the-blank item provides a test taker with identifying characteristics and requires the test taker to recall the correct term. There are two types of fill-in-the-blank tests. The easier version provides a word bank of possible words that will fill in the blanks. For some exams, all words in the word bank are used exactly once. If a teacher wanted to create a test of medium difficulty, they would provide a test with a word bank, but some words may be used more than once and others not at all. The hardest variety of such a test is a fill-in-the-blank test in which no word bank is provided at all. This generally requires a higher level of understanding and memory than a multiple choice test. Because of this, fill-in-the-blank tests with no word bank are often feared by students.

Essay

Items such as short answer or essay typically require a test taker to write a response to fulfill the requirements of the item. In administrative terms, essay items take less time to construct. As an assessment tool, essay items can test complex learning objectives as well as the processes used to answer the question. The items can also provide a more realistic and generalizable task for a test. Finally, these items make it difficult for test takers to guess the correct answers and require test takers to demonstrate their writing skills as well as correct spelling and grammar. The difficulties with essay items are primarily administrative. For one, these items take more time for test takers to answer. When these questions are answered, the answers themselves are usually poorly written because test takers may not have time to organize and proofread their answers.
In turn, it takes more time to score or grade these items. When these items are being scored or graded, the grading process itself becomes subjective as non-test related information may influence the process. Thus, considerable effort is required to minimize the subjectivity of the grading process. Finally, as an assessment tool, essay questions may potentially be unreliable in assessing the entire content of a subject matter.
Mathematical questions

Most mathematics questions, or calculation questions from subjects such as chemistry, physics or economics, employ a style which does not fall into any of the above categories, although some papers, notably the Maths Challenge papers in the United Kingdom, employ multiple choice. Instead, most mathematics questions state a mathematical problem or exercise that requires a student to write a freehand response. Marks are given more for the steps taken than for the correct answer. If the question has multiple parts, later parts may use answers from previous sections, and marks may be granted if an earlier incorrect answer was used but the correct method was followed, and an answer which is correct (given the incorrect input) is returned. Higher-level mathematical papers may include variations on true/false, where the candidate is given a statement and asked to verify its validity by direct proof or by stating a counterexample.
Performance tests
A performance test is an assessment that requires an examinee to actually perform a task or activity, rather than simply answering questions referring to specific parts. The purpose is to ensure greater fidelity to what is being tested. An example is a behind-the-wheel driving test to obtain a driver's license. Rather than only answering simple multiple-choice items regarding the driving of an automobile, a student is required to actually drive one while being evaluated. Performance tests are commonly used in workplace and professional applications, such as professional certification and licensure. When used for personnel selection, the tests might be referred to as a work sample. A licensure example would be cosmetologists being required to demonstrate a haircut or manicure on a live person. The Group-Bourdon test is one of a number of psychometric tests which trainee train drivers in the UK are required to pass.[12] Some performance tests are simulations. For instance, the assessment to become certified as an ophthalmic technician includes two components, a multiple-choice examination and a computerized skill simulation. The
examinee must demonstrate the ability to complete seven tasks commonly performed on the job, such as retinoscopy, that are simulated on a computer.
Test preparations
From the perspective of a test developer, there is great variability with respect to the time and effort needed to prepare a test. Likewise, from the perspective of a test taker, there is also great variability with respect to the time and effort needed to obtain a desired grade or score on any given test. When a test developer constructs a test, the amount of time and effort is dependent upon the significance of the test itself, the proficiency of the test taker, the format of the test, class size, the deadline of the test, and the experience of the test developer.

The process of test construction has been greatly aided in several ways. For one, many test developers were themselves students at one time, and therefore are able to modify or outright adopt test questions from their previous tests. In some countries, such as the United States, book publishers often provide teaching packages that include test banks to university instructors who adopt their published books for their courses. These test banks may contain up to four thousand sample test questions that have been peer-reviewed and time-tested.[13] An instructor who chooses to use such a test bank would only have to select a fixed number of test questions from it to construct a test.

As with test construction, the time needed for a test taker to prepare for a test is dependent upon the frequency of the test, the test developer, and the significance of the test. In general, nonstandardized tests that are short, frequent, and do not constitute a major portion of the test taker's overall course grade or score do not require the test taker to spend great amounts of time preparing for the test. Conversely, nonstandardized tests that are long, infrequent, and do constitute a major portion of the test taker's overall course grade or score usually require the test taker to spend great amounts of time preparing. To prepare for a nonstandardized test, test takers may rely upon their reference books, class or lecture notes, the Internet, and past experience. Test takers may also use various learning aids to study for tests, such as flash cards and mnemonics.[14] Test takers may even hire tutors to coach them through the process so that they may increase the probability of obtaining a desired test grade or score. Finally, test takers may rely upon past copies of a test from previous years or semesters to study for a future test. These past tests may be provided by a friend or a group that has copies of previous tests, or by instructors and their institutions.[15]

Unlike for nonstandardized tests, the time needed by test takers to prepare for standardized tests is less variable and usually considerable. This is because standardized tests are usually uniform in scope, format, and difficulty and often have important consequences with respect to a test taker's future, such as the test taker's eligibility to attend a specific university program or to enter a desired profession. It is not unusual for test takers to prepare for standardized tests by relying upon commercially available books that provide in-depth coverage of the standardized test or compilations of previous tests (e.g., 10-year series in Singapore). In many countries, test takers even enroll in test preparation centers or cram schools that provide extensive or supplementary instruction to test takers to help them better prepare for a standardized test.
Finally, in some countries, instructors and their institutions have also played a significant role in preparing test takers for a standardized test.
Cheating on tests
Cheating on a test is the process of using unauthorized means or methods for the purpose of obtaining a desired test score or grade. This may range from bringing and using notes during a closed book examination, to copying another test taker's answer or choice of answers during an individual test, to even sending a paid proxy to take the test. Several common methods have been employed to combat cheating. They include the use of multiple proctors or invigilators during a testing period to monitor test takers. Test developers may construct multiple variants of the same test to be administered to different test takers at the same time. In some cases, instructors themselves may not administer their own tests but will leave the task to other instructors or invigilators, which may mean that the invigilators do not know the candidates, and thus some form of identification may be required. Finally, instructors or
test providers may compare the answers of suspected cheaters on the test themselves to determine if cheating did occur.
Criticism

Critics of standardized tests in education often provide the following reasons for revising or removing standardized tests in education:
Narrows curricular format and encourages teaching to the test.
Poor predictive quality.[16]
Grade inflation of test scores or grades.[17][18][19]
Culturally or socioeconomically biased.[20]
References
[1] Thissen, D., & Wainer, H. (2001). Test Scoring. Mahwah, NJ: Erlbaum. Page 1, sentence 1.
[2] North Central Regional Educational Laboratory, NCREL.org (http://www.ncrel.org/sdrs/areas/issues/students/earlycld/ea5lk3.htm)
[3] Advanced Level Examination, Chinese Language and Culture, Paper 1A
[4] Kaplan, R.M., & Saccuzzo, D.P. (2009). Psychological Testing. Belmont, CA: Wadsworth
[5] http://www.collegeboard.com/prod_downloads/about/news_info/ap/ap_history_english.pdf
[8] Name changed in 1996.
Further reading
Airasian, P. (1994). Classroom Assessment (2nd ed.). NY: McGraw-Hill.
Cangelosi, J. (1990). Designing Tests for Evaluating Student Achievement. NY: Addison-Wesley.
Gronlund, N. (1993). How to make achievement tests and assessments (5th ed.). NY: Allyn and Bacon.
Haladyna, T.M. & Downing, S.M. (1989). Validity of a Taxonomy of Multiple-Choice Item-Writing Rules. Applied Measurement in Education, 2(1), 51–78.
Monahan, T. (1998). The Rise of Standardized Educational Testing in the U.S.: A Bibliographic Overview (http://torinmonahan.com/papers/testing.pdf).
Ravitch, Diane. The Uses and Misuses of Tests (http://www.dianeravitch.com/uses_and_misuses.pdf), in The Schools We Deserve (New York: Basic Books, 1985), pp. 172–181.
Wilson, N. (1997). Educational standards and the problem of error. Education Policy Analysis Archives, Vol. 6, No. 10 (http://epaa.asu.edu/epaa/v6n10/)
External links
"About the Joint Committee on Testing Practices" (http://www.apa.org/science/programs/testing/committee. aspx). http://www.apa.org: American Psychological Association. Retrieved 2 Aug 2011. "The Joint Committee on Testing Practices (JCTP) was established in 1985 by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). In 2007 the JCTP disbanded, but JCTP publications are still available and may be obtained by contacting any of the groups listed in the product descriptions shown below."
Test score
A test score is a piece of information, usually a number, that conveys the performance of an examinee on a test. One formal definition is that it is "a summary of the evidence contained in an examinee's responses to the items of a test that are related to the construct or constructs being measured."[1] Test scores are interpreted with a norm-referenced or criterion-referenced interpretation, or occasionally both. A norm-referenced interpretation means that the score conveys meaning about the examinee with regard to their standing among other examinees. A criterion-referenced interpretation means that the score conveys information about the examinee with regard to a specific subject matter, regardless of other examinees' scores.[2]
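A toy contrast of the two interpretations; the norm group, the raw score, and the pass mark below are all invented for illustration.

scores = [28, 31, 35, 38, 40, 42, 44, 45, 47, 49]  # hypothetical norm group
raw = 42

# Norm-referenced: standing relative to other examinees (percentile rank).
percentile = 100.0 * sum(s < raw for s in scores) / len(scores)

# Criterion-referenced: standing relative to the subject matter (a cutoff),
# here 80% of the 50 available points.
passed = raw >= 0.8 * 50

print(percentile)  # 50.0 -> scored above half of the norm group
print(passed)      # True -> meets the criterion regardless of the others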
References
[1] Thissen, D., & Wainer, H. (2001). Test Scoring. Mahwah, NJ: Erlbaum. Page 1, sentence 1.
[2] Iowa Testing Programs guide for interpreting test scores (http://www.education.uiowa.edu/itp/itbs/itbs_interp_score.htm)
Theory of conjoint measurement
Historical overview
In the 1930s, the British Association for the Advancement of Science established the Ferguson Committee to investigate the possibility of psychological attributes being measured scientifically. The British physicist and measurement theorist Norman Robert Campbell was an influential member of the committee. In its Final Report (Ferguson, et al., 1940), Campbell and the Committee concluded that because psychological attributes were not capable of sustaining concatenation operations, such attributes could not be continuous quantities. Therefore, they could not be measured scientifically. This had important ramifications for psychology, the most significant of these being the creation in 1946 of the operational theory of measurement by Harvard psychologist Stanley Smith Stevens. Stevens' non-scientific theory of measurement is widely held as definitive in psychology and the behavioural sciences generally (Michell 1999).

Whilst the German mathematician Otto Hölder (1901) anticipated features of the theory of conjoint measurement, it was not until the publication of Luce & Tukey's seminal 1964 paper that the theory received its first complete exposition. Luce & Tukey's presentation was algebraic and is therefore considered more general than Debreu's (1960) topological work, the latter being a special case of the former (Luce & Suppes 2002). In the first article of the inaugural issue of the Journal of Mathematical Psychology, Luce & Tukey 1964 proved that via the theory of conjoint measurement, attributes not capable of concatenation could be quantified. N.R. Campbell and the Ferguson Committee were thus proven wrong. That a given psychological attribute is a continuous quantity is a logically coherent and empirically testable hypothesis.
Appearing in the next issue of the same journal were important papers by Dana Scott (1964), who proposed a hierarchy of cancellation conditions for the indirect testing of the solvability and Archimedean axioms, and David Krantz (1964), who connected the Luce & Tukey work to that of Hölder (1901). Work soon focused on extending the theory of conjoint measurement to involve more than just two attributes. Krantz (1968) and Amos Tversky (1967) developed what became known as polynomial conjoint measurement, with Krantz (1968) providing a schema with which to construct conjoint measurement structures of three or more attributes. Later, the theory of conjoint measurement (in its two-variable, polynomial and n-component forms) received a thorough and highly technical treatment with the publication of the first volume of Foundations of Measurement, which Krantz, Luce, Tversky and philosopher Patrick Suppes co-wrote (Krantz et al. 1971). Shortly after the publication of Krantz, et al. (1971), work focused upon developing an "error theory" for the theory of conjoint measurement. Studies were conducted into the number of conjoint arrays that supported only single cancellation and both single and double cancellation (Arbuckle & Larimer 1976; McClelland 1977). Later enumeration studies focused on polynomial conjoint measurement (Karabatsos & Ullrich 2002; Ullrich & Wilson 1993). These studies found that it is highly unlikely that the axioms of the theory of conjoint measurement are satisfied at random, provided that more than three levels of at least one of the component attributes have been identified. Joel Michell (1988) later identified that the "no test" class of tests of the double cancellation axiom was empty. Any instance of double cancellation is thus either an acceptance or a rejection of the axiom. Michell also wrote at this time a non-technical introduction to the theory of conjoint measurement (Michell 1990), which also contained a schema for deriving higher-order cancellation conditions based upon Scott's (1964) work. Using Michell's schema, Ben Richards (Kyngdon & Richards, 2007) discovered that some instances of the triple cancellation axiom are "incoherent" as they contradict the single cancellation axiom. Moreover, he identified many instances of triple cancellation which are trivially true if double cancellation is supported. The axioms of the theory of conjoint measurement are not stochastic, and given the ordinal constraints placed on data by the cancellation axioms, order-restricted inference methodology must be used (Iverson & Falmagne 1985). George Karabatsos and his associates (Karabatsos, 2001; Karabatsos & Sheu 2004) developed a Bayesian Markov chain Monte Carlo methodology for psychometric applications. Karabatsos & Ullrich (2002) demonstrated how this framework could be extended to polynomial conjoint structures. Karabatsos (2005) generalised this work with his multinomial Dirichlet framework, which enabled the probabilistic testing of many non-stochastic theories of mathematical psychology. More recently, Clintin Davis-Stober (2009) developed a frequentist framework for order-restricted inference that can also be used to test the cancellation axioms. Perhaps the most notable (Kyngdon, 2011) use of the theory of conjoint measurement was in the prospect theory proposed by the Israeli-American psychologists Daniel Kahneman and Amos Tversky (Kahneman & Tversky, 1979).
Prospect theory was a theory of decision making under risk and uncertainty which accounted for choice behaviour such as the Allais Paradox. David Krantz wrote the formal proof of prospect theory using the theory of conjoint measurement. In 2002, Kahneman received the Nobel Memorial Prize in Economics for prospect theory (Birnbaum, 2008).
For some other quantities, it is easier, or has been conventional, to estimate ratios between attribute differences. Consider temperature, for example. In familiar everyday instances, temperature is measured using instruments calibrated in either the Fahrenheit or Celsius scales. What are really being measured with such instruments are the magnitudes of temperature differences. For example, Anders Celsius defined the unit of the Celsius scale to be 1/100th of the difference in temperature between the freezing and boiling points of water at sea level. A midday temperature measurement of 20 degrees Celsius is simply the ratio of the midday temperature to the Celsius unit. Formally expressed, a scientific measurement is:
Q = r[Q]
where Q is the magnitude of the quantity, r is a real number and [Q] is a unit magnitude of the same kind. This classical/standard definition of measurement does not take into account that measurement in one physical realm is affected by other physical realms, as demonstrated by the Heisenberg uncertainty principle and Einstein's theories of special and general relativity. For instance, we know from Boyle's law that the measurement of volume is affected by temperature, pressure, and so on: a gallon of gasoline measured in winter will expand in volume by summer, and vice versa. The definition of temperature in degrees Celsius is itself based upon the boiling temperature of water at sea level, but do we usually account for this in our measurement of temperature? We also know from Einstein's theories that length is not constant for any object in motion, and all objects in the universe are under varying motion; similarly for time. Therefore it is not possible for any measurement (physical or psychological) to be the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind.
Theory
Consider two natural attributes A and X. It is not known that either A or X is a continuous quantity, or that both of them are. Let a, b, and c represent three independent, identifiable levels of A; and let x, y and z represent three independent, identifiable levels of X. A third attribute, P, consists of the nine ordered pairs of levels of A and X. That is, (a, x), (a, y), ..., (c, z) (see Figure 1). The quantification of A, X and P depends upon the behaviour of the relation holding upon the levels of P. These relations are presented as axioms in the theory of conjoint measurement.
Single cancellation or independence axiom
The single cancellation axiom is as follows. The relation upon P satisfies single cancellation if and only if for all a and b in A, and x in X, (a, x) > (b, x) is implied for every w in X such that (a, w) > (b, w). Similarly, for all x and y in X and a in A, (a, x) > (a, y) is implied for every d in A such that (d, x) > (d, y). What this means is that if any two levels, a and b, are ordered, then this order holds irrespective of each and every level of X. The same holds for any two levels, x and y, of X with respect to each and every level of A. (Figure One: graphical representation of the single cancellation axiom. It can be seen that a > b because (a, x) > (b, x), (a, y) > (b, y) and (a, z) > (b, z).)
Single cancellation is so called because a single common factor of two levels of P cancels out to leave the same ordinal relationship holding on the remaining elements. For example, a cancels out of the inequality (a, x) > (a, y) as it is common to both sides, leaving x > y. Krantz, et al. (1971) originally called this axiom independence, as the ordinal relation between two levels of an attribute is independent of any and all levels of the other attribute. However, given that the term independence causes confusion with statistical concepts of independence, single cancellation is the preferable term. Figure One is a graphical representation of one instance of single cancellation.
Satisfaction of the single cancellation axiom is necessary, but not sufficient, for the quantification of attributes A and X. It only demonstrates that the levels of A, X and P are ordered. Informally, single cancellation does not sufficiently constrain the order upon the levels of P to quantify A and X. For example, consider the ordered pairs (a, x), (b, x) and (b, y). If single cancellation holds then (a, x) > (b, x) and (b, x) > (b, y). Hence via transitivity (a, x) > (b, y). The relation between these latter two ordered pairs, informally a left-leaning diagonal, is determined by the satisfaction of the single cancellation axiom, as are all the "left-leaning diagonal" relations upon P.
Double cancellation axiom
Single cancellation does not determine the order of the "right-leaning diagonal" relations upon P. Even though by transitivity and single cancellation it was established that (a, x) > (b, y), the relationship between (a, y) and (b, x) remains undetermined. It could be that either (b, x) > (a, y) or (a, y) > (b, x), and such ambiguity cannot remain unresolved. The double cancellation axiom concerns a class of such relations upon P in which the common terms of two antecedent inequalities cancel out to produce a third inequality. Consider the instance of double cancellation graphically represented by Figure Two (a Luce-Tukey instance of double cancellation, in which the consequent inequality, shown as a broken-line arrow, does not contradict the direction of both antecedent inequalities, shown as solid-line arrows, so supporting the axiom). The antecedent inequalities of this particular instance of double cancellation are:
a + y ≥ b + x; and
b + z ≥ c + y.
From these antecedents, it follows that:
a + y + b + z ≥ b + x + c + y.
Cancelling the common terms results in:
a + z ≥ c + x.
Hence double cancellation can only obtain when A and X are quantities. Double cancellation is satisfied if and only if the consequent inequality does not contradict the antecedent inequalities. For example, if the consequent inequality above were instead reversed, that is, c + x > a + z, then double cancellation would be violated (Michell 1988) and it could not be concluded that A and X are quantities.
Double cancellation concerns the behaviour of the "right-leaning diagonal" relations on P, as these are not logically entailed by single cancellation. Michell (2009) discovered that when the levels of A and X approach infinity, the number of right-leaning diagonal relations is half of the total number of relations upon P. Hence if A and X are quantities, half of the relations upon P are due to ordinal relations upon A and X and half are due to additive relations upon A and X (Michell 2009).
The number of instances of double cancellation is contingent upon the number of levels identified for both A and X. If there are n levels of A and m of X, then the number of instances of double cancellation is n! × m!. Therefore, if n = m = 3, then 3! × 3! = 6 × 6 = 36 instances in total of double cancellation. However, all but 6 of these instances are trivially true if single cancellation is true, and if any one of these 6 instances is true, then all of them are true. One such instance is that shown in Figure Two. Michell (1988) calls this a Luce-Tukey instance of double cancellation. If single cancellation has been tested upon a set of data first and is established, then only the Luce-Tukey instances of double cancellation need to be tested. For n levels of A and m of X, the number of Luce-Tukey double cancellation instances is C(n, 3) × C(m, 3). For example, if n = m = 4, then there are 16 such instances. If n = m = 5 then there are 100. The greater the number of levels in both A and X, the less probable it is that the cancellation axioms are satisfied at random (Arbuckle & Larimer 1976; McClelland 1977), and the more stringent a test of quantity the application of conjoint measurement becomes.
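As an illustration of how these checks can be run on a finite conjoint array, here is a hypothetical Python sketch (not taken from the cited literature); P[i][j] stands for the observed magnitude at level i of A and level j of X, and the instance count follows the C(n, 3) × C(m, 3) formula above, which matches the 16 and 100 examples in the text.

```python
from itertools import combinations
from math import comb

def luce_tukey_count(n, m):
    """Number of Luce-Tukey double cancellation instances for an
    n-by-m conjoint array: C(n, 3) * C(m, 3), e.g. 16 for a 4x4
    array and 100 for a 5x5 array."""
    return comb(n, 3) * comb(m, 3)

def double_cancellation_holds(P):
    """Check every Luce-Tukey instance: whenever both antecedent
    inequalities hold, the consequent must not be reversed."""
    n, m = len(P), len(P[0])
    for a, b, c in combinations(range(n), 3):       # three levels of A
        for x, y, z in combinations(range(m), 3):   # three levels of X
            if P[a][y] >= P[b][x] and P[b][z] >= P[c][y]:
                if P[a][z] < P[c][x]:
                    return False
    return True

# An additive array P[i][j] = A_i + X_j satisfies the axiom by construction.
A_levels, X_levels = (0, 1, 7, 9), (0, 2, 5, 11)
P = [[ai + xj for xj in X_levels] for ai in A_levels]
print(luce_tukey_count(4, 4))          # 16
print(double_cancellation_holds(P))    # True
```

A non-additive array (for example, one built from products of large and small numbers with re-ordered rows) will generally fail the check, which is the sense in which the axioms provide an empirical test of quantity.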
Solvability and Archimedean axioms
The single and double cancellation axioms by themselves are not sufficient to establish continuous quantity. Other conditions must also be introduced to ensure continuity. These are the solvability and Archimedean conditions.
Solvability means that for any three of the elements a, b, x and y, the fourth exists such that the equation (a, x) = (b, y) is solved, hence the name of the condition. (Figure Three: an instance of triple cancellation.) Solvability essentially is the requirement that each level of P has an element in A and an element in X. Solvability reveals something about the levels of A and X: they are either dense like the real numbers or equally spaced like the integers (Krantz et al. 1971).
The Archimedean condition is as follows. Let I be a set of consecutive integers, either finite or infinite, positive or negative. The levels of A form a standard sequence if and only if there exist x and y in X, where x ≠ y, such that for all integers i and i + 1 in I:
(a_i, x) = (a_(i+1), y).
What this basically means is that if x is greater than y, for example, there are levels of A which can be found which make two relevant ordered pairs, levels of P, equal. The Archimedean condition argues that there is no infinitely greatest level of P, and hence there is no greatest level of either A or X. This condition is a definition of continuity given by the ancient Greek mathematician Archimedes, who wrote that "Further, of unequal lines, unequal surfaces, and unequal solids, the greater exceeds the less by such a magnitude as, when added to itself, can be made to exceed any assigned magnitude among those which are comparable with one another" (On the Sphere and Cylinder, Book I, Assumption 5). Archimedes recognised that for any two magnitudes of a continuous quantity, one being lesser than the other, the lesser could be multiplied by a whole number such that it equalled the greater magnitude. Euclid stated the Archimedean condition as an axiom in Book V of the Elements, in which Euclid presented his theory of continuous quantity and measurement.
As they involve infinitistic concepts, the solvability and Archimedean axioms are not amenable to direct testing in any finite empirical situation. But this does not entail that these axioms cannot be empirically tested at all. Scott's (1964) finite set of cancellation conditions can be used to indirectly test these axioms, the extent of such testing being empirically determined. For example, if both A and X possess three levels, the highest-order cancellation axiom within Scott's (1964) hierarchy that indirectly tests solvability and Archimedeaness is double cancellation. With four levels it is triple cancellation (Figure Three). If such tests are satisfied, the construction of standard sequences in differences upon A and X is possible. Hence these attributes may be dense as per the real numbers or equally spaced as per the integers (Krantz et al. 1971). In other words, A and X are continuous quantities.
Successful application of conjoint measurement supports measurement on interval scales (an interval scale, in Stevens' (1946) parlance). The mathematical proof of this result is given in Krantz et al. (1971, pp. 261-6). This means that the levels of A and X are magnitude differences measured relative to some kind of unit difference. Each level of P is a difference between the levels of A and X. However, it is not clear from the literature how a unit could be defined within an additive conjoint context. van der Ven (1980) proposed a scaling method for conjoint structures, but he did not discuss the unit.
The theory of conjoint measurement, however, is not restricted to the quantification of differences. If each level of P is a product of a level of A and a level of X, then P is another, different quantity whose measurement is expressed as a magnitude of A per unit magnitude of X. For example, if A consists of masses and X consists of volumes, then P consists of densities measured as mass per unit of volume. In such cases, it would appear that one level of A and one level of X must be identified as a tentative unit prior to the application of conjoint measurement.
If each level of P is the sum of a level of A and a level of X, then P is the same quantity as A and X. For example, if A and X are lengths, so must be P. All three must therefore be expressed in the same unit. In such cases, it would appear that a level of either A or X must be tentatively identified as the unit. Hence it would seem that application of conjoint measurement requires some prior descriptive theory of the relevant natural system.
…not an unlikely event. Kyngdon & Richards (2007) employed eight statements and found that the interstimulus midpoint orders rejected the double cancellation condition. Perline, Wright & Wainer (1979) applied conjoint measurement to item response data from a convict parole questionnaire and to intelligence test data gathered from Danish troops. They found considerable violation of the cancellation axioms in the parole questionnaire data, but not in the intelligence test data. Moreover, they recorded the supposed "no-test" instances of double cancellation. Interpreted correctly as instances in support of double cancellation (Michell, 1988), the results of Perline, Wright & Wainer (1979) are better than they believed.
Stankov & Cregan (1993) applied conjoint measurement to performance on sequence completion tasks. The columns of their conjoint arrays (X) were defined by the demand placed upon working memory capacity through increasing numbers of working memory place keepers in letter series completion tasks. The rows were defined by levels of motivation (A), which consisted in different amounts of time available for completing the test. Their data (P) consisted of completion times and the average number of series correct. They found support for the cancellation axioms; however, their study was biased by the small size of the conjoint arrays (3 × 3 in size) and by statistical techniques that did not take into consideration the ordinal restrictions imposed by the cancellation axioms.
Kyngdon (2011) used Karabatsos' (2001) order restricted inference framework to test a conjoint matrix of reading item response proportions (P), where examinee reading ability comprised the rows of the conjoint array (A) and the difficulty of the reading items formed the columns of the array (X). The levels of reading ability were identified via raw total test score and the levels of reading item difficulty were identified by the Lexile Framework for Reading (Stenner et al. 2006). Kyngdon found that satisfaction of the cancellation axioms was obtained only through permutation of the matrix in a manner inconsistent with the putative Lexile measures of item difficulty. Kyngdon also tested simulated ability test response data using polynomial conjoint measurement. The data were generated using Humphry's extended frame of reference Rasch model (Humphry & Andrich 2008). He found support for distributive, single and double cancellation consistent with a distributive polynomial conjoint structure in three variables (Krantz & Tversky 1971).
References
Arbuckle, J.; Larimer, J. (1976). "The number of two-way tables satisfying certain additivity axioms". Journal of Mathematical Psychology 12: 89-100. doi:10.1016/0022-2496(76)90036-5.
Birnbaum, M.H. (2008). "New paradoxes of risky decision making". Psychological Review 115 (2): 463-501. doi:10.1037/0033-295X.115.2.463. PMID 18426300.
Brogden, H.E. (December 1977). "The Rasch model, the law of comparative judgement and additive conjoint measurement". Psychometrika 42 (4): 631-4. doi:10.1007/BF02295985.
Cliff, N. (1992). "Abstract measurement theory and the revolution that never happened". Psychological Science 3 (3): 186-190. doi:10.1111/j.1467-9280.1992.tb00024.x.
Coombs, C.H. (1964). A Theory of Data. New York: Wiley.
Davis-Stober, C.P. (February 2009). "Analysis of multinomial models under inequality constraints: applications to measurement theory". Journal of Mathematical Psychology 53 (1): 1-13. doi:10.1016/j.jmp.2008.08.003.
Debreu, G. (1960). "Topological methods in cardinal utility theory". In Arrow, K.J.; Karlin, S.; Suppes, P. Mathematical Methods in the Social Sciences. Stanford University Press. pp. 16-26.
Embretson, S.E.; Reise, S.P. (2000). Item Response Theory for Psychologists. Erlbaum.
Emerson, W.H. (2008). "On quantity calculus and units of measurement". Metrologia 45 (2): 134-138. Bibcode:2008Metro..45..134E. doi:10.1088/0026-1394/45/2/002.
Fischer, G. (1995). "Derivations of the Rasch model". In Fischer, G.; Molenaar, I.W. Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer. pp. 15-38.
Gigerenzer, G.; Strube, G. (1983). "Are there limits to binaural additivity of loudness?". Journal of Experimental Psychology: Human Perception and Performance 9: 126-136. doi:10.1037/0096-1523.9.1.126.
Grayson, D.A. (September 1988). "Two-group classification and latent trait theory: scores with monotone likelihood ratio". Psychometrika 53 (3): 383-392. doi:10.1007/BF02294219.
Hölder, O. (1901). "Die Axiome der Quantität und die Lehre vom Mass". Berichte über die Verhandlungen der Königlich Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematisch-Physikalische Klasse 53: 1-46. (Part 1 translated by Michell, J.; Ernst, C. (September 1996). "The axioms of quantity and the theory of measurement". Journal of Mathematical Psychology 40 (3): 235-252. doi:10.1006/jmps.1996.0023. PMID 8979975.)
Humphry, S.M.; Andrich, D. (2008). "Understanding the unit in the Rasch model". Journal of Applied Measurement 9 (3): 249-264. PMID 18753694.
Iverson, G.; Falmagne, J.C. (1985). "Statistical issues in measurement". Mathematical Social Sciences 10 (2): 131-153. doi:10.1016/0165-4896(85)90031-9.
Johnson, T. (2001). "Controlling the effect of stimulus context change on attitude statements using Michell's binary tree procedure". Australian Journal of Psychology 53: 23-28. doi:10.1080/00049530108255118.
Kahneman, D.; Tversky, A. (1979). "Prospect theory: an analysis of decision under risk". Econometrica 47 (2): 263-291. doi:10.2307/1914185.
Karabatsos, G. (2001). "The Rasch model, additive conjoint measurement, and new models of probabilistic measurement theory". Journal of Applied Measurement 2 (4): 389-423. PMID 12011506.
Karabatsos, G. (February 2005). "The exchangeable multinomial model as an approach for testing axioms of choice and measurement". Journal of Mathematical Psychology 49 (1): 51-69. doi:10.1016/j.jmp.2004.11.001.
Karabatsos, G.; Sheu, C.F. (2004). "Bayesian order constrained inference for dichotomous models of unidimensional non-parametric item response theory". Applied Psychological Measurement 28 (2): 110-125. doi:10.1177/0146621603260678.
Karabatsos, G.; Ullrich, J.R. (2002). "Enumerating and testing conjoint measurement models". Mathematical Social Sciences 43 (3): 485-504. doi:10.1016/S0165-4896(02)00024-0.
Keats, J.A. (1967). "Test theory". Annual Review of Psychology 18: 217-238. doi:10.1146/annurev.ps.18.020167.001245. PMID 5333423.
Kline, P. (1998). The New Psychometrics: Science, Psychology and Measurement. London: Routledge.
Krantz, D.H. (July 1964). "Conjoint measurement: the Luce-Tukey axiomatisation and some extensions". Journal of Mathematical Psychology 1 (2): 248-277. doi:10.1016/0022-2496(64)90003-3.
Krantz, D.H. (1968). "A survey of measurement theory". In Danzig, G.B.; Veinott, A.F. Mathematics of the Decision Sciences: Part 2. Providence, Rhode Island: American Mathematical Society. pp. 314-350.
Krantz, D.H.; Luce, R.D.; Suppes, P.; Tversky, A. (1971). Foundations of Measurement, Vol. I: Additive and Polynomial Representations. New York: Academic Press.
Krantz, D.H.; Tversky, A. (1971). "Conjoint measurement analysis of composition rules in psychology". Psychological Review 78 (2): 151-169. doi:10.1037/h0030637.
Kyngdon, A. (2006). "An empirical study into the theory of unidimensional unfolding". Journal of Applied Measurement 7 (4): 369-393. PMID 17068378.
Kyngdon, A. (2008). "The Rasch model from the perspective of the representational theory of measurement". Theory & Psychology 18: 89-109. doi:10.1177/0959354307086924.
Kyngdon, A. (2011). "Plausible measurement analogies to some psychometric models of test performance". British Journal of Mathematical and Statistical Psychology 64 (3): 478-497. doi:10.1348/2044-8317.002004. PMID 21973097.
Kyngdon, A.; Richards, B. (2007). "Attitudes, order and quantity: deterministic and direct probabilistic tests of unidimensional unfolding". Journal of Applied Measurement 8 (1): 1-34. PMID 17215563.
Levelt, W.J.M.; Riemersma, J.B.; Bunt, A.A. (May 1972). "Binaural additivity of loudness". British Journal of Mathematical and Statistical Psychology 25 (1): 51-68. doi:10.1111/j.2044-8317.1972.tb00477.x. PMID 5031649.
Luce, R.D.; Suppes, P. (2002). "Representational measurement theory". In Pashler, H.; Wixted, J. Stevens' Handbook of Experimental Psychology: Vol. 4. Methodology in Experimental Psychology (3rd ed.). New York: Wiley. pp. 1-41.
Luce, R.D.; Tukey, J.W. (January 1964). "Simultaneous conjoint measurement: a new scale type of fundamental measurement". Journal of Mathematical Psychology 1 (1): 1-27. doi:10.1016/0022-2496(64)90015-X.
McClelland, G. (June 1977). "A note on Arbuckle and Larimer: the number of two-way tables satisfying certain additivity axioms". Journal of Mathematical Psychology 15 (3): 292-5. doi:10.1016/0022-2496(77)90035-9.
Michell, J. (December 1988). "Some problems in testing the double cancellation condition in conjoint measurement". Journal of Mathematical Psychology 32 (4): 466-473. doi:10.1016/0022-2496(88)90024-7.
Michell, J. (1990). An Introduction to the Logic of Psychological Measurement. Hillsdale, NJ: Erlbaum.
Michell, J. (June 1994). "Measuring dimensions of belief by unidimensional unfolding". Journal of Mathematical Psychology 38 (2): 224-273. doi:10.1006/jmps.1994.1016.
Michell, J. (February 2009). "The psychometricians' fallacy: Too clever by half?". British Journal of Mathematical and Statistical Psychology 62 (1): 41-55. doi:10.1348/000711007X243582.
Perline, R.; Wright, B.D.; Wainer, H. (1979). "The Rasch model as additive conjoint measurement". Applied Psychological Measurement 3 (2): 237-255. doi:10.1177/014662167900300213.
Scheiblechner, H. (September 1999). "Additive conjoint isotonic probabilistic models (ADISOP)". Psychometrika 64 (3): 295-316. doi:10.1007/BF02294297.
Scott, D. (July 1964). "Measurement models and linear inequalities". Journal of Mathematical Psychology 1 (2): 233-247. doi:10.1016/0022-2496(64)90002-1.
Sherman, K. (April 1994). "The effect of change in context in Coombs's unfolding theory". Australian Journal of Psychology 46 (1): 41-47. doi:10.1080/00049539408259468.
Stankov, L.; Cregan, A. (1993). "Quantitative and qualitative properties of an intelligence test: series completion". Learning and Individual Differences 5 (2): 137-169. doi:10.1016/1041-6080(93)90009-H.
Stenner, A.J.; Burdick, H.; Sanford, E.E.; Burdick, D.S. (2006). "How accurate are Lexile text measures?". Journal of Applied Measurement 7 (3): 307-322. PMID 16807496.
Stevens, S.S. (1946). "On the theory of scales of measurement". Science 103 (2684): 677-680. Bibcode:1946Sci...103..677S. doi:10.1126/science.103.2684.677. PMID 17750512.
Stober, C.P. (2009). Luce's Challenge: Quantitative Models and Statistical Methodology.
Thurstone, L.L. (1927). "A law of comparative judgement". Psychological Review 34 (4): 278-286. doi:10.1037/h0070288.
Tversky, A. (1967). "A general theory of polynomial conjoint measurement". Journal of Mathematical Psychology 4: 1-20. doi:10.1016/0022-2496(67)90039-9.
Ullrich, J.R.; Wilson, R.E. (December 1993). "A note on the exact number of two and three way tables satisfying conjoint measurement and additivity axioms". Journal of Mathematical Psychology 37 (4): 624-8. doi:10.1006/jmps.1993.1037.
van der Linden, W. (March 1994). "Review of Michell (1990)". Psychometrika 59 (1): 139-142. doi:10.1007/BF02294273.
van der Ven, A.H.G.S. (1980). Introduction to Scaling. New York: Wiley.
External links
Karabatsos' S-Plus programs for testing conjoint axioms
Birnbaum's FORTRAN MONANOVA program for testing additivity
Kyngdon's R programs for enumerating cancellation tests, testing axioms and prospect theory
R statistical computing software
Thurstone scale
In psychology, the Thurstone scale was the first formal technique for measuring an attitude. It was developed by Louis Leon Thurstone in 1928, as a means of measuring attitudes towards religion. It is made up of statements about a particular issue, and each statement has a numerical value indicating how favorable or unfavorable it is judged to be. People check each of the statements to which they agree, and a mean score is computed, indicating their attitude.
Thurstone's method of pair comparisons can be considered a prototype of a normal distribution-based method for scaling dominance matrices. Even though the theory behind this method is quite complex (Thurstone, 1927a), the algorithm itself is straightforward. For the basic Case V, the frequency dominance matrix is translated into proportions and interfaced with the standard scores. The scale is then obtained as a left-adjusted column marginal average of this standard score matrix (Thurstone, 1927b). The underlying rationale for the method, and the basis for the measurement of the "psychological scale separation between any two stimuli", derives from Thurstone's law of comparative judgment (Thurstone, 1928).
The principal difficulty with this algorithm is its indeterminacy with respect to one-zero proportions, which return z values of plus or minus infinity, respectively. The inability of the pair comparisons algorithm to handle these cases imposes considerable limits on the applicability of the method. The most frequent recourse when 1.00-0.00 frequencies are encountered is their omission. Thus, e.g., Guilford (1954, p. 163) has recommended not using proportions more extreme than .977 or .023, and Edwards (1957, pp. 41-42) has suggested that "if the number of judges is large, say 200 or more, then we might use pij values of .99 and .01, but with less than 200 judges, it is probably better to disregard all comparative judgments for which pij is greater than .98 or less than .02." Since the omission of such extreme values leaves empty cells in the Z matrix, the averaging procedure for arriving at the scale values cannot be applied, and an elaborate procedure for the estimation of unknown parameters is usually employed (Edwards, 1957, pp. 42-46). An alternative solution to this problem was suggested by Krus and Kennedy (1977).
With later developments in psychometric theory, it has become possible to employ direct methods of scaling such as application of the Rasch model or unfolding models such as the Hyperbolic Cosine Model (HCM) (Andrich & Luo, 1993). The Rasch model has a close conceptual relationship to Thurstone's law of comparative judgment (Andrich, 1978), the principal difference being that it directly incorporates a person parameter. Also, the Rasch model takes the form of a logistic function rather than a cumulative normal function.
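A minimal Python sketch of the Case V algorithm described above; the dominance counts are invented for illustration, and the clipping bound follows Edwards' suggestion of disregarding proportions more extreme than .98/.02.

```python
import numpy as np
from scipy.stats import norm

# Invented data: F[i, j] = number of the 100 judges who preferred
# stimulus j over stimulus i in paired comparisons of three stimuli.
F = np.array([[ 0., 60., 75.],
              [40.,  0., 65.],
              [25., 35.,  0.]])
n_judges = 100.0

P = F / n_judges                 # dominance proportions p_ij
np.fill_diagonal(P, 0.5)         # a stimulus neither beats nor loses to itself

# Bound extreme proportions, which would otherwise map to infinite z values.
P = P.clip(0.02, 0.98)

Z = norm.ppf(P)                  # translate proportions into standard scores
scale = Z.mean(axis=0)           # column marginal averages
scale -= scale.min()             # left-adjust so the lowest value is zero
print(scale)                     # estimated scale values of the three stimuli
```

Note that this simple averaging only works when no cells have been omitted; with omitted extreme proportions, the more elaborate estimation procedures mentioned above are required.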
References
Andrich, D. (1978b) Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2, 449-460.
Andrich, D. & Luo, G. (1993) A hyperbolic cosine model for unfolding dichotomous single-stimulus responses. Applied Psychological Measurement, 17, 253-276.
Babbie, E., The Practice of Social Research, 10th edition, Wadsworth, Thomson Learning Inc., ISBN 0-534-62029-9.
Edwards, A. L. Techniques of attitude scale construction. New York: Appleton-Century-Crofts, 1957.
Guilford, J. P. Psychometric methods. New York: McGraw-Hill, 1954.
Krus, D.J., & Kennedy, P.H. (1977) Normal scaling of dominance matrices: The domain-referenced model. Educational and Psychological Measurement, 37, 189-193.
Krus, D.J., Sherman, J.L., & Kennedy, P.H. (1977) Changing values over the last half-century: the story of Thurstone's crime scales. Psychological Reports, 40, 207-211 (reprint: http://www.visualstatistics.net/Readings/Thurstone%20Crimes%20Scale/Thurstone%20Crimes%20Scale.htm).
Thurstone, L. L. (1927a) A law of comparative judgment. Psychological Review, 34, 273-286.
Thurstone, L. L. (1927b) The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 21, 384-400.
Thurstone, L. L. (1928) Attitudes can be measured. American Journal of Sociology, 33, 529-554.
Thurstonian model
A Thurstonian model is a latent variable model for describing the mapping of some continuous scale onto discrete, possibly ordered categories of response. In the model, each of these categories of response corresponds to a latent variable whose value is drawn from a normal distribution, independently of the other response variables and with constant variance. Thurstonian models have been used as an alternative to generalized linear models in the analysis of sensory discrimination tasks. They have also been used to model long-term memory in ranking tasks of ordered alternatives, such as the order of the amendments to the US Constitution.[1] Their main advantage over other models of ranking tasks is that they account for non-independence of alternatives.
Definition
Consider a set of m options to be ranked by n independent judges. Such a ranking can be represented by the ordering vector r_i = (r_i1, r_i2, ..., r_im). Rankings are assumed to be derived from real-valued latent variables z_ij, representing the evaluation of option j by judge i. Rankings r_i are derived deterministically from z_i such that z_i(r_i1) < z_i(r_i2) < ... < z_i(r_im).
The z_ij are assumed to be derived from an underlying ground truth value μ_j for each option. In the most general case, they are multivariate-normally distributed:
z_ij = μ_j + ε_ij,
where ε_i is multivariate-normally distributed around 0 with covariance matrix Σ. In a simpler case, there is a single standard deviation parameter σ_i for each judge:
z_ij = μ_j + ε_ij, with ε_ij ~ N(0, σ_i²).
Inference
The Gibbs-sampler based approach to estimating model parameters is due to Yao and Böckenholt (1999).
Step 1: Given μ, Σ, and r_i, sample z_i. The z_ij must be sampled from a truncated multivariate normal distribution to preserve their rank ordering. Hajivassiliou's truncated multivariate normal Gibbs sampler can be used to sample efficiently.[2][3]
Step 2: Given Σ and z_i, sample μ. μ is sampled from a normal distribution whose mean and covariance are given by the current estimates μ* and Σ*.
Step 3: Given μ and z_i, sample Σ. Σ⁻¹ is sampled from a Wishart posterior, combining a Wishart prior with the data likelihood from the samples ε_i = z_i − μ.
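A toy forward simulation of the ranking model defined above may help fix the notation; the option values μ and the single noise scale are invented, and this sketch only simulates data rather than running the Gibbs sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.0, 0.5, 1.5])   # invented ground-truth values, m = 3 options
sigma = 1.0                      # single noise scale shared by all judges
n_judges = 1000

# Each judge i draws latent evaluations z_ij = mu_j + noise, then the
# ordering vector r_i is obtained by sorting the z_ij in ascending order.
z = mu + rng.normal(0.0, sigma, size=(n_judges, len(mu)))
rankings = np.argsort(z, axis=1)

# With enough judges, the modal ordering recovers the order of mu.
orders, counts = np.unique(rankings, axis=0, return_counts=True)
print(orders[counts.argmax()])   # most frequent ordering, e.g. [0 1 2]
```

Inference reverses this process: given only the observed orderings, the Gibbs sampler alternates between imputing the latent z_i consistent with each ranking and updating μ and Σ.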
History
Thurstonian models were introduced by Louis Leon Thurstone to describe the law of comparative judgment.[4] Prior to 1999, Thurstonian models were rarely used for modeling tasks involving more than four options because of the high-dimensional integration required to estimate the parameters of the model. In 1999, Yao and Böckenholt introduced their Gibbs-sampler based approach to estimating model parameters.
Torrance Tests of Creative Thinking
Description
Building on J.P. Guilford's work and created by Ellis Paul Torrance, the Torrance Tests of Creative Thinking (TTCT), a test of creativity, originally involved simple tests of divergent thinking and other problem-solving skills, which were scored on four scales:
Fluency. The total number of interpretable, meaningful, and relevant ideas generated in response to the stimulus.
Flexibility. The number of different categories of relevant responses.
Originality. The statistical rarity of the responses.
Elaboration. The amount of detail in the responses.
The third edition of the TTCT in 1984 eliminated the Flexibility scale from the figural test, but added Resistance to Premature Closure (based on Gestalt psychology) and Abstractness of Titles as two new criterion-referenced scores on the figural test. Torrance called the new scoring procedure Streamlined Scoring. To the five norm-referenced measures that he now had (fluency, originality, abstractness of titles, elaboration and resistance to premature closure), he added 13 criterion-referenced measures, which include: emotional expressiveness, story-telling articulateness, movement or actions, expressiveness of titles, syntheses of incomplete figures, synthesis of lines, synthesis of circles, unusual visualization, extending or breaking boundaries, humor, richness of imagery, colourfulness of imagery, and fantasy.[1]
According to Arasteh and Arasteh (1976), the most systematic assessment of creativity in elementary school children has been conducted by Torrance and his associates (1960a, 1960b, 1960c, 1961, 1962, 1962a, 1963a, 1964), who developed and administered the Minnesota Tests of Creative Thinking (MTCT), later renamed the TTCT, to several thousands of school children. Although they have used many of Guilford's concepts in their test construction, the Minnesota group, in contrast to Guilford, has devised tasks which can be scored for several factors, involving both verbal and non-verbal aspects and relying on senses other than vision. These tests represent a fairly sharp departure from the factor-type tests developed by Guilford and his associates (Guilford, Merrifield and Cox, 1961; Merrifield, Guilford and Gershan, 1963), and they also differ from the battery developed by Wallach and Kogan (1965), which contains measures representing creative tendencies that are similar in nature (Torrance, 1968).
To date, several longitudinal studies have been conducted to follow up the elementary school-aged students who were first administered the Torrance Tests in 1958 in Minnesota. There was a 22-year follow-up,[2][3][4] a 40-year follow-up,[5] and a 50-year follow-up.[6]
Torrance (1962) grouped the different subtests of the Minnesota Tests of Creative Thinking (MTCT) into three categories:
1. Verbal tasks using verbal stimuli
2. Verbal tasks using non-verbal stimuli
3. Non-verbal tasks
Tasks
A brief description of the tasks used by Torrance is given below:
Unusual Uses
The unusual uses tasks using verbal stimuli are direct modifications of Guilford's Brick Uses test. After preliminary tryouts, Torrance (1962) decided to substitute tin cans and books for bricks. It was believed that children would be able to handle tin cans and books more easily, since both are more available to children than bricks.
Impossibilities task
It was used originally by Guilford and his associates (1951) as a measure of fluency involving complex restrictions and large potential. In a course in personality development and mental hygiene, Torrance experimented with a number of modifications of the basic task, making the restrictions more specific. In this task the subjects are asked to list as many impossibilities as they can.
Consequences task
The consequences task was also used originally by Guilford and his associates (1951). Torrance has made several modifications in adapting it. He chose three improbable situations and the children were required to list their consequences.
Just suppose task
It is an adaptation of the consequences type of test, designed to elicit a higher degree of spontaneity and to be more effective with children. As in the consequences task, the subject is confronted with an improbable situation and asked to predict the possible outcomes from the introduction of a new or unknown variable.
Situations task
The situations task was modeled after Guilford's (1951) test designed to assess the ability to see what needs to be done. Subjects were given three common problems and asked to think of as many solutions to these problems as they can. For example, if all schools were abolished, what would you do to try to become educated?
Common problems task
This task is an adaptation of Guilford's (1951) test designed to assess the ability to see defects, needs and deficiencies, found to be one of the tests of the factor termed sensitivity to problems. Subjects are instructed that they will be given common situations and asked to think of as many problems
as they can that may arise in connection with these situations. For example, doing homework while going to school in the morning.
Improvement task
This test was adapted from Guilford's (1952) apparatus test, which was designed to assess the ability to see defects and all aspects of sensitivity to problems. In this task the subjects are given a list of common objects and asked to suggest as many ways as they can to improve each object. They are asked not to bother about whether or not it is possible to implement the change thought of.
Mother Hubbard problem
This task was conceived as an adaptation of the situations task for oral administration in the primary grades, and is also useful for older groups. This test has stimulated a number of ideas concerning factors which inhibit the development of ideas.
Imaginative stories task
In this task the child is told to write the most interesting and exciting story he can think of. Topics are suggested (e.g., the dog that did not bark); or the child may use his own ideas.
Cow jumping problem
The cow jumping problem is a companion task for the Mother Hubbard problem, and has been administered to the same groups under the same conditions and scored according to similar procedures. The task is to think of all possible things which might have happened when the cow jumped over the moon.
Non-verbal tasks
Incomplete figures task
It is an adaptation of the Drawing Completion Test developed by Kate Franck and used by Barron (1958). On an ordinary sheet of white paper, an area of fifty-four square inches is divided into six squares, each containing a different stimulus figure. The subjects are asked to sketch some novel object or design by adding as many lines as they can to the six figures.
Picture construction task or shapes task
In this task the children are given the shape of a triangle or a jelly bean and a sheet of white paper. The children are asked to think of a picture in which the given shape is an integral part. They should paste it wherever they want on the white sheet and add lines with pencil to make any novel picture. They have to think of a name for the picture and write it at the bottom.
Circles and squares task
It was originally designed as a nonverbal test of ideational fluency and flexibility, then modified in such a way as to stress originality and elaboration. Two printed forms are used in the test. In one form, the subject is confronted with a page of forty-two circles and asked to sketch objects or pictures which have circles as a major part. In the alternate form, squares are used instead of circles.
Creative design task
Hendrickson designed this task, which seems promising, but the scoring procedures are still being tested and have not yet been perfected. The materials consist of circles and strips of various sizes and colours, a four-page booklet, scissors and glue. Subjects are instructed to construct pictures or designs, making use of all of the coloured circles and strips, within a thirty-minute time limit. Subjects may use one, two, three, or four pages; alter the circles and strips or use them as they are; and add other symbols with pencil or crayon.
References
[2] Torrance, E. P. (1980). Growing Up Creatively Gifted: The 22-Year Longitudinal Study. The Creative Child and Adult Quarterly, 3, 148-158.
[3] Torrance, E. P. (1981a). Predicting the creativity of elementary school children (1958-80) and the teacher who "made a difference." Gifted Child Quarterly, 25, 55-62.
[4] Torrance, E. P. (1981b). Empirical validation of criterion-referenced indicators of creative ability through a longitudinal study. Creative Child and Adult Quarterly, 6, 136-140.
[5] Cramond, B., Matthews-Morgan, J., Bandalos, D., & Zuo, L. (2005). A report on the 40-year follow-up of the Torrance Tests of Creative Thinking: Alive and well in the new millennium. Gifted Child Quarterly, 49, 283-291.
[6] Runco, M. A., Millar, G., Acar, S., & Cramond, B. (2011). Torrance Tests of Creative Thinking as predictors of personal and public achievement: A fifty-year follow-up. Creativity Research Journal, 22 (4), in press.
William H. Tucker
William H. Tucker is a professor of psychology at Rutgers University and the author of several books critical of race science. Tucker received his bachelor's degree from Bates College in 1967, and his master's and doctorate from Princeton University. He joined the faculty at Rutgers University in 1970 and has been there since. Tucker was a Psychometric Fellow for three years at Princeton, a position subsidized by the Educational Testing Service. The majority of Tucker's scholarship has been about psychometrics, not in it. He currently sits on the advisory board of the Institute for the Study of Academic Racism.[1] He has written critical commentaries on several hereditarian psychologists known for their controversial work on race and intelligence. He has received awards for his research on Cyril Burt and the Pioneer Fund.[2] According to his website, "My research interests concern the use, or more properly the misuse, of social science to support oppressive social policies, especially in the area of race. I seek to explore how scientists in general, and psychologists in particular, have become involved with such issues and what effect their participation has produced."
Publications
Tucker WH (1994a). Fact and Fiction in the Discovery of Sir Cyril Burt's Flaws. Journal of the History of the Behavioral Sciences, 30, 335-347.
Tucker WH (1994b). The Science and Politics of Racial Research [3]. University of Illinois Press. ISBN 0-252-02099-5.
Tucker WH (1997). Re-reconsidering Burt: Beyond a reasonable doubt. Journal of the History of the Behavioral Sciences, 33, 145-162.
Tucker WH (2002). The Funding of Scientific Racism: Wickliffe Draper and the Pioneer Fund [4]. University of Illinois Press. ISBN 0-252-02762-0.
Tucker WH (2005). The Intelligence Controversy: A Guide to the Debates. ABC-Clio, Inc. ISBN 1-85109-409-1.
Tucker WH (2009). The Cattell Controversy: Race, Science, and Ideology. University of Illinois Press.
References
[1] ISAR Advisory Council (http://web.archive.org/web/20060207194059/http://www.ferris.edu/isar/avc.htm), retrieved February 7, 2006.
[2] University of Illinois Press: "Winner of the Anisfield-Wolf Award, 1995. Winner of the Ralph J. Bunche Award, American Political Science Association, 1995. Outstanding Book from the Gustavus Myers Center for the Study of Human Rights in North America."
[3] http://www.press.uillinois.edu/s96/tucker.html
[4] http://www.press.uillinois.edu/epub/books/tucker/toc.html
External links
Bill Tucker homepage (http://crab.rutgers.edu/~btucker/home.html) via Rutgers University.
Does Science Offer Support for Racial Separation? (http://www.ferris.edu/isar/bios/cattell/HPPB/science.htm) via the Institute for the Study of Academic Racism.
Validity (statistics)
In science and statistics, validity is the extent to which a concept, conclusion or measurement is well-founded and corresponds accurately to the real world. The word "valid" is derived from the Latin validus, meaning strong. The validity of a measurement tool (for example, a test in education) is considered to be the degree to which the tool measures what it claims to measure. In psychometrics, validity has a particular application known as test validity: "the degree to which evidence and theory support the interpretations of test scores" ("as entailed by proposed uses of tests").[1]
In the area of scientific research design and experimentation, validity refers to whether a study is able to scientifically answer the questions it is intended to answer. In clinical fields, the assessment of the validity of a diagnosis and of various diagnostic tests is extremely important. Because a diagnosis shapes treatments, medications, and the patient's life, it is extremely important to know that diagnostic tests truly test what clinicians intend them to test. It is generally accepted that the concept of scientific validity addresses the nature of reality, and as such is an epistemological and philosophical issue as well as a question of measurement. The use of the term in logic is narrower, relating to the truth of inferences made from premises.
Validity is important because it can help determine what types of tests to use, and help to make sure researchers are using methods that are not only ethical and cost-effective, but that also truly measure the idea or construct in question.
Test validity
Reliability (consistency) and validity (accuracy)
Validity of an assessment is the degree to which it measures what it is supposed to measure. This is not the same as reliability, which is the extent to which a measurement gives results that are consistent. Within validity, the measurement does not always have to be similar, as it does in reliability. When a measure is both valid and reliable, the results will appear as in the figure ("Validity & Reliability"). However, just because a measure is reliable, it is not necessarily valid (and vice versa). Validity is also dependent on the measurement measuring what it was designed to measure, and not something else instead.[2] Validity (like reliability) is a matter of degree; it is not an all-or-nothing idea.
There are many different types of validity. An early definition of test validity identified it with the degree of correlation between the test and a criterion. Under this definition, one can show that the reliability of the test and of the criterion places an upper limit on the possible correlation between them (the so-called validity coefficient). Intuitively, this reflects the fact that reliability involves freedom from random error, and random errors do not correlate with one another. Thus, the less random error in the variables, the higher the possible correlation between them. Under these definitions, a test cannot have high validity unless it also has high reliability. However, the concept of validity has expanded substantially beyond this early definition, and the classical relationship between reliability and validity need not hold for alternative conceptions of reliability and validity.
Within classical test theory, predictive or concurrent validity (the correlation between the predictor and the predicted) cannot exceed the square root of the correlation between two versions of the same measure; that is, reliability limits validity.
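As a worked illustration of that ceiling (the reliability value is invented for the example):

```latex
% Reliability places an upper bound on the validity coefficient r_XY,
% where r_XX is the correlation between two versions of the same measure.
r_{XY} \le \sqrt{r_{XX}},
\qquad \text{e.g. } r_{XX} = 0.81 \;\Rightarrow\; r_{XY} \le \sqrt{0.81} = 0.90
```

So a test whose parallel forms correlate at only 0.81 cannot, under this classical account, correlate with any criterion at more than 0.90.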
Construct validity
Construct validity refers to the extent to which operationalizations of a construct (i.e., practical tests developed from a theory) actually measure what the theory says they do. For example, to what extent is an IQ questionnaire actually measuring "intelligence"? Construct validity evidence involves the empirical and theoretical support for the interpretation of the construct. Such lines of evidence include statistical analyses of the internal structure of the test, including the relationships between responses to different test items. They also include relationships between the test and measures of other constructs. As currently understood, construct validity is not distinct from the support for the substantive theory of the construct that the test is designed to measure. As such, experiments designed to reveal aspects of the causal role of the construct also contribute to construct validity evidence.
Convergent validity
Convergent validity refers to the degree to which a measure is correlated with other measures that it is theoretically predicted to correlate with.
Content validity
Content validity is a non-statistical type of validity that involves "the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured" (Anastasi & Urbina, 1997, p. 114). For example, does an IQ questionnaire have items covering all areas of intelligence discussed in the scientific literature?
Content validity evidence involves the degree to which the content of the test matches a content domain associated with the construct. For example, a test of the ability to add two numbers should include a range of combinations of digits. A test with only one-digit numbers, or only even numbers, would not have good coverage of the content domain. Content-related evidence typically involves subject matter experts (SMEs) evaluating test items against the test specifications.
A test has content validity built into it by careful selection of which items to include (Anastasi & Urbina, 1997). Items are chosen so that they comply with the test specification, which is drawn up through a thorough examination of the subject domain. Foxcroft, Paterson, le Roux & Herbst (2004, p. 49)[3] note that by using a panel of experts to review the test specifications and the selection of items, the content validity of a test can be improved. The experts will be able to review the items and comment on whether the items cover a representative sample of the behaviour domain.
Representation validity
Representation validity, also known as translation validity, is the extent to which an abstract theoretical construct can be turned into a specific practical test.
Face validity
Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee that the test actually measures phenomena in that domain. Measures may have high validity, but when the test does not appear to be measuring what it is, it has low face validity. Indeed, when a test is subject to faking (malingering), low face validity might make the test more valid. Considering one may get more honest answers with lower face validity, it is sometimes important to make it appear as though there is low face validity whilst administering the measures.
Face validity is very closely related to content validity. While content validity depends on a theoretical basis for assuming whether a test assesses all domains of a certain criterion (e.g., does assessing addition skills yield a good measure of mathematical skills? To answer this, one has to know what different kinds of arithmetic skills mathematical skills include), face validity relates only to whether a test appears to be a good measure. This judgment is made on the "face" of the test, so it can also be made by a layperson. Face validity is a starting point, but should never be assumed to establish validity for any given purpose, as the "experts" have been wrong before: the Malleus Maleficarum (Hammer of Witches) had no support for its conclusions other than the self-imagined competence of two "experts" in "witchcraft detection," yet it was used as a "test" to condemn and burn at the stake tens of thousands of women as "witches."[4]
Criterion validity
Criterion validity evidence involves the correlation between the test and a criterion variable (or variables) taken as representative of the construct. In other words, it compares the test with other measures or outcomes (the criteria) already held to be valid. For example, employee selection tests are often validated against measures of job performance (the criterion), and IQ tests are often validated against measures of academic performance (the criterion). If the test data and criterion data are collected at the same time, this is referred to as concurrent validity evidence. If the test data are collected first in order to predict criterion data collected at a later point in time, then this is referred to as predictive validity evidence.
Concurrent validity
Concurrent validity refers to the degree to which the operationalization correlates with other measures of the same construct that are measured at the same time. When the measure is compared to another measure of the same type, the two will be related (correlated). Returning to the selection test example, this would mean that the tests are administered to current employees and then correlated with their scores on performance reviews.
Predictive validity
Predictive validity refers to the degree to which the operationalization can predict (or correlate with) other measures of the same construct that are measured at some time in the future. Again, with the selection test example, this would mean that the tests are administered to applicants, all applicants are hired, their performance is reviewed at a later time, and then their scores on the two measures are correlated. Predictive validity is also valuable in practice: a measure that demonstrably forecasts future outcomes is easier to justify to stakeholders and the public.
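The operational difference between the two designs is only the timing of the criterion measure. A minimal sketch with made-up scores and hypothetical variable names:

```python
import numpy as np

# Hypothetical selection-test scores and a job-performance criterion
# collected either at the same time (concurrent design) or one year
# later (predictive design).
test_scores = np.array([52, 61, 48, 70, 66, 55, 73, 59])
perf_now    = np.array([3.1, 3.8, 2.9, 4.2, 4.0, 3.3, 4.5, 3.6])
perf_later  = np.array([3.0, 4.0, 2.7, 4.4, 3.9, 3.1, 4.6, 3.4])

# Concurrent validity: test vs. criterion measured at the same time.
concurrent_r = np.corrcoef(test_scores, perf_now)[0, 1]
# Predictive validity: test vs. criterion measured in the future.
predictive_r = np.corrcoef(test_scores, perf_later)[0, 1]

print(f"concurrent r = {concurrent_r:.2f}, predictive r = {predictive_r:.2f}")
```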
Experimental validity
The validity of the design of experimental research studies is a fundamental part of the scientific method, and a concern of research ethics. Without a valid design, valid scientific conclusions cannot be drawn.
Statistical conclusion validity
Statistical conclusion validity involves ensuring the use of adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures. Because this type of validity is concerned solely with the relationship found among variables, the relationship may be solely a correlation.
Internal validity
Internal validity is an inductive estimate of the degree to which conclusions about causal relationships can be made (e.g., cause and effect), based on the measures used, the research setting, and the whole research design. Good experimental techniques, in which the effect of an independent variable on a dependent variable is studied under highly controlled conditions, usually allow for higher degrees of internal validity than, for example, single-case designs. Eight kinds of confounding variable can interfere with internal validity (i.e., with the attempt to isolate causal relationships):
1. History: specific events occurring between the first and second measurements in addition to the experimental variables.
2. Maturation: processes within the participants as a function of the passage of time (not specific to particular events), e.g., growing older, hungrier, or more tired.
3. Testing: the effects of taking a test upon the scores of a second testing.
4. Instrumentation: changes in the calibration of a measurement tool, or changes in the observers or scorers, may produce changes in the obtained measurements.
5. Statistical regression: operating where groups have been selected on the basis of their extreme scores.
6. Selection: biases resulting from differential selection of respondents for the comparison groups.
7. Experimental mortality: differential loss of respondents from the comparison groups.
8. Selection-maturation interaction: e.g., in multiple-group quasi-experimental designs.
External validity
External validity concerns the extent to which the (internally valid) results of a study can be held to be true for other cases, for example for different people, places, or times. In other words, it is about whether findings can be validly generalized: if the same research study were conducted in those other cases, would it yield the same results? A major factor here is whether the study sample (e.g., the research participants) is representative of the general population along relevant dimensions. Other factors jeopardizing external validity are:
1. Reactive or interaction effect of testing: a pretest might increase the scores on a posttest.
2. Interaction effects of selection biases and the experimental variable.
3. Reactive effects of experimental arrangements, which would preclude generalization about the effect of the experimental variable upon persons being exposed to it in non-experimental settings.
4. Multiple-treatment interference, where effects of earlier treatments are not erasable.
Ecological validity
Ecological validity is the extent to which research results can be applied to real-life situations outside of research settings. This issue is closely related to external validity but covers the question of the degree to which experimental findings mirror what can be observed in the real world (ecology: the science of the interaction between an organism and its environment). To be ecologically valid, the methods, materials, and setting of a study must approximate the real-life situation under investigation. Ecological validity is partly related to the issue of experiment versus observation. Typically in science there are two domains of research: observational (passive) and experimental (active). The purpose of experimental designs is to test causality, so that one can infer that A causes B or that B causes A. But sometimes ethical and/or methodological restrictions prevent conducting an experiment (e.g., how does isolation influence a child's cognitive functioning?). Research can still be done in such cases, but it is correlational rather than causal: one can only conclude that A occurs together with B. Both techniques have their strengths and weaknesses.
Relationship to internal validity
At first glance, internal and external validity seem to contradict each other: to achieve an experimental design, one has to control for all interfering variables, which is why experiments are often conducted in a laboratory setting. While gaining internal validity (excluding interfering variables by keeping them constant), one loses ecological or external validity by establishing an artificial laboratory setting. With observational research, on the other hand, one cannot control for interfering variables (low internal validity) but can measure in the natural (ecological) environment, at the place where behavior normally occurs; in doing so, internal validity is sacrificed. The apparent contradiction between internal validity and external validity is, however, only superficial. The question of whether results from a particular study generalize to other people, places, or times arises only when one follows an inductivist research strategy. If the goal of a study is to deductively test a theory, one is concerned only with factors that might undermine the rigor of the study, i.e., threats to internal validity.
Diagnostic validity
In clinical fields such as medicine, the validity of a diagnosis, and of associated diagnostic tests or screening tests, may be assessed. In regard to tests, the validity issues may be examined in the same way as for psychometric tests as outlined above, but there are often particular applications and priorities. In laboratory work, the medical validity of a scientific finding has been defined as the 'degree of achieving the objective', namely answering the question which the physician asks.[6] An important requirement in clinical diagnosis and testing is sensitivity and specificity: a test needs to be sensitive enough to detect the relevant problem if it is present (and therefore avoid too many false negative results), but specific enough not to respond to other things (and therefore avoid too many false positive results).[7] In psychiatry there is a particular issue with assessing the validity of the diagnostic categories themselves. In this context:
content validity may refer to symptoms and diagnostic criteria;
concurrent validity may be defined by various correlates or markers, and perhaps also treatment response;
predictive validity may refer mainly to diagnostic stability over time;
discriminant validity may involve delimitation from other disorders.
Robins and Guze proposed in 1970 what were to become influential formal criteria for establishing the validity of psychiatric diagnoses. They listed five criteria:
1. distinct clinical description (including symptom profiles, demographic characteristics, and typical precipitants)
2. laboratory studies (including psychological tests, radiology, and postmortem findings)
3. delimitation from other disorders (by means of exclusion criteria)
4. follow-up studies showing a characteristic course (including evidence of diagnostic stability)
5. family studies showing familial clustering
These were incorporated into the Feighner Criteria and the Research Diagnostic Criteria that have since formed the basis of the DSM and ICD classification systems. Kendler in 1980 distinguished between:
1. antecedent validators (familial aggregation, premorbid personality, and precipitating factors)
2. concurrent validators (including psychological tests)
3. predictive validators (diagnostic consistency over time, rates of relapse and recovery, and response to treatment)
Nancy Andreasen (1995) listed several additional validators (molecular genetics and molecular biology, neurochemistry, neuroanatomy, neurophysiology, and cognitive neuroscience) that are all potentially capable of linking symptoms and diagnoses to their neural substrates. Kendell and Jablensky (2003) emphasized the importance of distinguishing between validity and utility, and argued that diagnostic categories defined by their syndromes should be regarded as valid only if they have been shown to be discrete entities with natural boundaries that separate them from other disorders. Kendler (2006) emphasized that to be useful, a validating criterion must be sensitive enough to validate most syndromes that are true disorders, while also being specific enough to invalidate most syndromes that are not true disorders. On this basis, he argues that the Robins and Guze criterion of "runs in the family" is inadequately specific, because most human psychological and physical traits would qualify: for example, an arbitrary syndrome comprising a mixture of "height over 6 ft, red hair, and a large nose" will be found to "run in families" and be "hereditary," but this should not be considered evidence that it is a disorder. Kendler has further suggested that "essentialist" gene models of psychiatric disorders, and the hope that we will be able to validate categorical psychiatric diagnoses by "carving nature at its joints" solely as a result of gene discovery, are implausible.[8] In the United States federal court system, the validity and reliability of evidence are evaluated using the Daubert standard: see Daubert v. Merrell Dow Pharmaceuticals. Perri and Lichtenwald (2010) provide a starting point for a discussion of a wide range of reliability and validity topics in their analysis of a wrongful murder conviction.[9]
References
[1] American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
[2] Kramer, Geoffrey P., Douglas A. Bernstein, and Vicky Phares. Introduction to Clinical Psychology. 7th ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2009. Print.
[3] Foxcroft, C., Paterson, H., le Roux, N., & Herbst, D. Human Sciences Research Council. (2004). Psychological assessment in South Africa: A needs analysis: The test use patterns and needs of psychological assessment practitioners: Final report: July. Retrieved from http://www.hsrc.ac.za/research/output/outputDocuments/1716_Foxcroft_Psychologicalassessmentin%20SA.pdf
[4] The most common estimates are between 40,000 and 60,000 deaths. Brian Levack (The Witch Hunt in Early Modern Europe) multiplied the number of known European witch trials by the average rate of conviction and execution to arrive at a figure of around 60,000 deaths. Anne Lewellyn Barstow (Witchcraze) adjusted Levack's estimate to account for lost records, estimating 100,000 deaths. Ronald Hutton (Triumph of the Moon) argues that Levack's estimate had already been adjusted for these, and revises the figure to approximately 40,000.
[5] Cozby, Paul C. Methods in Behavioral Research. 10th ed. Boston: McGraw-Hill Higher Education, 2009. Print.
External links
Cronbach, L. J.; Meehl, P. E. (1955). "Construct validity in psychological tests" (http://psychclassics.yorku.ca/Cronbach/construct.htm). Psychological Bulletin 52 (4): 281-302. doi:10.1037/h0040957 (http://dx.doi.org/10.1037/h0040957). PMID 13245896 (http://www.ncbi.nlm.nih.gov/pubmed/13245896).
Values scales
Values scales are psychological inventories used to determine the values that people endorse in their lives. They facilitate the understanding of both work and general values that individuals uphold. In addition, they assess the importance of each value in people's lives and how the individual strives toward fulfillment through work and other life roles, such as parenting.[1] Most scales have been normalized and can therefore be used cross-culturally for vocational, marketing, and counseling purposes, yielding unbiased results.[2] Values scales are used by psychologists, political scientists, economists, and others interested in defining values, determining what people value, and evaluating the ultimate function or purpose of values.[3]
Development
Values scales were first developed by an international group of psychologists whose goal was to create a unique self-report instrument that measured intrinsic and extrinsic values for use in the lab and in the clinic. The psychologists called their project the Work Importance Study (WIS). The original values scale measured the following values, listed in alphabetical order: ability utilization, achievement, advancement, aesthetics, altruism, authority, autonomy, creativity, cultural identity, economic rewards, economic security, life style, personal development, physical activity, physical prowess, prestige, risk, social interaction, social relations, variety, and working conditions. Some of the listed values were intended to be inter-related, but conceptually differentiable.[1] Since the original Work Importance Study, several scientists have supplemented the study by creating their own scales or by deriving and improving the original format. Theorists and psychologists often study values, values scales, and the field surrounding values, otherwise known as axiology.[4] New studies continue to update the work in the field: Eda Gurel-Atay published an article in the Journal of Advertising Research in March 2010, providing a glimpse into how social values changed between 1976 and 2007. The paper explained how self-respect has been on the upswing, while a sense of belonging has become less important to individuals.[5]
Contributing Scientists
Rokeach
According to Milton Rokeach, a prominent social psychologist, human values are defined as "core conceptions of the desirable within every individual and society. They serve as standards or criteria to guide not only action but also judgment, choice, attitude, evaluation, argument, exhortation, rationalization, and attribution of causality."[6] In his 1979 publication, Rokeach also stated that the consequences of human values would be manifested in all phenomena that social scientists might consider worth investigating: for any type of research to be successful, regardless of the field of study, people's underlying values needed to be understood. To allow for this, Rokeach created the Rokeach Value Survey (RVS), which has been in use for more than 30 years. It provides a theoretical perspective on the nature of values in a cognitive framework and consists of two sets of values: 18 instrumental and 18 terminal.[7] Instrumental values are beliefs or conceptions about desirable modes of behavior that are instrumental to the attainment of desirable end points, such as honesty, responsibility, and capability. Terminal values are beliefs or conceptions about ultimate goals of existence that are worth striving for, such as happiness, self-respect, and freedom.[8] The value survey asks subjects to rank the values in order of importance to them.[7] The actual directions are as follows: "Rank each value in its order of importance to you. Study the list and think of how much each value may act as a guiding principle in your life."[9] The Rokeach Value Survey has been criticized because people are often not able to rank each value clearly. Some values may be equally important, while some values may be equally unimportant, and so on. Presumably, people are more certain of their most extreme values (i.e., what they love and what they hate) and less certain of the ones in between. Further, C. J. Clawson and Donald E. Vinson (1977) showed that the Rokeach Value Survey omitted a number of values that a large portion of the population holds.[7]
Schwartz
Shalom H. Schwartz, social psychologist and author of The Structure of Human Values: Origins and Implications and the Theory of Basic Human Values, has done research on universal values and how they exist in a wide variety of contexts.[10] Most of his work addresses broad questions about values, such as: how are individuals' priorities affected by social experiences? How do individuals' priorities influence their behavior and choices? And how do value priorities influence ideologies, attitudes, and actions in political, religious, environmental, and other domains? Through his studies, Schwartz concluded that ten types of universal values exist: achievement, benevolence, conformity, hedonism, power, security, self-direction, stimulation, tradition, and universalism. Schwartz also tested the possibility of spirituality as an eleventh universal value, but found that it did not exist in all cultures.[11] Schwartz's value theory and instruments are part of the biennial European Social Survey.
Allport-Vernon-Lindzey
Gordon Allport, a student of the German philosopher and psychologist Eduard Spranger,[12] believed that an individual's philosophy is founded upon the values, or basic convictions, that he holds about what is and is not important in life.[13] Based on Spranger's (1928) view that understanding an individual's value philosophy best captures the essence of a person, Allport and his colleagues Vernon and Lindzey created the Allport-Vernon-Lindzey Study of Values. The values scale outlined six major value types: theoretical (discovery of truth), economic (what is most useful), aesthetic (form, beauty, and harmony), social (seeking love of people), political (power), and religious (unity). Forty years after the study's publication in 1931, it was the third most-cited non-projective personality measure.[4] By 1980, the values scale had fallen into disuse due to its archaic content, lack of religious inclusiveness, and dated language. Richard E. Kopelman et al. recently updated the Allport-Vernon-Lindzey Study of Values; the motivation behind their update was to make the value scale more relevant to the present day, as they believed the writing was too dated. The updated, copyrighted version was published in 2003 in the Journal of Vocational Behavior (volume 62), and permission is now required for its use.[4]
Hartman
Philosopher Robert S. Hartman, creator of the Science of Value, introduced and identified the concept of systematic values, which he believed were an important addition to the previously studied intrinsic and extrinsic values. He also made an illuminating distinction between what people value and how people value. How people value parallels very closely with systematic values, which Hartman operationally defined as conceptual constructs or cognitive scripts that exist in people's minds. Ideals, norms, standards, rules, doctrines, and logic systems are all examples of systematic values. If someone's cognitive script is repetitively about violent actions, for instance, then that person is more likely to act vengefully and less likely to value peace. With that additional idea in mind, Hartman combined intrinsic, extrinsic, and systematic concepts to create the Hartman Value Profile, also known as the Hartman Value Inventory. The profile consists of two parts. Each part contains 18 paired value-combination items, of which nine are positive and nine are negative. The three different types of values (intrinsic, extrinsic, and systematic) can be combined positively or negatively with one another in 18 logically possible ways. Depending on the combination, a certain value is either enhanced or diminished. Once the rankings are completed, the outcome is compared to the theoretical norm, generating scores for psychological interpretation.[13]
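The count of 18 follows from simple combinatorics. A minimal sketch (my own illustration, not Hartman's actual item content), assuming each item pairs a dimension of valuation with a dimension being valued:

```python
from itertools import product

# The three Hartman value dimensions.
dimensions = ("intrinsic", "extrinsic", "systematic")

# Each item applies one dimension of valuation to one dimension of
# value, either positively (enhancing) or negatively (diminishing):
# 3 x 3 x 2 = 18 logically possible combinations, nine per sign.
items = [(how, what, sign)
         for how, what in product(dimensions, repeat=2)
         for sign in ("positive", "negative")]

assert len(items) == 18
assert sum(sign == "positive" for _, _, sign in items) == 9
```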
Applications to Psychology
Research on understanding values serves as a framework for ideas in many other situations, such as counseling. Psychotherapists, behavioral scientists, and social scientists often deal with the intrinsic, extrinsic, and systematic values of their patients.[14] A primary way to learn about patients is to know what they value, as values are essential keys to personality structures. This knowledge can pinpoint serious problems in living, aid immensely in planning therapeutic regimens, and measure therapeutic progress through applications of values scales over time, especially as social environments and social norms change.[13]
References
[1] Super, Donald and Dorothy D. Nevill. Brief Description of Purpose and Nature of Test. Consulting Psychologists Press. 1989: 3-10. Print.
[2] Beatty, Sharon E., et al. Alternative Measurement Approaches to Consumer Values: The List of Values and the Rokeach Value Survey. Psychology and Marketing. 1985: 181-200. Web.
[3] Johnston, Charles S. The Rokeach Value Survey: Underlying Structure and Multidimensional Scaling. The Journal of Psychology. 1995: 583-597. Print.
[4] Kopelman, Richard E., et al. The Study of Values: Construction of the fourth edition. Journal of Vocational Behavior. 2003: 203-220. Print.
[5] Gurel-Atay, Eda. Changes in Social Values in the United States: 1976-2007. Self-Respect Is on the Upswing as a Sense of Belonging Becomes Less Important. Journal of Advertising Research. 2010: 57-67. Print.
[6] Rokeach, M. The Nature of Human Values. NY: Free Press. 1979.
[7] Beatty, Sharon E., et al. Alternative Measurement Approaches to Consumer Values: The List of Values and the Rokeach Value Survey. Psychology and Marketing. 1985: 181-200. Web.
[8] Piirto, Jane. "I Live in My Own Bubble": The Values of Talented Adolescents. The Journal of Secondary Gifted Education. 2005: 106-118. Web.
[9] Rokeach, M. The Nature of Human Values. NY: Free Press. 1979.
[10] Schwartz, S. H. "Are There Universal Aspects in the Content and Structure of Values?" Journal of Social Issues. 1994: 19-45. Print.
[11] Schwartz, Shalom H. Universals in the Content and Structure of Values: Theoretical Advances and Empirical Tests in 20 Countries. Advances in Experimental Social Psychology. 1992: 1-65. Print.
[12] Allport, G. W. Becoming: Basic Considerations for a Psychology of Personality. Yale University Press. 1955. Web.
[13] Pomeroy, Leon and Rem B. Edwards. The New Science of Axiological Psychology. New York, NY: 2005. Print.
[14] Hills, M. D. Kluckhohn and Strodtbeck's Values Orientation Theory. Online Readings in Psychology and Culture. 2002. Web.
[15] Piirto, Jane. "I Live in My Own Bubble": The Values of Talented Adolescents. The Journal of Secondary Gifted Education. 2005: 106-118. Web.
Vestibulo emotional reflex
Biomechanics
A two-month-old child begins to poise the head in a vertical position at the reflex level, at first making visible movements to do so. An adult also performs micromovements to keep the head vertically poised, because it is impossible to maintain the vertical mechanical balance of a heavy object without movement. The trajectory of 3D head movement is quite complicated[2][3] and is used in research on vestibular reflexes and in human health diagnostics, because the vestibular system is linked with the sensory system, the nervous system, and every part of the human body. Sensory systems code for four aspects of a stimulus: type, intensity, location, and duration. Certain receptors are sensitive to certain types of stimuli (for example, different mechanoreceptors respond best to different kinds of touch stimuli, like sharp or blunt objects). Receptors send impulses in certain patterns to convey information about the intensity of a stimulus (for example, how loud a sound is). The Russian neurophysiologist Nikolai Bernstein devoted most of his life to the physiology of movement. He also coined the term biomechanics, the study of movement through the application of mechanical principles. The principles of biological feedback and discrete movement discovered by Bernstein form one of the bases of the VER, and his finding that human movement is discretized in steps of about 0.1 s was confirmed by video image analysis.
Figure: VER model. The head moves slowly when a person is calm and still (white head image) and quickly and frequently when a person is active, aggressive, anxious, or nervous (red head image).
The vestibular system, as a typical sensory system, reacts to stimuli. But gravity is a constantly acting stimulus, so vertical head coordination becomes a constantly working reflex process. This is the main physiological difference between vertical head coordination and any other sensory process, which operates only intermittently. This difference makes vertical head coordination a typical continuous physiological process, like heart rate (HR) measured by ECG, blood pressure, brain activity measured by electroencephalography (EEG), or thermoregulation measured by galvanic skin response (GSR). Biological evolution used vertical head coordination for energy regulation,[4] because natural head movement is an ideal vibration movement with a high energy range. Another example of a natural vibration process used for energy regulation is the wagging of a dog's tail; humans have no tails, so head movement serves this purpose instead. Understandably, higher-frequency head movement requires more energy than low-frequency movement. On the sensory level, this means that the signals sent from the vestibular receptors to the autonomic nervous system, brain, and muscles travel with different time delays depending on the person's biochemical state. This implies a dependence between emotional state and vestibular head coordination: the vestibulo-emotional reflex.
Figure: head-movement vestibulogram signal captured by a low-noise web camera at 640x480 pixel resolution and 30 frames/s.
VER application
The VER gives functional information about a person and could be applied to medical testing, eHealth, psychology and behavior testing, lie detection, emotion control, self-regulation, fitness, and animal research; these applications are provided by different types of vibraimage systems.[5] A vibraimage system transforms biomechanical movement into emotional and physiological data about a person by video image processing. This process can be remote and hidden from the user, which is important for security applications such as aviation security.
Visual analogue scale
External links
A description and example of a VAS scale [6] VAS Generator - a free Web service to create VASs for computerized questionnaires [7]
References
[1] U.-D. Reips and F. Funke (2008). Interval level measurement with visual analogue scales in Internet-based research: VAS Generator.
[2] S. Grant, T. Aitchison, E. Henderson, J. Christie, S. Zare, J. McMurray, and H. Dargie (1999). A comparison of the reproducibility and the sensitivity to change of visual analogue scales, Borg scales, and Likert scales in normal subjects during submaximal exercise.
[3] U.-D. Reips and F. Funke (2008). Interval level measurement with visual analogue scales in Internet-based research: VAS Generator.
[4] U.-D. Reips (2006). Web-based methods. In M. Eid & E. Diener (Eds.), Handbook of multimethod measurement in psychology (pp. 73-85). Washington, DC: American Psychological Association.
[5] U.-D. Reips and F. Funke (2008). Interval level measurement with visual analogue scales in Internet-based research: VAS Generator.
[6] http://www.cebp.nl/vault_public/filesystem/?ID=1478
[7] http://vasgenerator.net
Youth Outcome Questionnaire
The subscale scores can be used to identify and target particularly problematic areas as a focus of treatment and to help with treatment planning. These questionnaires have been used in outcome studies of individual teen programs and of groups of therapeutic boarding schools and adventure therapy or wilderness therapy programs. One such study, involving 993 students from 9 schools, was presented at the 114th Annual Convention of the American Psychological Association.[4] Another study from 2001, involving 858 children and their families enrolled in a group of seven wilderness therapy programs for a full year, has been published by the University of Idaho.[6][5]
References
[1] OBHIC Research: "The Youth Outcome Questionnaire," http://www.obhic.com/research/does-wilderness-treatment-work.html. Outdoor Behavioral Healthcare Research Cooperative (OBHRC) at the University of Idaho
[2] http://www.masspartnership.com/provider/outcomesmanagement/Outcomesfiles/Tools/YOQ.pdf
[3] Ridge NW, Warren JS, Burlingame GM, Wells MG, Tumblin KM. Reliability and validity of the youth outcome questionnaire self-report. Brigham Young University (http://www.ncbi.nlm.nih.gov/pubmed/19693961)
[4] Ellen Behrens and Kristin Satterfield, Report of Findings from a Multi-Center Study of Youth Outcomes in Private Residential Treatment (http://www.strugglingteens.com/news/APAReport81206.pdf), presented at the 114th Annual Convention of the American Psychological Association, New Orleans, Louisiana, August 2006
[5] Does Wilderness Treatment Work? http://www.obhic.com/research/does-wilderness-treatment-work.htm
[6] Keith C. Russell, Ph.D., Assessment of Treatment Outcomes in Outdoor Behavioral Healthcare (http://www.cnr.uidaho.edu/wrc/Pdf/Tech_Report_27Final.pdf), University of Idaho-Wilderness Research Center
Attribute Hierarchy Method
The hierarchy contains two independent branches which share a common prerequisite attribute, A1. Aside from attribute A1, the first branch includes two additional attributes, A2 and A3, and the second branch includes a self-contained sub-hierarchy comprising attributes A4 through A9. Three independent branches compose the sub-hierarchy: attributes A4, A5, A6; attributes A4, A7, A8; and attributes A4, A9. As a prerequisite attribute, attribute A1 includes the most basic arithmetic operation skills, such as addition, subtraction, multiplication, and division of numbers. Attributes A2 and A3 both deal with factors. In attribute A2, the examinee needs to have knowledge about the properties of factors. In attribute A3, the examinee requires not only knowledge of factoring (i.e., attribute A2), but also the skills to apply the rules of factoring. Therefore, attribute A3 is considered a more advanced attribute than A2. The self-contained sub-hierarchy contains six attributes. Among these attributes, attribute A4 is the prerequisite for all other attributes in the sub-hierarchy. Attribute A4 has attribute A1 as a prerequisite because A4 not only represents basic skills in arithmetic operations (i.e., attribute A1), but also involves the substitution of values into algebraic expressions, which is more abstract and, therefore, more difficult than attribute A1. The first branch in the sub-hierarchy deals mainly with functional graph reading. For attribute A5, the examinee must be able to map the graph of a familiar function onto its corresponding function. In an item that requires attribute A5 (e.g., item 4), attribute A4 is typically required because the examinee must find random points in the graph and substitute the points into the equation of the function to find a match between the graph and the function. Attribute A6, on the other hand, deals with the abstract properties of functions, such as recognizing the graphical representation of the relationship between independent and dependent variables. The graphs of less familiar functions, such as higher-power polynomials, may be involved. Therefore, attribute A6 is considered to be more difficult than attribute A5 and is placed below attribute A5 in the sub-hierarchy. The second branch in the sub-hierarchy considers the skills associated with advanced substitution. Attribute A7 requires the examinee to substitute numbers into algebraic expressions. The complexity of attribute A7 relative to attribute A4 lies in the concurrent management of multiple pairs of numbers and multiple equations. Attribute A8 also represents the skills of advanced substitution. However, what makes attribute A8 more difficult than attribute A7 is that algebraic expressions, rather than numbers, need to be substituted into another algebraic expression. The last branch in the sub-hierarchy contains only one additional attribute, A9, related to skills associated with rule understanding and application. Here it is the rule, rather than a numeric value or an algebraic expression, that must be substituted in the item to reach a solution.
Each row and column of the A matrix represents one attribute; the first row and column represent attribute A1 and the last row and column represent attribute A9. The presence of a 1 in a particular row denotes a direct connection between that attribute and the attribute corresponding to the column position. For example, attribute A1 is directly connected to attribute A2 because of the presence of a 1 in the first row (i.e., attribute A1) and the second column (i.e., attribute A2). The positions of 0 in row 1 indicate that A1 is neither directly connected to itself nor to attributes A3 and A5 to A9. The direct and indirect relationships among attributes are specified by the binary reachability matrix (R) of order (k, k), where k is the number of attributes. To obtain the R matrix from the A matrix, Boolean addition and multiplication operations are performed on the adjacency matrix: R = (A + I)^n, where I is the identity matrix and n is the integer required to reach invariance, i.e., (A + I)^n = (A + I)^(n+1). The R matrix for the Ratio and Algebra hierarchy is shown next.
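A minimal computational sketch of this step, using a hypothetical four-attribute hierarchy rather than the nine-attribute example (Boolean powers of A + I are taken until the matrix stops changing):

```python
import numpy as np

# Hypothetical adjacency matrix: A1 -> A2, A2 -> A3, and A1 -> A4.
A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])

def boolean_matmul(X, Y):
    # Boolean matrix product: entry (i, j) is 1 if X[i, k] and Y[k, j]
    # are both 1 for some k.
    return ((X @ Y) > 0).astype(int)

def reachability(A):
    # R = (A + I)^n under Boolean arithmetic, where n is the power at
    # which the matrix becomes invariant.
    R = ((A + np.eye(len(A), dtype=int)) > 0).astype(int)
    while True:
        R_next = boolean_matmul(R, R)
        if np.array_equal(R_next, R):
            return R
        R = R_next

print(reachability(A))
# Row 1 is all 1s: attribute A1 reaches every other attribute.
```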
Similar to the A matrix, each row and column in the R matrix represents one attribute; the first row and column represent attribute A1 and the last row and column represent attribute A9. The first attribute is either directly or indirectly connected to all attributes A1 to A9. This is represented by the presence of 1s in all columns of row 1 (i.e., representing attribute A1). In the R matrix, an attribute is considered related to itself, resulting in 1s along the main diagonal. Referring back to the hierarchy, attribute A1 is directly connected to attribute A2 and indirectly connected to A3 through its connection with A2. Attribute A1 is indirectly connected to attributes A5 to A9 through its connection with A4. The potential pool of items is represented by the incidence matrix (Q) of order (k, p), where k is the number of attributes and p is the number of potential items. This pool of items represents all combinations of the attributes when the attributes are independent of each other. However, this pool can be reduced to form the reduced incidence matrix (Qr) by imposing the constraints of the attribute hierarchy as defined by the R matrix. The Qr matrix represents items that capture the dependencies among the attributes defined in the attribute hierarchy. The Qr matrix is formed using Boolean inclusion by determining which columns of the R matrix are logically included in each column of the Q matrix. The Qr matrix is of order (k, i), where k is the number of attributes and i is the reduced number of items resulting from the constraints in the hierarchy. For the Ratio and Algebra hierarchy, the Qr matrix is shown next.
The Qr matrix serves as an important test item development blueprint where items can be created to measure each specific combination of attributes. In this way, each component of the cognitive model can be evaluated systematically. In this example, a minimum of 9 items are required to measure all the attribute combinations specified in the Qr matrix.
The expected examinee response patterns can now be generated using the Qr matrix. An expected examinee is conceptualized as a hypothetical examinee who correctly answers items that require cognitive attributes the examinee has mastered. The expected response matrix (E) is created, using Boolean inclusion, by comparing each row of the attribute pattern matrix (which is the transpose of the Qr matrix) to the columns of the Qr matrix. The expected response matrix is of order (j, i), where j is the number of examinees and i is the reduced number of items resulting from the constraints imposed by the hierarchy. The E matrix for the Ratio and Algebra hierarchy is shown below. If the cognitive model is true, then 58 unique item response patterns should be produced by examinees who write these cognitively based items. A row of 0s is usually added to the E matrix to represent an examinee who has not mastered any attributes. To summarize, if the attribute pattern of the examinee contains the attributes required by the item, then the examinee is expected to answer the item correctly. However, if the examinee's attribute pattern is missing one or more of the cognitive attributes required by the item, the examinee is not expected to answer the item correctly.
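A minimal sketch of the Boolean-inclusion rule on a toy three-attribute linear hierarchy (not the nine-attribute example): an expected examinee answers an item correctly exactly when the item's required attributes are a subset of the attributes the examinee has mastered.

```python
import numpy as np

# Columns of a toy Qr: attributes required by each of three items.
Qr = np.array([[1, 1, 1],    # attribute 1 required by items 1-3
               [0, 1, 1],    # attribute 2 required by items 2-3
               [0, 0, 1]])   # attribute 3 required by item 3 only
# Examinee attribute patterns: the transpose of Qr plus a row of 0s
# for the examinee who has mastered nothing.
patterns = np.vstack([np.zeros(3, dtype=int), Qr.T])

# Boolean inclusion: examinee j answers item i correctly iff every
# attribute the item requires is mastered (required <= mastered).
E = (patterns[:, :, None] >= Qr[None, :, :]).all(axis=1).astype(int)
print(E)
# [[0 0 0]
#  [1 0 0]
#  [1 1 0]
#  [1 1 1]]
```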
Only after the examinee factors the second expression into the product of the first expression would the calculation of the value of the second expression be apparent. To answer this item correctly, the examinee should have mastered attributes A1, A2, and A3.
Psychometric Analysis
During this stage, statistical pattern recognition is used to identify the attribute combinations that the examinee is likely to possess based on the observed examinee response relative to the expected response patterns derived from the cognitive model.
The fit between an observed response pattern and the expected patterns is summarized by the Hierarchy Consistency Index (HCI):
HCI_i = 1 - [2 Σ_{j: X_ij = 1} Σ_{g ∈ S_j} X_ij (1 - X_ig)] / N_ci
where J is the total number of items, X_ij is examinee i's score (i.e., 1 or 0) on item j, S_j includes the items that require a subset of the attributes of item j, and N_ci is the total number of comparisons for items answered correctly by examinee i. The values of the HCI range from -1 to +1. Values closer to 1 indicate a good fit between the observed response pattern and the expected examinee response patterns generated from the hierarchy. Conversely, low HCI values indicate a large discrepancy between the observed examinee response patterns and the expected examinee response patterns generated from the hierarchy. HCI values above 0.70 indicate good model-data fit.
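A minimal sketch of the index for a single examinee, under the definitions above (the prerequisite sets S_j here belong to a toy three-item linear hierarchy):

```python
def hci(responses, S):
    """Hierarchy Consistency Index for one examinee.

    responses: list of 0/1 item scores.
    S:         dict mapping each item j to the set of items whose
               required attributes are a subset of item j's.
    """
    correct = [j for j, x in enumerate(responses) if x == 1]
    n_comparisons = sum(len(S[j]) for j in correct)
    if n_comparisons == 0:
        return 1.0  # no comparisons possible
    # A misfit: item j answered correctly while a logically easier
    # item g in S_j was answered incorrectly.
    misfits = sum(1 - responses[g] for j in correct for g in S[j])
    return 1 - 2 * misfits / n_comparisons

# Toy hierarchy: item 0 is prerequisite to item 1, which is
# prerequisite to item 2.
S = {0: set(), 1: {0}, 2: {0, 1}}
print(hci([1, 1, 1], S))  # 1.0: perfectly consistent with the model
print(hci([0, 0, 1], S))  # -1.0: hardest item right, prerequisites wrong
```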
Figure: a sample diagnostic score report for an examinee who mastered attributes A1, A4, A5, and A6.
Suggested Reading
Leighton, J. P., & Gierl, M. J. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge, UK: Cambridge University Press.
External links
Center for Research in Applied Measurement and Evaluation [1]
References
[1] http://www.education.ualberta.ca/educ/psych/crame/
[2] Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka's rule-space approach. Journal of Educational Measurement, 41, 205-237.
[3] Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345-354.
[4] Kuhn, D. (2001). Why development does (and does not) occur: Evidence from the domain of inductive reasoning. In J. L. McClelland & R. Siegler (Eds.), Mechanisms of cognitive development: Behavioral and neural perspectives (pp. 221-249). Hillsdale, NJ: Erlbaum.
[5] Gierl, M. J. (2007). Making diagnostic inferences about cognitive attributes using the rule-space model and attribute hierarchy method. Journal of Educational Measurement, 44, 325-340.
[6] Gierl, M. J., & Zhou, J. (2008). Computer adaptive-attribute testing: A new approach to cognitive diagnostic assessment. Special Issue of Zeitschrift für Psychologie / Journal of Psychology (Spring 2008), Adaptive Models of Psychological Testing, Wim J. van der Linden (Guest Editor).
[7] Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to make inferences about examinees' thinking processes. Educational Measurement: Issues and Practice, 26, 3-16.
[8] Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
[9] Leighton, J. P. (2004). Avoiding misconceptions, misuse, and missed opportunities: The collection of verbal reports in educational achievement testing. Educational Measurement: Issues and Practice, 23, 1-10.
[10] Gierl, M. J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees' cognitive skills in algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6). Retrieved October 24, 2008, from http://www.jtla.org.
[11] Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and Assessment, 2(3). Retrieved October 24, 2008, from http://www.jtla.org.
[12] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986b). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
[13] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986a). Learning representations by back-propagating errors. Nature, 323, 533-536.
[14] American Educational Research Association (AERA), American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, DC: AERA.
[15] Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 19-60). Cambridge, UK: Cambridge University Press.
Differential item functioning
Description
DIF refers to differences in the functioning of items across groups, often demographic, which are matched on the latent trait or, more generally, the attribute being measured by the items or test.[3][4] It is important to note that when examining items for DIF, the groups must be matched on the measured attribute; otherwise this may result in inaccurate detection of DIF. In order to create a general understanding of DIF or measurement bias, consider the following example offered by Osterlind and Everson (2009).[5] In this case, Y refers to a response to a particular test item which is determined by the latent construct[6] being measured. The latent construct of interest is referred to as theta (θ), where Y is an indicator of θ, which can be arranged in terms of the probability distribution of Y on θ by the expression f(Y|θ). Therefore, response Y is conditional on the latent trait (θ). Because DIF examines differences in the conditional probabilities of Y between groups, let us label the groups as the reference and focal groups. Although the designation does not matter, a typical practice in the literature is to designate the reference group as the group suspected to have an advantage, while the focal group refers to the group anticipated to be disadvantaged by the test.[3] Therefore, given the functional relationship f(Y|θ), and under the assumption that there are identical measurement error distributions for the reference and focal groups, it can be concluded that under the null hypothesis:
f(Y = 1 | θ, G = r) = f(Y = 1 | θ, G = f)
with G corresponding to the grouping variable, "r" the reference group, and "f" the focal group. This equation represents an instance where DIF is not present. In this case, the absence of DIF is determined by the fact that the conditional probability distribution of Y is not dependent on group membership. To illustrate, consider an item with response options 0 and 1, where Y = 0 indicates an incorrect response and Y = 1 indicates a correct response. The probability of correctly responding to the item is the same for members of either group. This indicates that there is no DIF or item bias, because members of the reference and focal group with the same underlying ability or attribute have the same probability of responding correctly. Therefore, there is no bias or disadvantage for one group over the other.
Consider the instance where the conditional probability of Y is not the same for the reference and focal groups. In other words, members of different groups with the same trait or ability level have unequal probability distributions on Y. Once θ is controlled for, there is a clear dependency between group membership and performance on an item. For dichotomous items, this suggests that when the focal and reference groups are at the same location on θ, there is a different probability of getting a correct response or endorsing the item. Therefore, the group with the higher conditional probability of correctly responding to an item is the group advantaged by the test item. This suggests that the test item is biased and functions differently for the groups, and therefore exhibits DIF. It is important to draw the distinction between DIF or measurement bias and ordinary group differences. Whereas group differences indicate differing score distributions on Y, DIF explicitly involves conditioning on θ. For instance, consider the following equation:
p(Y = 1 | G = g) ≠ p(Y = 1)
This indicates that an examinee's score is conditional on grouping, such that having information about group membership changes the probability of a correct response. Therefore, if the groups differ on θ, and performance depends on θ, then the above equation would suggest item bias even in the absence of DIF. For this reason, it is generally agreed in the measurement literature that differences on Y conditional on group membership alone are inadequate for establishing bias.[7][8][9] In fact, differences on θ or ability are common between groups and establish the basis for much research. Remember that to establish bias or DIF, groups must be matched on θ and then demonstrate differential probabilities on Y as a function of group membership.
Forms of DIF
Uniform DIF is the simplest type of DIF, where the magnitude of the conditional dependency is relatively invariant across the latent trait continuum (θ). The item of interest consistently gives one group an advantage across all levels of ability θ.[10] Within an item response theory (IRT) framework, this would be evidenced when both item characteristic curves (ICCs) are equally discriminating yet exhibit differences in the difficulty parameters (i.e., a_r = a_f and b_r < b_f), as depicted in Figure 1.[11] However, nonuniform DIF presents an interesting case. Rather than a consistent advantage being given to the reference group across the ability continuum, the conditional dependency moves and changes direction at different locations on the θ continuum.[12] For instance, an item may give the reference group a minor advantage at the lower end of the continuum and a major advantage at the higher end. Also, unlike uniform DIF, an item can simultaneously vary in discrimination for the two groups while also varying in difficulty (i.e., a_r ≠ a_f and b_r < b_f). Even more complex is crossing nonuniform DIF. As demonstrated in Figure 2, this occurs when an item gives an advantage to the reference group at one end of the θ continuum while favoring the focal group at the other end. Differences in ICCs indicate that examinees from the two groups with identical ability levels have unequal probabilities of correctly responding to an item. When the curves are different but do not intersect, this is evidence of uniform DIF. However, if the ICCs cross at any point along the θ scale, there is evidence of nonuniform DIF.
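A minimal sketch of the two cases using the standard logistic item response function (the parameter values are illustrative only):

```python
import numpy as np

def icc(theta, a, b, c=0.0):
    # 3PL item characteristic curve; with c = 0 it reduces to the 2PL.
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Uniform DIF: equal discrimination, the focal group finds the item
# harder (a_r = a_f, b_r < b_f); the gap keeps one sign everywhere.
uniform_gap = icc(theta, a=1.2, b=-0.5) - icc(theta, a=1.2, b=0.5)

# Crossing nonuniform DIF: discriminations differ (a_r != a_f), so the
# curves intersect and the advantaged group changes along theta.
crossing_gap = icc(theta, a=1.8, b=0.0) - icc(theta, a=0.6, b=0.0)

print(np.round(uniform_gap, 2))   # same sign at every theta
print(np.round(crossing_gap, 2))  # sign changes across theta
```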
Odds Ratio
The next step in the calculation of the MH statistic is to use data from the contingency table to obtain an odds ratio for the two groups on the item of interest at a particular k interval. This is expressed in terms of p and q, where p represents the proportion correct and q the proportion incorrect for both the reference (R) and focal (F) groups. For the MH procedure, the obtained odds ratio is represented by α, with possible values ranging from 0 to ∞. A value of 1.0 indicates an absence of DIF and thus similar performance by both groups. Values greater than 1.0 suggest that the reference group outperformed, or found the item less difficult than, the focal group. On the other hand, if the obtained value is less than 1.0, this is an indication that the item was less difficult for the focal group.[8] Using variables from the contingency table above, the calculation is as follows:
α = (p_Rk / q_Rk) / (p_Fk / q_Fk)
  = [(A_k / (A_k + B_k)) / (B_k / (A_k + B_k))] / [(C_k / (C_k + D_k)) / (D_k / (C_k + D_k))]
  = (A_k / B_k) / (C_k / D_k)
  = (A_k D_k) / (B_k C_k)
The above computation pertains to an individual item at a single ability interval. The population estimate can be extended to reflect a common odds ratio across all ability intervals k for a specific item. The common odds ratio estimator is denoted α_MH and can be computed by the following equation:
α_MH = Σ_k (A_k D_k / N_k) / Σ_k (B_k C_k / N_k)
where N_k represents the total sample size at the kth interval.
The obtained α_MH is often standardized through a log transformation, centering the value around 0.[16] The transformed estimator, MH D-DIF, is computed as follows:
MH D-DIF = -2.35 ln(α_MH)
Thus an obtained value of 0 would indicate no DIF. In examining the equation, it is important to note that the minus sign changes the interpretation of values less than or greater than 0: values less than 0 indicate a reference group advantage, whereas values greater than 0 indicate an advantage for the focal group.
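A minimal sketch of the whole computation from stratified 2 x 2 counts (the counts are invented for illustration):

```python
import numpy as np

def mantel_haenszel(tables):
    """Common MH odds ratio and MH D-DIF from per-stratum counts.

    tables: list of (A, B, C, D) per ability interval k, where A/B are
    reference-group correct/incorrect counts and C/D are focal-group
    correct/incorrect counts.
    """
    num = sum(A * D / (A + B + C + D) for A, B, C, D in tables)
    den = sum(B * C / (A + B + C + D) for A, B, C, D in tables)
    alpha_mh = num / den
    mh_d_dif = -2.35 * np.log(alpha_mh)  # log-transformed, centered at 0
    return alpha_mh, mh_d_dif

# Hypothetical counts for three score strata.
strata = [(40, 10, 30, 20), (25, 25, 20, 30), (10, 40, 5, 45)]
alpha, delta = mantel_haenszel(strata)
print(f"alpha_MH = {alpha:.2f}, MH D-DIF = {delta:.2f}")
# alpha_MH = 2.00 favors the reference group, so MH D-DIF is negative.
```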
Before presenting statistical procedures for testing differences in item parameters, it is important first to provide a general understanding of the different parameter estimation models and their associated parameters. These include the one-, two-, and three-parameter logistic (PL) models. All of these models assume a single underlying latent trait or ability. All three models have an item difficulty parameter, denoted b. For the 1PL and 2PL models, the b parameter corresponds to the inflection point on the ability scale, as mentioned above. In the case of the 3PL model, the inflection point corresponds to a probability of (1 + c)/2, where c is a lower asymptote (discussed below). Difficulty values can, in theory, range from -∞ to +∞; in practice, however, they rarely exceed ±3. Higher values are indicative of harder test items; items exhibiting low b parameters are easy test items.[22] Another parameter that is estimated is a discrimination parameter, designated a. This parameter pertains to an item's ability to discriminate among individuals. The a parameter is estimated in the 2PL and 3PL models. In the case of the 1PL model, this parameter is constrained to be equal between groups. In relation to ICCs, the a parameter is the slope at the inflection point; as mentioned earlier, the slope is maximal at the inflection point. The a parameter, similar to the b parameter, can range from -∞ to +∞; however, typical values are less than 2. Higher values indicate greater discrimination between individuals.[23] The 3PL model has an additional parameter, referred to as a guessing or pseudochance parameter and denoted c. This corresponds to a lower asymptote which allows for the possibility that an individual may answer a moderate or difficult item correctly even if low in ability. Values for c range between 0 and 1, but typically fall below 0.3.[24]
When applying statistical procedures to assess for DIF, the a and b parameters (discrimination and difficulty) are of particular interest. Suppose, however, that a 1PL model was used, where the a parameters are constrained to be equal for both groups, leaving only the estimation of the b parameters. After examining the ICCs, there is an apparent difference in b parameters for the two groups. Using a method similar to a Student's t-test, the next step is to determine whether the difference in difficulty is statistically significant. Under the null hypothesis
H0: b_r = b_f
Lord (1980) provides an easily computed and normally distributed test statistic:
d = (b_r - b_f) / SE(b_r - b_f)
The standard error of the difference between b parameters is calculated by
SE(b_r - b_f) = sqrt([SE(b_r)]^2 + [SE(b_f)]^2)
Wald Statistic
More often than not, however, a 2PL or 3PL model is more appropriate than fitting a 1PL model to the data, and thus both the a and b parameters should be tested for DIF. Lord (1980) proposed another method for testing differences in both the a and b parameters, where the c parameters are constrained to be equal across groups. This test yields a Wald statistic which follows a chi-square distribution. In this case, the null hypothesis being tested is H0: a_r = a_f and b_r = b_f.
First, a 2 x 2 covariance matrix of the parameter estimates is calculated for each group; these are represented by S_r and S_f for the reference and focal groups. The covariance matrices are computed by inverting the obtained information matrices. Next, the differences between the estimated parameters are put into a 2 x 1 vector, denoted by
V' = (a_r - a_f, b_r - b_f)
Next, the covariance matrix S is estimated by summing S_r and S_f. Using this information, the Wald statistic is computed as follows:
χ2 = V'S^(-1)V
which is evaluated at 2 degrees of freedom.
Likelihood-Ratio Test
The likelihood-ratio test is another IRT-based method for assessing DIF. This procedure involves comparing the ratio of two models. Under model Mc, item parameters are constrained to be equal or invariant between the reference and focal groups. Under model Mv, item parameters are free to vary.[25] The likelihood function under Mc is denoted Lc, while the likelihood function under Mv is designated Lv. The items constrained to be equal serve as anchor items for this procedure, while items suspected of DIF are allowed to vary freely. By using anchor items and allowing the remaining item parameters to vary, multiple items can be simultaneously assessed for DIF.[26] However, if the likelihood ratio indicates potential DIF, an item-by-item analysis would be appropriate to determine which items, if not all, contain DIF. The likelihood ratio of the two models is computed by
G2 = 2 ln[Lv / Lc]
Alternatively, the ratio can be expressed by
G2 = -2 ln[Lc / Lv]
where Lv and Lc are inverted and then multiplied by -2 ln. G2 approximately follows a chi-square distribution, especially with larger samples. Therefore, it is evaluated by the degrees of freedom that correspond to the number of constraints necessary to derive the constrained model from the freely varying model.[27] For instance, if a 2PL model is used and both the a and b parameters are free to vary under Mv and these same two parameters are constrained under Mc, then the ratio is evaluated at 2 degrees of freedom.
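A minimal numerical sketch of the Wald computation, with invented parameter estimates and covariance matrices standing in for the output of an IRT calibration:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2PL estimates (a, b) for the reference and focal groups.
est_r = np.array([1.10, -0.20])
est_f = np.array([0.85, 0.35])
# Hypothetical covariance matrices of the estimates, i.e. the inverted
# information matrices S_r and S_f described above.
S_r = np.array([[0.010, 0.002],
                [0.002, 0.008]])
S_f = np.array([[0.012, 0.001],
                [0.001, 0.009]])

V = est_r - est_f                  # (a_r - a_f, b_r - b_f)
S = S_r + S_f                      # covariance of the difference
wald = V @ np.linalg.inv(S) @ V    # chi-square with 2 df

print(f"Wald chi2 = {wald:.2f}, p = {chi2.sf(wald, df=2):.4f}")
```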
Logistic Regression
Logistic regression approaches to DIF detection involve running a separate analysis for each item. The independent variables included in the analysis are group membership, an ability-matching variable (typically a total score), and an interaction term between the two. The dependent variable of interest is the probability, or likelihood, of getting a correct response or endorsing an item. Because the outcome of interest is expressed in terms of probabilities, maximum likelihood estimation is the appropriate procedure.[28] This set of variables can then be expressed by the following regression equation:
$$Y = \beta_0 + \beta_1 M + \beta_2 G + \beta_3 MG$$

where $\beta_0$ corresponds to the intercept, or the probability of a response when M and G are equal to 0, and the remaining $\beta$s are weight coefficients for each independent variable. The first independent variable, M, is the matching variable used to link individuals on ability, in this case a total test score, similar to that employed by the Mantel-Haenszel procedure. The group-membership variable is denoted G and, in the case of regression, is represented through dummy-coded variables. The final term, MG, corresponds to the interaction between the two variables just mentioned.

For this procedure, variables are entered hierarchically. Following the structure of the regression equation above, variables are entered in the following sequence: matching variable M, grouping variable G, and the interaction variable MG. Determination of DIF is made by evaluating the obtained chi-square statistic with 2 degrees of freedom; the significance of the parameter estimates is tested as well. From the results of the logistic regression, DIF is indicated if individuals matched on ability have significantly different probabilities of responding to an item, and thus different logistic regression curves. Conversely, if the curves for both groups are the same, the item is unbiased and DIF is not present. In terms of uniform and nonuniform DIF: if the intercept and matching-variable parameters for the two groups are not equal, there is evidence of uniform DIF; a nonzero interaction parameter indicates nonuniform DIF.[29]
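A sketch of the hierarchical procedure using the statsmodels library on simulated data follows; the column names (score, group, y) and the built-in uniform DIF are illustrative assumptions, not part of the source:

```python
# Sketch of logistic-regression DIF detection via a 2-df likelihood-ratio test.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"score": rng.integers(0, 21, n),    # matching total score
                   "group": rng.integers(0, 2, n)})    # dummy-coded membership
logit = -2 + 0.2 * df["score"] + 0.5 * df["group"]     # simulated uniform DIF
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # item response

def fit(cols):
    X = sm.add_constant(df[cols])
    return sm.Logit(df["y"], X).fit(disp=0)

m1 = fit(["score"])                         # step 1: matching variable only
df["mg"] = df["score"] * df["group"]
m3 = fit(["score", "group", "mg"])          # steps 2-3: add group and interaction

g2 = 2 * (m3.llf - m1.llf)                  # chi-square with 2 df
print(round(g2, 2), round(chi2.sf(g2, df=2), 4))
print(m3.params)   # a nonzero interaction term would indicate nonuniform DIF
```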
DIF Considerations
Sample Size
The first consideration pertains to sample size, specifically with regard to the reference and focal groups. Prior to any analyses, information about the number of people in each group is typically known, such as the number of males/females or members of ethnic/racial groups. The issue, however, revolves around whether the number of people per group is sufficient for enough statistical power to identify DIF. In some instances, such as ethnicity, group sizes may be quite unequal, with Whites representing a far larger sample than each individual ethnic group. In such instances, it may be appropriate to modify or recode the data so that the groups being compared for DIF are equal or closer in size. Recoding is a common practice employed to adjust for disparities in the size of the reference and focal groups: for example, all non-White ethnic groups can be grouped together to obtain relatively equal reference- and focal-group sample sizes, allowing a "majority/minority" comparison of item functioning. If such modifications are not made and DIF procedures are carried out, there may not be enough statistical power to identify DIF even where it exists.

Another issue related to sample size concerns the statistical procedure being used to detect DIF. Aside from the sizes of the reference and focal groups, certain characteristics of the sample itself must meet the assumptions of each statistical test used in DIF detection. For instance, IRT approaches may require larger samples than the Mantel-Haenszel procedure; this is important, as investigation of group size may direct one toward using one procedure over another. Within the logistic regression approach, leverage values and outliers are of particular concern and must be examined prior to DIF detection. Additionally, as with all analyses, statistical test assumptions must be met; some procedures are robust to minor violations, while others are less so. Thus, the distributional nature of sample responses should be investigated before implementing any DIF procedure.
Items
The number of items being used for DIF detection must also be considered. No standard exists for how many items should be used, as this varies from study to study. In some cases it may be appropriate to test all items for DIF, whereas in others it may not be necessary. If only certain items are suspected of DIF, with adequate reasoning, it may be more appropriate to test those items rather than the entire set. Oftentimes, however, it is difficult to simply assume which items may be problematic. For this reason, it is often recommended to examine all test items for DIF simultaneously; this provides information about all items, shedding light on problematic items as well as those that function similarly for the reference and focal groups.

With regard to statistical tests, some procedures, such as the IRT likelihood-ratio test, require the use of anchor items: some items are constrained to be equal across groups, while items suspected of DIF are allowed to vary freely. In this instance, only a subset would be identified as DIF items while the rest serve as a comparison group for DIF detection. Once DIF items are identified, the anchor items can themselves be analyzed by constraining the original DIF items and allowing the original anchor items to vary freely. Testing all items simultaneously may thus be a more efficient procedure, although, as noted, different procedures use different methods for selecting DIF items.

Aside from the number of items used in DIF detection, the number of items on the entire test or measure is also important. The typical recommendation, as noted by Zumbo (1999), is a minimum of 20 items. The reasoning relates directly to the formation of the matching criterion. As noted in earlier sections, a total test score is typically used to match individuals on ability. The total test score is typically divided into 3-5 ability levels (k), which are then used to match individuals on ability prior to DIF analysis (see the sketch below). Using a minimum of 20 items allows for greater variance in the score distribution, which results in more meaningful ability-level groups. Although the psychometric properties of the instrument should have been assessed prior to its use, it is important that the validity and reliability of the instrument be adequate: test items need to accurately tap the construct of interest in order to derive meaningful ability-level groups. Of course, one does not want to inflate reliability coefficients simply by adding redundant items; the key is a valid and reliable measure with sufficient items to develop meaningful matching groups. Gadermann et al. (2012),[30] Revelle and Zinbarg (2009),[31] and John and Soto (2007)[32] offer more information on modern approaches to structural validation and on more precise and appropriate methods for assessing reliability.
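As a sketch of the matching step, the following assumes a hypothetical vector of total scores from a 20-item test and bins it into k ability levels:

```python
# Sketch: form k ability-level matching groups from total scores (hypothetical data).
import numpy as np
import pandas as pd

total_scores = pd.Series(np.random.default_rng(1).integers(0, 21, size=300))
k = 4                                                     # 3-5 levels is typical
levels = pd.qcut(total_scores, q=k, labels=False, duplicates="drop")
print(pd.Series(levels).value_counts().sort_index())      # examinees per level
```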
Statistical Software
Below are common statistical programs capable of performing the procedures discussed herein. A comprehensive list of open-source, public-domain, freeware, and proprietary statistical software can be found under List of statistical packages.

Mantel-Haenszel procedure: SPSS, SAS, Stata, R, Systat

IRT-based procedures: BILOG-MG, MULTILOG, PARSCALE, TESTFACT, EQSIRT, R (e.g., the 'mirt' package), IRTPRO

Logistic regression: SPSS, SAS, Stata, R, Systat
References
[1] Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum.
[2] Zumbo, B. D. (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223-233.
[3] Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 220-256). Westport, CT: American Council on Education.
[4] Holland, P. W., & Wainer, H. (1993). Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum.
[5] Osterlind, S. J., & Everson, H. T. (2009). Differential Item Functioning. Thousand Oaks, CA: Sage Publishing.
[7] Ackerman, T. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 674-691.
[8] Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum.
[9] Millsap, R. E., & Everson, H. T. (1993). Methodological review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297-334.
[10] Walker, C. (2011). What's the DIF? Why differential item functioning analyses are an important part of instrument development and validation. Journal of Psychoeducational Assessment, 29, 364-376.
[11] Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.
[12] Walker, C. M., Beretvas, S. N., & Ackerman, T. A. (2001). An examination of conditioning variables used in computer adaptive testing for DIF. Applied Measurement in Education, 14, 3-16.
[13] Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.
[14] Marascuilo, L. A., & Slaughter, R. E. (1981). Statistical procedures for identifying possible sources of item bias based on χ² statistics. Journal of Educational Measurement, 18, 229-248.
[15] Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test Validity (pp. 129-145). Hillsdale, NJ: Erlbaum.
Psychometrics
Psychometrics is the field of study concerned with the theory and technique of psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits, as well as educational measurement. The field is primarily concerned with the construction and validation of measurement instruments such as questionnaires, tests, and personality assessments. It involves two major research tasks, namely: (i) the construction of instruments and procedures for measurement; and (ii) the development and refinement of theoretical approaches to measurement. Those who practice psychometrics are known as psychometricians. Psychometricians typically possess a specific qualification in measurement; while many are clinical psychologists, others work as human resources or learning and development professionals.
20th century
The psychometrician L. L. Thurstone, founder and first president of the Psychometric Society in 1936, developed and applied a theoretical approach to measurement referred to as the law of comparative judgment, an approach with close connections to the psychophysical theory of Ernst Heinrich Weber and Gustav Fechner. In addition, Spearman and Thurstone both made important contributions to the theory and application of factor analysis, a statistical method developed and used extensively in psychometrics.[citation needed] In the late 1950s, Leopold Szondi made an historical and epistemological assessment of the impact of statistical thinking on psychology during the preceding decades: "in the last decades, the specifically psychological thinking has been almost completely suppressed and removed, and replaced by a statistical thinking. Precisely here we see the cancer of testology and testomania of today."[2]
More recently, psychometric theory has been applied in the measurement of personality, attitudes, beliefs, and academic achievement. Measurement of these unobservable phenomena is difficult, and much of the research and accumulated science in this discipline has been developed in an attempt to properly define and quantify such phenomena. Critics, including practitioners in the physical sciences and social activists, have argued that such definition and quantification is impossibly difficult, and that such measurements are often misused, as with psychometric personality tests used in employment procedures: "For example, an employer wanting someone for a role requiring consistent attention to repetitive detail will probably not want to give that job to someone who is very creative and gets bored easily."[3]

Figures who made significant contributions to psychometrics include Karl Pearson, Henry F. Kaiser, Carl Brigham, L. L. Thurstone, Georg Rasch, Eugene Galanter, Johnson O'Connor, Frederic M. Lord, Ledyard R Tucker, Arthur Jensen, and David Andrich.
Theoretical approaches
Psychometricians have developed a number of different measurement theories. These include classical test theory (CTT) and item response theory (IRT).[4][5] An approach that is mathematically similar to IRT, but quite distinctive in terms of its origins and features, is the Rasch model for measurement. The development of the Rasch model, and the broader class of models to which it belongs, was explicitly founded on requirements of measurement in the physical sciences.[6]

Psychometricians have also developed methods for working with large matrices of correlations and covariances. Techniques in this general tradition include factor analysis,[7] a method of determining the underlying dimensions of data; multidimensional scaling,[8] a method for finding a simple representation for data with a large number of latent dimensions; and data clustering, an approach to finding objects that are like each other. All of these multivariate descriptive methods try to distill large amounts of data into simpler structures. More recently, structural equation modeling[9] and path analysis represent more sophisticated approaches to working with large covariance matrices. These methods allow statistically sophisticated models to be fitted to data and tested to determine whether they are adequate fits.

One of the main deficiencies of various factor analyses is a lack of consensus on cutting points for determining the number of latent factors. A usual procedure is to stop factoring when eigenvalues drop below one, on the grounds that a retained factor should account for at least as much variance as a single original variable. The lack of agreed cutting points concerns other multivariate methods as well.[citation needed]
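A minimal sketch of the eigenvalues-above-one stopping rule, applied to a small hypothetical correlation matrix:

```python
# Sketch: count factors with eigenvalues above one (hypothetical correlation matrix).
import numpy as np

R = np.array([[1.0, 0.6, 0.5, 0.1],
              [0.6, 1.0, 0.4, 0.2],
              [0.5, 0.4, 1.0, 0.1],
              [0.1, 0.2, 0.1, 1.0]])

eigenvalues = np.linalg.eigvalsh(R)[::-1]     # descending order
print(np.round(eigenvalues, 3))
print("retained factors:", int((eigenvalues > 1).sum()))
```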
Key concepts
Key concepts in classical test theory are reliability and validity. A reliable measure is one that measures a construct consistently across time, individuals, and situations. A valid measure is one that measures what it is intended to measure. Reliability is necessary, but not sufficient, for validity. Both reliability and validity can be assessed statistically.

Consistency over repeated measures of the same test can be assessed with the Pearson correlation coefficient and is often called test-retest reliability.[10] Similarly, the equivalence of different versions of the same measure can be indexed by a Pearson correlation and is called equivalent-forms reliability or a similar term.[10] Internal consistency, which addresses the homogeneity of a single test form, may be assessed by correlating performance on two halves of a test, termed split-half reliability; the value of this Pearson product-moment correlation coefficient for two half-tests is adjusted with the Spearman-Brown prediction formula to correspond to the correlation between two full-length tests.[10] Perhaps the most commonly used index of reliability is Cronbach's α, which is equivalent to the mean of all possible split-half coefficients. Other approaches include the intra-class correlation, the ratio of the variance of measurements of a given target to the variance of all targets.

There are a number of different forms of validity. Criterion-related validity can be assessed by correlating a measure with a criterion measure known to be valid. When the criterion measure is collected at the same time as the measure being validated, the goal is to establish concurrent validity; when the criterion is collected later, the goal is to establish predictive validity. A measure has construct validity if it is related to measures of other constructs as required by theory. Content validity is a demonstration that the items of a test are drawn from the domain being measured; in a personnel selection example, test content is based on a defined statement, or set of statements, of knowledge, skill, ability, or other characteristics obtained from a job analysis.

Item response theory models the relationship between latent traits and responses to test items. Among other advantages, IRT provides a basis for obtaining an estimate of the location of a test-taker on a given latent trait as well as the standard error of measurement of that location. For example, a university student's knowledge of history can be deduced from his or her score on a university test and then compared reliably with a high school student's knowledge deduced from a less difficult test. Scores derived by classical test theory do not have this characteristic; assessment of actual ability (rather than ability relative to other test-takers) must instead be carried out by comparing scores to those of a "norm group" randomly selected from the population. In fact, all measures derived from classical test theory are dependent on the sample tested, while, in principle, those derived from item response theory are not.
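As an illustration of the reliability indices described above, the following sketch computes Cronbach's α and a Spearman-Brown-corrected split-half coefficient on a simulated response matrix (all data hypothetical):

```python
# Sketch: Cronbach's alpha and Spearman-Brown split-half reliability.
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(size=(200, 1))                       # simulated examinees
items = (ability + rng.normal(size=(200, 8)) > 0).astype(float)  # 8 dichotomous items

k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                         / items.sum(axis=1).var(ddof=1))

# Odd/even split-half correlation, stepped up with the Spearman-Brown formula.
half1, half2 = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half1, half2)[0, 1]
split_half = 2 * r_half / (1 + r_half)
print(round(alpha, 3), round(split_half, 3))
```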
Standards of quality
Considerations of validity and reliability are typically viewed as essential elements for determining the quality of any test. However, professional and practitioner associations frequently place these concerns within broader contexts when developing standards and making overall judgments about the quality of any test as a whole within a given context. A further concern in many applied research settings is whether the metric of a given psychological inventory is meaningful or arbitrary.[11]
Testing standards
In this field, the Standards for Educational and Psychological Testing[12] places standards for validity and reliability, along with errors of measurement and related considerations, under the general topic of test construction, evaluation, and documentation. The second major topic covers standards related to fairness in testing, including fairness in testing and test use, the rights and responsibilities of test takers, testing individuals of diverse linguistic backgrounds, and testing individuals with disabilities. The third and final major topic covers standards related to testing applications, including the responsibilities of test users, psychological testing and assessment, educational testing and assessment, testing in employment and credentialing, and testing in program evaluation and public policy.
Evaluation standards
In the field of evaluation, and in particular educational evaluation, the Joint Committee on Standards for Educational Evaluation[13] has published three sets of standards for evaluations. The Personnel Evaluation Standards[14] was published in 1988, The Program Evaluation Standards (2nd edition)[15] was published in 1994, and The Student Evaluation Standards[16] was published in 2003. Each publication presents and elaborates a set of standards for use in a variety of educational settings. The standards provide guidelines for designing, implementing, assessing and improving the identified form of evaluation. Each of the standards has been placed in one of four fundamental categories to promote educational evaluations that are proper, useful, feasible, and accurate. In these sets of standards, validity and reliability considerations are covered under the accuracy topic. For example, the student accuracy standards help ensure that student evaluations will provide sound, accurate, and credible information about student learning and performance.
References
Bibliography
Andrich, D., & Luo, G. (1993). "A hyperbolic cosine model for unfolding dichotomous single-stimulus responses". Applied Psychological Measurement 17 (3): 253-276. doi:10.1177/014662169301700307 [17].
Michell, J. (1997). "Quantitative science and the definition of measurement in psychology". British Journal of Psychology 88 (3): 355-383. doi:10.1111/j.2044-8295.1997.tb02641.x [18].
Michell, J. (1999). Measurement in Psychology. Cambridge: Cambridge University Press.
Rasch, G. (1960/1980). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research; expanded edition (1980) with foreword and afterword by B. D. Wright. Chicago: The University of Chicago Press.
Reese, T. W. (1943). "The application of the theory of physical measurement to the measurement of psychological magnitudes, with three experimental examples". Psychological Monographs 55: 1-89.
Stevens, S. S. (1946). "On the theory of scales of measurement". Science 103 (2684): 677-680. doi:10.1126/science.103.2684.677 [59]. PMID 17750512 [60].
Thurstone, L. L. (1927). "A law of comparative judgment". Psychological Review 34 (4): 278-286. doi:10.1037/h0070288 [61].
Thurstone, L. L. (1929). "The measurement of psychological value". In T. V. Smith and W. K. Wright (Eds.), Essays in Philosophy by Seventeen Doctors of Philosophy of the University of Chicago. Chicago: Open Court.
Thurstone, L. L. (1959). The Measurement of Values. Chicago: The University of Chicago Press.
Psychometric Assessments (http://www.services.unimelb.edu.au/careers/student/interviews/test.html). University of Melbourne.
Blinkhorn, S. F. (1997). "Past imperfect, future conditional: fifty years of test theory". British Journal of Mathematical and Statistical Psychology 50 (2): 175-185. doi:10.1111/j.2044-8317.1997.tb01139.x [19].
Notes
[1] Kaplan, R. M., & Saccuzzo, D. P. (2010). Psychological Testing: Principles, Applications, and Issues (8th ed.). Belmont, CA: Wadsworth, Cengage Learning.
[2] Leopold Szondi (1960). Das zweite Buch: Lehrbuch der Experimentellen Triebdiagnostik. Huber, Bern und Stuttgart, 2nd edition. Ch. 27; from the Spanish translation, B) II Las condiciones estadisticas, p. 396.
[3] Psychometric Assessments (http://www.services.unimelb.edu.au/careers/student/interviews/test.html). University of Melbourne.
[4] Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Erlbaum.
[5] Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff.
[6] Rasch, G. (1960/1980). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research; expanded edition (1980) with foreword and afterword by B. D. Wright. Chicago: The University of Chicago Press.
[7] Thompson, B. R. (2004). Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications. American Psychological Association.
[8] Davison, M. L. (1992). Multidimensional Scaling. Krieger.
[9] Kaplan, D. (2008). Structural Equation Modeling: Foundations and Extensions (2nd ed.). Sage.
[10] Reliability definitions at the University of Connecticut (http://www.gifted.uconn.edu/Siegle/research/Instrument Reliability and Validity/Reliability.htm)
[11] Blanton, H., & Jaccard, J. (2006). Arbitrary metrics in psychology (http://psychology.tamu.edu/Faculty/blanton/bj.2006.arbitrary.pdf). American Psychologist, 61(1), 27-41.
[12] The Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html#overview)
[13] Joint Committee on Standards for Educational Evaluation (http://www.wmich.edu/evalctr/jc/)
[14] Joint Committee on Standards for Educational Evaluation. (1988). The Personnel Evaluation Standards: How to Assess Systems for Evaluating Educators (http://www.wmich.edu/evalctr/jc/PERSTNDS-SUM.htm). Newbury Park, CA: Sage Publications.
[15] Joint Committee on Standards for Educational Evaluation. (1994). The Program Evaluation Standards (2nd ed.) (http://www.wmich.edu/evalctr/jc/PGMSTNDS-SUM.htm). Newbury Park, CA: Sage Publications.
[16] Joint Committee on Standards for Educational Evaluation. (2003). The Student Evaluation Standards: How to Improve Evaluations of Students (http://www.wmich.edu/evalctr/jc/briefing/ses/). Newbury Park, CA: Corwin Press.
[17] http://dx.doi.org/10.1177/014662169301700307
[18] http://dx.doi.org/10.1111/j.2044-8295.1997.tb02641.x
[19] http://dx.doi.org/10.1111/j.2044-8317.1997.tb01139.x
Further reading
Borsboom, Denny (2005). Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge: Cambridge University Press. ISBN 978-0-521-84463-5. Lay summary (http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=978-0-521-84463-5) (28 June 2010).
DeVellis, Robert F. (2003). Scale Development: Theory and Applications (http://books.google.com/?id=BYGxL6xLokUC&printsec=frontcover&dq=scale+development#v=onepage&q&f=false) (2nd ed.). London: Sage Publications. ISBN 0-7619-2604-6 (cloth); paperback ISBN 0-7619-2605-4. Retrieved 11 August 2010.
External links
APA Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html)
Joint Committee on Standards for Educational Evaluation (http://www.wmich.edu/evalctr/jc/)
The Psychometrics Centre, University of Cambridge (http://www.psychometrics.cam.ac.uk)
Psychometric Society and Psychometrika homepage (http://www.psychometrika.org/)
London Psychometric Laboratory (http://www.psychometriclab.com)
Rasch analysis in psychometrics (http://www.rasch-analysis.com/)
As Test-Taking Grows, Test-Makers Grow Rarer (http://www.nytimes.com/2006/05/05/education/05testers. html?ex=1304481600&en=bec6ba0fec0c3772&ei=5090&partner=rssuserland&emc=rss), May 5, 2006, NY Times. "Psychometrics, one of the most obscure, esoteric and cerebral professions in America, is now also one of the hottest."
License
Creative Commons Attribution-Share Alike 3.0 Unported (http://creativecommons.org/licenses/by-sa/3.0/)