
Running head: RELIABILITY REVIEW OF PIAT-R/NU

The Normative Update of the Revised Peabody Individual Achievement Test - A Review of Reliability

Jenna M. Powell

Lamar University


Abstract

Individually administered achievement tests are given to students for a variety of reasons. The Peabody Individual Achievement Test (PIAT), originally created in 1970 by Frederick C. Markwardt, Jr., PhD, was designed to assess the academic achievement of students in kindergarten through grade 12 (K-12). The original test contained five subtests to achieve this goal. The test was later revised and a sixth subtest was added; for the revision, a sample of 1,563 students in grades K-12 was tested in the United States in 1986, and the results were carefully analyzed through several methods of reliability testing. The content of the revision remained unchanged for the normative update of 1995-96, for which 3,184 students were sampled. The normative update used a domain-norming approach involving four other individually administered achievement tests: each student was given one of the five full tests as well as one or more subtests from the others, ensuring that one-fifth of each grade level received the PIAT-R. The normative update results of 1996 were then compared with the revised results of 1986.

This reliability review mainly focuses on the revision and normative update of the PIAT. It also discusses each subtest and describes the sampling process, norm development, reliability testing methods, demographic variables used, and administration and interpretation process.


The Normative Update of the Revised Peabody Individual Achievement Test - A Review of Reliability

The Peabody Individual Achievement Test - Revised/Normative Update (PIAT-R/NU) is the revised and normatively updated version of the classic 1970 Peabody Individual Achievement Test (PIAT), an individually administered measure of academic achievement. The test was revised in 1989 and the norms were updated in 1997. Frederick Markwardt, Jr. designed the test to evaluate the academic achievement of students in kindergarten through the 12th grade (K-12) in six content areas known as the subtests. The test contains two main formats: multiple choice and free response, the latter including verbal and writing components. A major feature of the revision is the addition of the Written Expression subtest. The subtests are: (a) General Information, (b) Reading Recognition, (c) Reading Comprehension, (d) Mathematics, (e) Spelling, and the newest, (f) Written Expression. Along with the six subtests, there are three composites: (a) Total Reading, (b) Total Test, and (c) Written Language. (Markwardt, 1997) During the revision, the order of the subtests was changed in hopes of increasing the subject's motivation as well as interest level. (Markwardt, 1997)

The General Information subtest measures the subject's general knowledge; each question is read aloud and answered orally. The Reading Recognition subtest is an oral reading test in which the subject reads items out loud. In the Reading Comprehension subtest, the subject reads a sentence and chooses the picture that illustrates what the sentence stated, measuring comprehension of what was read. The Mathematics subtest is multiple choice and measures conceptual application and knowledge of mathematical facts. All items are arranged in ascending order of difficulty. The Spelling subtest begins by measuring the ability to recognize letters from sounds and, in a later part, measures the subject's recognition of standard spellings. The newest subtest, Written Expression, has two levels. Level I is for kindergarten through first grade: the subject is asked to copy and to write letters, words, and sentences from dictation. Level II is for grades 2-12: the subject is asked to write a story after being shown a picture. (Markwardt, 1997)

The PIAT-R items measure functional knowledge and general abilities that are not specific to a particular curriculum. The test is objectively scored and has no time constraint, although the average administration time is about 60 minutes. The PIAT-R is helpful when a scholastic survey is needed, and it aids in the selection of a more diagnostic instrument. The Total Reading composite is obtained by summing the Reading Recognition and Reading Comprehension raw scores. The Total Test composite is the sum of the General Information, Reading Recognition, Reading Comprehension, Mathematics, and Spelling raw scores. Finally, the Written Language composite is the sum of the scaled scores for the Spelling and Written Expression subtests. (Markwardt, 1997)
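The composite-score arithmetic described above can be sketched briefly. This is an illustration only: the subtest names and score values below are invented examples, not data from the manual.

```python
# Hypothetical illustration of PIAT-R composite assembly from subtest scores.
# All values are invented; only the combination rules mirror the text above.

raw_scores = {
    "general_information": 62,
    "reading_recognition": 71,
    "reading_comprehension": 58,
    "mathematics": 66,
    "spelling": 64,
}

# Total Reading = Reading Recognition + Reading Comprehension raw scores.
total_reading = raw_scores["reading_recognition"] + raw_scores["reading_comprehension"]

# Total Test = sum of the five raw scores above (Written Expression excluded).
total_test = sum(raw_scores.values())

# Written Language uses *scaled* scores, not raw scores.
scaled = {"spelling": 10, "written_expression": 12}
written_language = scaled["spelling"] + scaled["written_expression"]

print(total_reading, total_test, written_language)  # 129 321 22
```

Note the asymmetry: the first two composites sum raw scores, while Written Language sums scaled scores, because Written Expression is not normed on the same raw-score metric.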

The test is administered individually, owing to its design and materials. Four test plates and easels together contain all six subtests. These plates are designed to capture and hold the interest of each subject, regardless of sex, age, intellect, and cultural background. In an attempt to balance the representation of ethnic groups and the sexes among the test items, contemporary artwork is used throughout the test. (Markwardt, 1997)


The PIAT-R is based on a national sample of 1,563 subjects tested in the spring of 1986. (Markwardt, 1997) These subjects were selected to represent the total school population (K-12) with respect to parental socioeconomic status, sex, geographic region, race/ethnic group, and grade, based on U.S. Census Bureau population data. This sample size is extremely small and limits actual representation. Parental socioeconomic status was divided into four categories: less than a high school education, high school graduate, one to three years of college or education beyond high school, and four or more years of college. Forty-two percent of the sample fell at the high school graduate level, while the other three categories ranged from 18.9 percent to 19.9 percent. The target value for sex was 50 percent, and the sample came very close, with females totaling 50.1 percent and males 49.9 percent. Geographic region was divided into four categories and did not include Hawaii or Alaska. The four divisions and their student percentages are: Northeast, 19.1 percent; North Central, 25.6 percent; South, 32.8 percent; and West, 20.6 percent. According to Markwardt, these divisions closely match each region's share of the population; however, only 20 of the 48 contiguous states are represented, which is not a fair depiction of every K-12 student in those states. Race/ethnic group was also divided into four categories, with the following student percentages: White was the majority at 73.3 percent, followed by Black at 14.3 percent, Hispanic at 9.7 percent, and Other at 2.7 percent. The Other category includes Asians, Pacific Islanders, Native Americans, and Alaskan Natives. (Markwardt, 1997) According to 1985 U.S. population data, the Other category should be 3.2 percent, yet only 2.7 percent is represented in this study.


Before discussing the major uses of the PIAT-R, let's review the differences between the 1970 version and the most current, 1989 revised edition. As mentioned above, the most notable changes are the new order of the subtests and the addition of the Written Expression subtest. The primary reason for the revision, however, was to bring in more current content: only 35 percent of the original items carried over from the PIAT to the PIAT-R. (Markwardt, 1997) Another addition was the Total Reading composite score, which measures overall achievement in reading. The revision also increased the number of items in the original five subtests. In four subtests (General Information, Mathematics, Reading Recognition, and Spelling), the item count went from 84 to 100, while Reading Comprehension went from 66 to 82 items. (Markwardt, 1997)

There are seven major uses for the PIAT-R according to Markwardt. The first is individual evaluation, which provides insight into the subject's existing knowledge, educational strengths and weaknesses, and testing behavior. Program planning is another use: the test helps develop a course of action to meet the subject's unique needs. Through guidance and counseling, the test can also help parents and students understand the subject's strengths and weaknesses when deciding on future plans. Once the subject's general level of accomplishment is determined, school placement is easily achieved, and the subject can be transferred or admitted to a new school based on these scores. The test is also used to group students by achievement level. Follow-up evaluation provides a measure of educational intervention at times when a highly precise test is not required. The final use is personnel selection and training: the subject's level of achievement is used for employment selection or for guiding an employee to appropriate educational programs. In addition to the seven major uses of the PIAT-R/NU, there are five research uses: longitudinal, demographic, program evaluation, basic research, and validation studies. (Markwardt, 1997)

The administration and interpretation of the PIAT-R are relatively simple. The administrator presents items, records responses, and then calculates scores. A single test manual contains both technical and administrative guidelines for ease of use. While test administration may be handled with little study, score interpretation should not be attempted by anyone unfamiliar with psychometrics; knowledge of educational curricula and their implications is required for accurate interpretation. (Markwardt, 1997) A flaw worth mentioning is that the test manual never states in detail the required degree or experience level of either administrator or interpreter; it merely lists a few examples of professions that can fill these roles effectively. Interpretation focuses on the actual meaning of the scores, the confidence that can be placed in them, and the prediction of future behavior. As previously discussed, this test is not meant to be a diagnostic tool in itself; it is not designed to provide a precise, high-level assessment of achievement. Also, items are selected to sample a mere cross-section of curricula across the United States and are not specific to any one school system. Because the administration process is brief and simple, people without the proper background are liable to misinterpret the data. The manual acknowledges this potential for misinterpretation and misuse and lists several pitfalls to avoid.

Before discussing the reliability tests used, we should take a moment to review the development of the norms used for standardization. The 1986 standardization used spring data for all grades except kindergarten, which used fall data. With the exception of Written Expression, each subtest and composite was scaled to a standard deviation of 15 and a mean of 100, which is typical of deviation-IQ scales. Additionally, percentile ranks, stanines, grade and age equivalents, and normal curve equivalents were developed. The standard scores of the five subtests were derived through careful calculation and smoothing. (Markwardt, 1997) The function of smoothing is to iron out sampling irregularities so that score progressions across adjacent age and grade groups follow an orderly distribution; it is defensible when the underlying data are approximately normal. Composite standard scores were computed by adding the subtest raw scores for each student at each age and grade, producing the distributions from which those scores were found. Age and grade equivalents were each plotted and calculated. Once the normalized standard scores were in hand, the normal curve equivalents, percentile ranks, and stanines could be read from the normal curve table. (Markwardt, 1997) As mentioned before, Written Expression norms had to be handled differently: the scores do not have an adequate range, and findings showed it inappropriate to develop standard scores and age/grade norms. Therefore, grade-based stanines are provided for Levels I and II, and Level II has an additional developmental scaled score. (Markwardt, 1997)
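The core idea behind a normalized standard score with mean 100 and SD 15 can be sketched as follows. This is a simplified illustration, not the manual's procedure (which also involves smoothing across groups); the raw-score sample below is invented.

```python
# Sketch of area normalization: a raw score's percentile rank within its
# grade group is mapped through the inverse normal curve onto a scale with
# mean 100 and SD 15. Sample data are invented for illustration.
from statistics import NormalDist

def normalized_standard_score(raw, grade_sample):
    # Percentile rank of `raw` in the grade's raw-score distribution,
    # using the midpoint convention (scores below plus half of ties).
    below = sum(1 for x in grade_sample if x < raw)
    ties = sum(1 for x in grade_sample if x == raw)
    pct = (below + 0.5 * ties) / len(grade_sample)
    # Map the percentile through the standard normal curve.
    z = NormalDist().inv_cdf(pct)
    return 100 + 15 * z

sample = [40, 45, 50, 50, 55, 60, 65, 70, 75, 80]
print(round(normalized_standard_score(55, sample)))  # near the middle, so close to 100
```

Once scores are on this normalized scale, percentile ranks, stanines, and normal curve equivalents all follow directly from the normal curve table, exactly as the text describes.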

The normative update was published in 1997 using data collected in 1995-96. (Markwardt, 1997) Four other individually administered achievement tests were used in the update program. Two of them are the Brief and Comprehensive forms of the Kaufman Test of Educational Achievement (K-TEA); the other two are the KeyMath Revised test (KeyMath-R) and the Woodcock Reading Mastery Tests - Revised (WRMT-R). Five domains were included in this standardization process to create a cross-battery domain of subtests: spelling, reading comprehension, math computation, word reading, and math applications. The approach used is referred to as the domain-norming approach: each person tested is administered one full achievement test from the five listed, referred to as the primary test, and is also given one or more subtests from the other tests. Rasch scaling was applied to the five domains listed above, while the Written Expression and General Information subtests obtained standard scores normalized through the smoothing process explained earlier. (Markwardt, 1997)

The sample for the normative update comprised 3,184 students in kindergarten through grade 12, plus an additional 245 young adults aged 18-22 in a variety of educational statuses. (Markwardt, 1997) This sample size, again, is limiting and does not accurately represent the population; sampling only English-speaking examinees further limits how representative it can be. The norming relied primarily on the 3,184 grade-norm samples (K-12), with less emphasis on the 245 age-norm samples. (Markwardt, 1997) The demographic variables used for the normative update were similar to those of the revision, and they are the same for the grade-norm and age-norm samples with one variation. For the grade norms, the variables were sex, race/ethnicity, parental education, educational placement, and geographic region; the age norms swap educational placement for educational status.

The demographic variables were compared with U.S. population data from March 1994 and are mostly in line for both the grade-norm and age-norm samples. Parental education, region of the country, and race/ethnicity are each divided into four categories in both samples, similar to the revised version of the test. The main difference is educational placement for the grade norms versus educational status for the age norms. For educational placement, the special education group accounts for 10.8 percent of the sample versus 10.2 percent of the U.S. population.


The gifted group, however, makes up only 2.3 percent of the sample, compared with 4.2 percent of the U.S. population (Markwardt, 1997), roughly half the expected proportion.

Sample selection was done by a random procedure: all permission forms received were pooled, and a computer selected the examinees. (Markwardt, 1997) Each examinee was randomly assigned one of the five tests in an organized fashion to ensure that one-fifth of each grade took each test. The norms were developed from the Rasch distribution of ability scores for each grade or age, and sampling error was reduced by using a smoothing method developed by Poste and Traub (1990). (Markwardt, 1997)

When comparing the 1996 normative update scores with the 1986 revised scores, one must take into account several changes over the ten-year span: a modernized cultural environment, new educational curricula, and shifting population demographics can all affect the results. The overall pattern was a decline among below-average students in grades 1-12, with the greatest improvement on the Mathematics and Reading Comprehension subtests. The change per subtest breaks down as follows. General Information: above-average students in grades 1-3 improved, while below-average students declined slightly in grades 4-7. Reading Recognition: average students declined in grades 2 and 3, above-average students in grades 1-2 performed higher, and below-average students declined in grades 1-12. Reading Comprehension: average students declined in grades 1-2 and improved in grades 8-12. Total Reading: average students declined in grades K-2, above-average students improved in grades 1-2, and below-average students declined in grades 1-12. Mathematics: average students declined in grades 1-3 and improved in grades 5-12, while above-average students improved in grades 2-12; the same pattern of below-average students declining in grades 1-12 appears on this subtest. Spelling: the same pattern appears again, with below-average students declining in grades 1-12, above-average students improving in grades K-1, and average students declining in grades 1-3. Total Test: average students declined in grades 1-3 and improved in grades 7-9, above-average students improved in grades K-8, and below-average students declined in grades 1-9. Written Expression: Level I showed a decline, and Level II showed an increase for grades 4-12. Written Language composite: average and below-average students declined in grades K-1. (Markwardt, 1997) While a chart is provided, reliability is not discussed in as much detail for the normative update as for the revised test.

Having covered the norms, we move on to the reliability of the revised test. Four methods were used to examine reliability for the PIAT-R from slightly different perspectives: split-half, item response theory, Kuder-Richardson, and test-retest. (Markwardt, 1997) To build a clear understanding, each method is defined first and then its results are stated. The first is the split-half reliability test, which shows performance consistency within each subtest. It was followed by the Spearman-Brown prophecy formula to estimate the reliability of the full-length test. (Markwardt, 1997) Although the Spearman-Brown formula is known to inflate reliability estimates, it is acceptable here as a follow-up to the split-half test. Coefficients are considered more reliable the closer they come to 1, without ever reaching 1. The PIAT-R results are presented by grade only for the first five subtests, as the Written Expression subtest is interpreted and measured differently and will be discussed separately. Split-half results for subtests and composites by grade show a median of .98, while split-half results for the sample by age show a close .99. These high coefficients are partly due to the operational rule that all items below the basal are counted correct and all items above the ceiling are counted incorrect. The next method is the Kuder-Richardson reliability test, which measures the consistency of all items and indicates the amount of measurement error in the test. The Kuder-Richardson medians match the split-half medians, showing the content to have high homogeneity. Next is test-retest reliability, which simply shows the consistency of scores from one administration of the test to another: the subject is given the test and then the same test again after a period of time. One year is the ideal interval; in this case, however, fifty randomly selected subjects were retested two to four weeks after the original test date, which is not a sufficient interval to show true reliability. The test-retest median for selected grades in the random sample was .96, and for selected ages it was also .96. The final method is item response theory reliability. This method gives different estimates of error variance as well as true score and relies on the seven assumptions of Classical True Score Theory; it is based on the idea that an observed score is a combination of true score and error. (Markwardt, 1997) The total-test coefficients for both grade and age using this method are the highest of all, each with a median of .99. The median reliability coefficients for the total test are all in the high .90s; however, breaking the numbers down, some subtests clearly need revising. In Mathematics, for instance, the split-half reliability is a low .84 for kindergarten subjects.


The PIAT-R manual briefly discusses the standard error of measurement (SEM), one of the key results of Classical True Score Theory. The coefficients discussed thus far describe the consistency of scores for a group of subjects but do not allow interpretation of an individual's test score; this is where the SEM is useful. Using the split-half coefficients, SEM values for standard and raw scores were computed and rounded to the nearest tenth. The values were then smoothed and transformed into whole numbers so that raw-score confidence intervals could be computed. Before smoothing, the median SEM for the total test by grade and age was 5.8; after smoothing, the median for grade was 2 and for age 1.8. (Markwardt, 1997) With a standard deviation of 15, the SEM falls well below the standard deviation, yielding reasonably narrow confidence intervals after smoothing.
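The relationship between reliability and the SEM follows the standard Classical True Score Theory formula, SEM = SD * sqrt(1 - r). A quick sketch using the document's SD of 15 and a split-half coefficient of .98 (the confidence-interval step below is a generic illustration, not the manual's table):

```python
# Standard error of measurement from Classical True Score Theory:
# SEM = SD * sqrt(1 - reliability).
import math

def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

sd, r = 15, 0.98
error = sem(sd, r)  # just over 2 standard-score points

# A generic 95% confidence band around an observed standard score of 100:
low, high = 100 - 1.96 * error, 100 + 1.96 * error
print(round(error, 2), round(low, 1), round(high, 1))
```

This makes the text's point concrete: with reliability near .98, an individual's standard score carries an uncertainty band only a few points wide, far narrower than the 15-point standard deviation of the scale.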

Levels I and II of the Written Expression subtest are examined and scored differently because the subtest has different properties from the others. The three reliability methods used for this subtest are test-retest, interrater, and internal consistency. (Markwardt, 1997) Level I tests (kindergarten and first grade) were scored independently by two individuals, and the results were correlated: first grade had a correlation of .95 and kindergarten .90. An interrater reliability test was also used; in interrater testing, each item has a predetermined value and is scored accordingly, and the scores of independent raters are compared. The Level I interrater coefficient was .88 for first grade and .91 for kindergarten. Internal consistency reliability uses the coefficient alpha formula, and the outcome was considered moderate: using spring standardization data for first grade and both fall and spring data for kindergarten, coefficients ranged from .60 to .69. (Markwardt, 1997) The likely reasons given for this moderate result are that the content is not homogeneous and the sample size is small. In Level II Written Expression, two different prompts (pictures) were used, referred to here as Prompt A and Prompt B. Level II (grades 2-12) also used three types of reliability testing, but with an alternate-form test in place of test-retest. Coefficient alpha ranges from .69 to .91; for the total standardization sample it was .86 for Prompt A and .88 for Prompt B. Interrater correlations have a median of .58 for Prompt A and .67 for Prompt B, and internal consistency shows a median of .57 for Prompt A and .67 for Prompt B. The alternate-form portion included about 35 randomly selected subjects who, two to four weeks after the initial test, responded to a picture prompt that differed from the original; the correlation for the total sample was .63. (Markwardt, 1997) Extremely low reliability is found for both Prompt A and Prompt B.
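The coefficient alpha formula behind these internal-consistency figures can be sketched in a few lines. This is the generic statistic, not the PIAT-R data; the item scores below are invented for illustration.

```python
# Minimal coefficient-alpha (Cronbach's alpha) sketch:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores).
# Item scores are invented for illustration.
from statistics import pvariance

# Rows = examinees, columns = item scores (e.g., 0-3 rating per item).
items = [
    [2, 3, 3, 2],
    [1, 1, 2, 1],
    [3, 3, 3, 3],
    [2, 2, 1, 2],
    [0, 1, 1, 1],
]

k = len(items[0])
item_vars = [pvariance([row[i] for row in items]) for i in range(k)]
total_var = pvariance([sum(row) for row in items])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```

Alpha rises when items covary strongly, which is why the manual's explanation for the moderate Written Expression values (heterogeneous content, small sample) is plausible: items that tap different writing skills covary less, pulling alpha down.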

To summarize, the PIAT was devised to measure the academic achievement of students in kindergarten through grade 12. The results suggest the revised version largely accomplishes this goal; however, the reliability evidence is somewhat flawed. The samples in both the revision and the normative update badly underrepresent the population at large. Another major error is the extremely short interval between test and retest. Neither the revised version nor the normative update had every U.S. state represented, yet the results are supposed to represent the entire country. A few obvious changes that could have supported the high reliability numbers are: increasing the sample size, including all 50 states, and lengthening the test-retest interval (a year is ideal). These changes would be a great start toward increasing reliability for the next version of the Peabody Individual Achievement Test.


References

Markwardt, F. C. (1998). Peabody Individual Achievement Test - Revised/Normative Update. Minneapolis, MN: NCS Pearson, Inc.
