Yuanyuan Sun
Introduction
According to Bachman & Palmer (2010), people make use of language assessments to
collect information to help make decisions both inside and outside the classroom. Some
decisions may be very significant and serious, since they have consequences for the
individuals, institutions, organizations or even societies. Suryaningsih (2014) stated that the
higher the stakes of the test, the stronger the urge to engage in specific test preparation
practices. As an English learner and future English teacher, I have personally experienced
the power of high-stakes assessment to influence people's lives. Nowadays in China, there
are many English learners who want to pursue higher education in the U.S. In order to be
qualified candidates, they have to obtain scores on standardized and widely recognized
English language proficiency tests. In addition, they have to consider whether they have the
For this paper, I review two large-scale language proficiency tests that are used for
admission purposes: Test of English as a Foreign Language Internet Based Test (TOEFL
Syndicate (UCLES). Since they are both high-stakes proficiency tests, I want to focus on how
they are designed and how well they measure language ability. Moreover, I want to
investigate if the information offered by the publishers demonstrates whether the tests are
effective, reliable and valid to help with enrollment and admission decisions in higher
609-734-5410; https://www.ets.org/contact
Overview*
The TOEFL was first introduced in 1964 as a five-part multiple-choice test assessing
the English proficiency of non-native speakers of the language for academic placement in
U.S. colleges and universities. It was replaced by a three-section test (Listening, Structure and
Written Expression, and Reading Comprehension) in 1976. In 1986, the Test of Written
English was added as a separate component in some but not all TOEFL administrations. The computer-
based TOEFL was launched in 1998. After years of validation activities supporting the
design, score interpretation and uses of the TOEFL, ETS released the iBT as the latest
version of the TOEFL. The official guide to the TOEFL test (ETS, 2012) points out three key
features of iBT: it measures speaking, listening, reading and writing skills that are significant
integrated tasks combining multiple skills in real academic settings; it represents the most
popular practices in language teaching and learning nowadays such as integrated language
skills and the communicative approach. An extended description of the TOEFL iBT is provided in Table 1:
Table 1
Test Purpose: The TOEFL iBT test is designed to evaluate the English proficiency of
nonnative English speakers, and it is primarily used to measure
international students' English proficiency in an academic environment,
such as a university, where instruction is in English.
(TOEFL iBT test framework, 2010)
Test Structure: The entire test has 4 sections and takes about 4 hours. It is administered via
the secure Internet network of testing centers.
Reading section (60-100 min) measures test takers' ability to understand
university-level academic materials. The section has 3-4 passages of
approximately 700 words each, with 12-14 questions per passage. There
are 3 question formats: traditional multiple-choice questions with four
choices and a single correct answer; questions asking test takers to insert
a sentence at whichever of four possible locations it fits best in a passage; and
reading-to-learn questions with more than four choices and more than
one correct answer.
Listening section (60-90 min) assesses test takers' ability to understand
academic lectures and long conversations (e.g., making inferences).
The section has 4-6 lectures, each 3-5 minutes long with 6 questions, and 2-3
conversations, each about 3 minutes long with 5 questions. Pictures
accompany each listening passage to show test takers the setting and
the roles of the speakers. Note-taking is allowed. There are four question
formats: traditional multiple-choice questions, multiple-choice questions
with more than one answer, questions asking test takers to order events or steps, and
questions asking them to match objects or text to categories in a chart.
Speaking section (20 min) includes six tasks in academic settings. The
first two independent speaking tasks require test takers to respond to
general topics familiar to them. Two integrated tasks require reading and
listening, and another two require listening, before speaking in response.
Writing section (50 min) includes two tasks. Integrated writing requires
test takers to read and listen to opinions on an academic topic while taking
TEST REVIEW 5
notes, and then write a summary. Independent writing asks for an essay that
states, explains and supports the test taker's personal opinion on an issue.
(The official guide to the TOEFL test, 2012)
Scoring of Test: Each section is scored on a scale of 0-30. The total score ranges from 0-120.
There are two types of scoring in the reading and listening sections.
Questions with a single correct answer are scored on a correct/incorrect
basis. Questions with more than one correct answer allow partial credit.
Speaking tasks are rated by 3-6 different certified raters on a scale from
0-4 according to the rubrics. Raters look for features of delivery,
language use and topic development in responses. Writing tasks are rated
on the 0-5 rubrics by 2 certified raters. Raters focus on
development, organization, content and language use to evaluate the
quality of writing. The average of the scores on the speaking and
writing tasks is converted to a scaled score of 0 to 30.
The final score report includes detailed performance feedback.
(The official guide to the TOEFL test, 2012)
Statistical Distribution of the Scores for Normed Group: Sawaki, Stricker & Oranje (2009) reported a survey conducted from 2003
to 2004, with participants from 31 countries accounting for about 80% of
the 2001-2002 TOEFL testing volume. Summary statistics for TOEFL
iBT scores (study sample and TOEFL 2002-2003 candidates) are listed
below:
            Mean    Std. Dev.
Reading     17.04   6.99
Listening   16.98   6.95
Speaking    16.97   6.98
Writing     16.05   6.67
Total       67.04   24.58
Standard Error of Measurement: Reliability and Comparability of TOEFL iBT Scores (2011) offers the
table below showing the SEM for total and part scores:
            SEM
Reading     3.35
Listening   3.20
Speaking    1.62
Writing     2.76
Total       5.64
Evidence for Reliability: ETS used data from operational tests in 2007 to produce part and total
score reliability estimates. The reliability estimates for the Reading, Listening,
Speaking, Writing and Total scores are 0.85, 0.85, 0.88, 0.72 and 0.94, respectively. As
the data show, though the reliability of the Writing score is
comparatively lower, the reliabilities of the Reading, Listening, Speaking
and Total scores are relatively high. As ETS claimed in the report, for
making high-stakes decisions such as admission to graduate or
undergraduate school, the total score, which reflects all four skills, is the
most important to consider, and it has the highest reliability in the data. Meanwhile,
ETS's report also supports high alternate-form reliability, since
Zhang's research in 2008 (cited in Reliability and Comparability, 2011)
shows high correlations between the scores of 12,000 test repeaters (0.77 for the
listening and writing sections, 0.78 for reading, 0.84 for speaking, and
0.91 for the total test score) on two iBT administrations they took within a month.
(Reliability and Comparability, 2011)
Evidence for Validity: Stoynoff (2009) listed some evidence to justify the TOEFL iBT's
construct validity. The iBT listening section, compared to the CBT, includes
more and longer spoken discourse and new item types (complex tasks) to
assess important listening constructs. The writing construct is also defined
more broadly in the iBT to include pragmatic considerations, complex
tasks with integrated skills, and additional rhetorical functions, which
help to increase the authenticity of tasks. Stoynoff (2009) also presents
research that Rosenfeld and his colleagues conducted in 22 North
American universities in 2001. Their corpus analysis data, which informed the
design and content of the input used to assess listening comprehension in
undergraduate and graduate courses, indicated that the content of the iBT
test approximates what could be encountered in the target language use
situation.
In the publication Validity Evidence Supporting the Interpretation and
Use of TOEFL iBT Scores (2008), produced by ETS, evidence for
valid score interpretation and use is provided. ETS claims that the TOEFL iBT score
is a reliable indicator of whether a student has sufficient
English-language proficiency for study at an English-medium college or
university. This is shown by the relationship between TOEFL iBT test
scores and other measures or criteria of language proficiency, such as self-
assessment, academic placement and local institutional tests for
international teaching assistants. The study statistics can be found in the
publication, and all are presented as supporting the TOEFL iBT as a valid indicator of learners'
proficiency. For example, regarding self-assessment, over 2,000 test
takers in 2003-2004 completed a questionnaire that indicated
how well they agreed with a series of "can do" statements
covering a mix of simple and complex speaking tasks, such as "My
instructor understands me when I ask a question in English" and "I can
talk about facts or theories I know well and explain them in English." The
results showed that test takers with higher scores were more likely to
indicate that they could do more complex tasks.
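The SEM values in Table 1 can be read as an uncertainty band around an observed score. The following is a minimal sketch, assuming the usual classical test theory reading that an observed score plus or minus 1 SEM covers roughly 68% of cases and plus or minus 1.96 SEM roughly 95%. The function name `score_band`, the example score of 80, and the clipping to the reported score scale are illustrative assumptions and do not come from the ETS report.

```python
# SEM values as listed in Table 1
# (Reliability and Comparability of TOEFL iBT Scores, 2011).
SEM = {
    "Reading": 3.35,
    "Listening": 3.20,
    "Speaking": 1.62,
    "Writing": 2.76,
    "Total": 5.64,
}

# Score scales from the Scoring of Test row: sections 0-30, total 0-120.
SCALE_MAX = {
    "Reading": 30,
    "Listening": 30,
    "Speaking": 30,
    "Writing": 30,
    "Total": 120,
}


def score_band(part, observed, z=1.96):
    """Return (low, high) = observed +/- z * SEM, clipped to the score scale.

    Hypothetical helper for illustration; z=1.96 assumes normally
    distributed measurement error (an assumption, not an ETS claim).
    """
    half_width = z * SEM[part]
    low = max(0.0, observed - half_width)
    high = min(float(SCALE_MAX[part]), observed + half_width)
    return round(low, 2), round(high, 2)


if __name__ == "__main__":
    # Hypothetical test taker with a total score of 80.
    print(score_band("Total", 80))          # approximate 95% band
    print(score_band("Total", 80, z=1.0))   # approximate 68% band
```

Under these assumptions, a total score of 80 corresponds to an approximate 95% band of about 69 to 91, which suggests caution when an admission cutoff falls inside that range.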
*Abbreviated from O'Sullivan (2005), with updated information from the IELTS Handbook (2007)
and IDP: 1 Hills Road Cambridge, CB1 2EU United Kingdom; Tel:
ielts@CambridgeESOL.org
Target Population: Students for whom English is not a first language and who wish to
China
Overview
provides two test routes. IELTS Academic is for people seeking higher education at the
undergraduate or graduate level, and IELTS General Training is for people who plan to work,
purposes in an English-speaking environment. The test routes share the same listening and
speaking tests. An extended review for the IELTS Academic test is provided in Table 2:
Table 2
Test Purpose: The test assesses whether test-takers are ready to study or train
in an English-speaking environment, and reflects the features of
language used in academic study at an undergraduate or
postgraduate level. It is also designed for admission purposes to
undergraduate or postgraduate courses.
(IELTS Handbook, 2007)
Test Structure: IELTS Academic includes Listening, Reading, Writing and
Speaking tests. Paper-based IELTS is offered by all test
centers.
IELTS is approximately 2 h and 30 min long in total.
The listening test (30 min) has 4 sections (2 about social needs,
2 about situations in educational or training contexts) with 40
questions including: multiple choice; short answer; sentence
completion; note/summary/flow-chart/table completion;
labelling a diagram; classification; matching. Listening excerpts,
including conversations and monologues, are played only once.
The reading test (60 min) has 3 reading passages (2,000-2,750 words
in total) with 40 questions including multiple choice, short-answer
questions, sentence completion, note/summary/flow-chart/table
completion, labelling a diagram, matching headings
for identified paragraphs/sections of the text, identification of
the writer's views/claims, identification of information in the text,
classification, and matching lists/phrases. Texts are of general
interest, taken from magazines, journals, books and
newspapers.
The writing test (60 min) includes 2 tasks. Task 1 asks test-takers
to describe information presented in a graph, table, chart or diagram. Task 2
requires test-takers to present an argument or view, or address a problem,
relating to an issue of general interest.
The speaking test (11-14 min) is an interview with 3 parts
for test-takers to complete. Part 1 consists of a basic introduction and
answering spoken questions on familiar topics. Part 2 is speaking
on a topic based on written input (a general instruction and a
content-focused prompt). Part 3 is an interactive discussion on the Part
2 topic.
(IELTS Handbook, 2007)
Scoring of Test: Test-takers are assessed on a Band Scale from 1-9. Each test
component is assessed individually, and the individual scores
Module SEM
Listening 0.37
Academic Reading 0.38
Evidence for Reliability: Reliability estimates for the Listening and Reading
test modules were reported in the 2015 online test performance report,
as listed below:
Module Alpha
Listening 0.92
Academic Reading 0.90
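In classical test theory, the SEM and the reliability coefficient are linked by SEM = SD × sqrt(1 − reliability). As a rough sketch, assuming that this relation applies to the IELTS figures above and that the SEM and alpha values refer to the same score scale and test population (neither assumption is stated in this review), the implied standard deviation of scores can be backed out. The function name `implied_sd` is illustrative.

```python
import math


def implied_sd(sem, reliability):
    """Back out SD from SEM = SD * sqrt(1 - reliability).

    Hypothetical helper; assumes the classical test theory relation holds
    and that both inputs describe the same scale and population.
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return sem / math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    # SEM and alpha values as listed in Table 2 for the two modules.
    modules = {
        "Listening": (0.37, 0.92),
        "Academic Reading": (0.38, 0.90),
    }
    for name, (sem, alpha) in modules.items():
        print(name, round(implied_sd(sem, alpha), 2))
```

Under these assumptions, the implied standard deviations come out at roughly 1.3 for Listening and 1.2 for Academic Reading; these are back-of-the-envelope figures, not values reported by IELTS.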
Discussion
speakers whose L1 is Chinese, and they are preparing for a language proficiency test so that
they can pursue admission to higher education institutions in the U.S. Students may have
different educational backgrounds, but all of them have obtained at least a junior high school
diploma, and they are working toward enrollment in undergraduate or graduate school in the U.S.
Furthermore, students in this class may have been assessed by the IEP in order to be placed into the
same class. Their English proficiency levels may vary, but most students range from
intermediate to advanced. Since all the students are learning English for admission to higher
education, which may well influence their future lives, they are very motivated to attend
class and achieve class objectives. The class materials offered by the IEP should target the
specific English proficiency test. As the instructor, I am supposed to recommend, from among the
many large-scale proficiency tests available in China, the one that best serves the students' interests.
TOEFL and IELTS are the most popular and widely accepted English
proficiency tests in China. According to Arcuino (2013), TOEFL scores are recognized by
over 8,500 agencies and educational institutions in over 130 countries, while the IELTS is
administered in over 135 countries and its scores are accepted by over 7,000 educational
institutions around the world. The IEPs that prepare the student group I described for
English proficiency tests usually choose TOEFL or IELTS to set their curriculum.
Considering the students' goal of obtaining English language proficiency scores in order to be granted
admission to universities in the U.S., both assessments should work. However,
considering the context and the test review, I think TOEFL is more appropriate than IELTS.
First of all, in terms of reliability, according to Stoynoff (2009), ETS uses multiple
Writing tasks and examiners' assessments are monitored daily. All these procedures help to
maintain the consistency of both iBT section scores and total scores. On the other hand,
though IELTS claims to have sufficient rater reliability, Chalhoub-Deville & Turner
(2000) stated that the information to support the reliability of IELTS is lacking and limited,
and IELTS publications need to provide more documentation of rater reliability, the
In terms of validity, both proficiency tests build integrated skills into their tasks.
Sawaki et al. (2009) indicate that integrated tasks in TOEFL iBT reflect the complex
context of language use in academic settings. The integrated tasks require students to
textbook to prepare for a lecture, listening to the lecture and writing down key points of the
lecture for future reference (Sawaki et al., 2009). This would benefit students greatly once
they enter a real academic setting in the U.S. However, as for IELTS, as shown in the test
review table above, many tasks focus on topics of general interest. Additionally, according
to Chalhoub-Deville & Turner (2000), IELTS has been more commonly used in the UK and
Australia. It seems like research is needed to see if the scores obtained from IELTS are
difficult to be certain of what the ensuing test scores mean and how they should be used in
academic contexts in the U.S. Since the students in my classroom are aiming to get into
and succeed in U.S. classrooms, TOEFL, in this case, would be a better choice.
There is one noteworthy feature of the IELTS, according to Stoynoff (2009), which
shows better authenticity and interactiveness compared to TOEFL, that is the speaking test in
method used in TOEFL iBT. However, TOEFL iBT also uses pictures accompanying tasks,
which, to some degree, may help with students' performance and engage them in authentic
contexts.
References
Arcuino, C. L. T. (2013). The relationship between the test of English as a foreign language
(TOEFL), the international English language testing system (IELTS) scores and
State University). Available from Dissertations & Theses @ Colorado State University;
https://search-proquest-com.ezproxy2.library.colostate.edu/docview/1413309058?accountid=10223
Bachman, L., & Palmer, A. (2010). Language assessment in practice. New York, NY:
Chalhoub-Deville, M., & Turner, C. E. (2000). What to look for in ESL admission tests:
ETS. (2012). The official guide to the TOEFL test (4th ed.). New York, NY: McGraw-Hill
Education.
Stoynoff, & C. Chapelle, ESOL tests and testing (pp. 73-78). Alexandria, VA: TESOL.
Reliability and comparability of TOEFL iBT scores. (2011). Insight: TOEFL iBT Research,
1(3), 1-8.
Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL internet-
Stoynoff, S. (2009). Recent developments in language assessment and the case of four large-
Stoynoff, S., & Chapelle, C. (2005). ESOL tests and testing. Alexandria, VA: TESOL.
system (IELTS) and test of English as a foreign language (TOEFL) tests (Doctoral
Dissertations & Theses Global: Literature & Language; ProQuest Dissertations &
com.ezproxy2.library.colostate.edu/docview/1534350812?accountid=10223
TOEFL iBT test framework and test development. (2010). Insight: TOEFL iBT Research,
1(1), 1-9.
Validity evidence supporting the interpretation and use of TOEFL iBT scores. (2008).
https://www.ielts.org/teaching-and-research/test-performance.