
(NOTE: This article is the pre-release draft of an article to appear in
the summer issue of the Cambridge University Press publication
Cambridge Connections. The final version of the article will be posted
later.)
The New General Service List: A Core Vocabulary for EFL
Students & Teachers
Dr. Charles Browne, Meiji Gakuin University
Dr. Brent Culligan, Aoyama Gakuin Women's Junior College
Joseph Phillips, Aoyama Gakuin Women's Junior College
The English language has a surprisingly large number of words.
Even if we count words like ACCEPT, ACCEPTS, ACCEPTING and
ACCEPTABLE as part of the same word family, there are still more
than 500,000 words in English! Fortunately for teachers and
students, language has built-in redundancy, with certain words
occurring much more frequently than others (the word THE, for
example, makes up 6-7% of all the words in any book, magazine or
newspaper). Because of this, the average native speaker of English
knows only a small percentage of these half million words (about
22,000 words for a recent college graduate).
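How many "words" there are depends on the unit being counted. The short Python sketch below counts the four forms mentioned above as surface forms, as lemmas and as word families. The lookup tables are hand-written for illustration only, and the sketch assumes the usual convention that a lemma groups only inflected forms of a word, so that ACCEPTABLE is a separate lemma but belongs to the same word family.

```python
# Hand-written lookup tables, assumed for illustration only.
LEMMA_OF = {
    "accept": "accept",
    "accepts": "accept",
    "accepting": "accept",
    "acceptable": "acceptable",  # derived form: its own lemma
}
FAMILY_OF = {
    "accept": "accept",
    "acceptable": "accept",      # same word family as ACCEPT
}

def count_units(word_forms):
    """Return (surface forms, lemmas, word families) for the given words."""
    forms = {w.lower() for w in word_forms}
    lemmas = {LEMMA_OF[w] for w in forms}
    families = {FAMILY_OF[lemma] for lemma in lemmas}
    return len(forms), len(lemmas), len(families)

counts = count_units(["ACCEPT", "ACCEPTS", "ACCEPTING", "ACCEPTABLE"])
print(counts)  # (4, 2, 1)
```

The same four forms therefore count as four, two or one "word" depending on the definition used.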
Although 22,000 words may sound like a daunting number, there is
more good news. Corpus linguistics, the science of analyzing large
collections of texts, has shown that knowledge of just a few
thousand of the most important words can give an astonishing
degree of coverage of English used in daily life. In 1953, Michael
West published a list of about 2000 important vocabulary words
known as the General Service List (GSL). Based on more than two
decades of pre-computer corpus research and a corpus size of 2.5
million to 5 million words, the GSL gives about 84% coverage of
general English. However, as useful and helpful as this list has been
to us over the decades, it has been criticized for (1) being based on
a corpus that is both dated and small by modern standards and (2)
not clearly defining what constitutes a word.
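"Coverage" here means the share of all running words in a text that belong to a given word list. A minimal Python sketch of that calculation follows. The sample sentence and the short word list are invented for illustration, and the simple whitespace tokenization is an assumption, not the procedure used for the GSL.

```python
from collections import Counter

def coverage(tokens, word_list):
    """Return the share of running words (tokens) that appear in word_list."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for word, n in counts.items() if word in word_list)
    return covered / total

# Toy text, invented for illustration only (not taken from any corpus).
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()
core_words = {"the", "on", "and", "sat"}  # hypothetical high-frequency list

print(f"{coverage(tokens, core_words):.1%}")
```

In this toy text, four word types account for 9 of the 13 running words, which is the same effect that lets a list of a few thousand words cover most of everyday English.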
On the 60th anniversary of West's publication of the GSL, we would
like to announce the creation of a New General Service List (NGSL)
that is based on a carefully selected 273-million-word subsection of
the 1.6-billion-word Cambridge English Corpus (CEC). Following
many of the same steps as West and his colleagues (as well as the
suggestions of Professor Paul Nation, project advisor and a leading
figure in modern second language vocabulary acquisition), we have
tried to combine the strong objective scientific principles of corpus
and vocabulary list creation with useful pedagogic insights to create
a list of approximately 2800 high-frequency words that meets the
following goals:
1. to update and expand the size of the corpus used (273 million
words) compared to the limited corpus behind the original GSL
(about 5 million words), with the hope of increasing the
generalizability and validity of the list
2. to create a NGSL of the most important high-frequency words
for second language learners of English which gives the
highest possible coverage of English texts with the fewest
words
3. to make a NGSL that is based on a clearer definition of what
constitutes a word
4. to be a starting point for discussion among interested scholars
and teachers around the world, with the goal of updating and
revising the list based on this input (in much the same way
that West did with the original Interim version of the GSL)
The NGSL: A word list based on a large, modern corpus
Utilizing a range of computer-based corpus tools, we began
developing the NGSL with an analysis of the Cambridge English
Corpus (formerly known as the Cambridge International Corpus).
The CEC is a 1.6-billion-word corpus of the English language that
contains both written and spoken data of British and American
English. The initial corpus was created using a subset of the
1.6-billion-word CEC that was queried and analyzed using the
SketchEngine (2006) (http://www.sketchengine.co.uk). The size of
each sub-corpus that was initially included is outlined in Table 1:
Table 1. CEC corpora used for preliminary analysis of NGSL

Corpus          Running Words
Newspaper         748,391,436
Academic          260,904,352
Learner            38,219,480
Fiction            37,792,168
Journals           37,478,577
Magazines          37,329,846
Non-Fiction        35,443,408
Radio              28,882,717
Spoken             27,934,806
Documents          19,017,236
TV                 11,515,296
Total           1,282,909,322

Upon revision, the Newspaper and Academic corpora were removed
from the compilation. The Newspaper corpus was removed because
its enormous size (748,391,436 running words) dominated the total
frequencies and it also showed a marked bias towards financial
terms. The Academic sub-corpus (260,904,352 words) was removed
because it was a specific genre not directly related to general
English. The final 273-million-word corpus is far more balanced as a
result.
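The effect of removing the two sub-corpora can be checked directly from the Table 1 figures. The Python sketch below simply re-does that arithmetic; the variable names are our own.

```python
# Running-word counts copied from Table 1.
sub_corpora = {
    "Newspaper": 748_391_436,
    "Academic": 260_904_352,
    "Learner": 38_219_480,
    "Fiction": 37_792_168,
    "Journals": 37_478_577,
    "Magazines": 37_329_846,
    "Non-Fiction": 35_443_408,
    "Radio": 28_882_717,
    "Spoken": 27_934_806,
    "Documents": 19_017_236,
    "TV": 11_515_296,
}

removed = {"Newspaper", "Academic"}

initial_total = sum(sub_corpora.values())
final_total = sum(size for name, size in sub_corpora.items() if name not in removed)
newspaper_share = sub_corpora["Newspaper"] / initial_total

print(f"Initial corpus: {initial_total:,} running words")
print(f"Final corpus:   {final_total:,} running words")
print(f"Newspaper share of initial corpus: {newspaper_share:.1%}")
```

The Newspaper sub-corpus alone made up well over half of the initial compilation, and the nine sub-corpora that remain add up to the 273 million words reported above.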
The resulting word lists were then cleaned up by removing proper
nouns, abbreviations, slang and other noise, and excluding certain
word sets such as days of the week, months of the year and
numbers. Then we used a series of computations to combine the
frequencies from the various sub-corpora while adjusting for
differences in their relative sizes. Based on a series of meetings and
discussions with Paul Nation about how to improve the list, the
combined list was then compared to other important lists such as
the original GSL, the BNC and COCA to make sure important words
were included/excluded as necessary.
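The exact computations used to combine the sub-corpus frequencies are not described here. One common way to adjust for differences in corpus size, shown in the Python sketch below purely as an assumption, is to convert each raw count to a rate per million running words and then average the rates. The corpus sizes and counts in the example are invented.

```python
def per_million(count, corpus_size):
    """Convert a raw count into occurrences per million running words."""
    return count * 1_000_000 / corpus_size

def averaged_rate(counts, sizes):
    """Mean per-million rate across sub-corpora, each weighted equally."""
    rates = [per_million(counts.get(name, 0), size) for name, size in sizes.items()]
    return sum(rates) / len(rates)

def pooled_rate(counts, sizes):
    """Per-million rate if all sub-corpora are simply merged into one."""
    total_count = sum(counts.get(name, 0) for name in sizes)
    return per_million(total_count, sum(sizes.values()))

# Hypothetical sub-corpus sizes and counts for a single word.
sizes = {"large": 40_000_000, "small": 10_000_000}
word_counts = {"large": 4_000, "small": 3_000}

print(averaged_rate(word_counts, sizes))  # each sub-corpus counts equally
print(pooled_rate(word_counts, sizes))    # the large sub-corpus dominates
```

With simple pooling, the larger sub-corpus pulls the combined figure towards its own rate, which is the kind of size effect an adjustment of this sort is meant to avoid.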
The NGSL: More coverage for your money!
One of the important goals of this project was to develop a NGSL
that would be more efficient and useful to language learners and
teachers by providing more coverage with fewer words than the
original GSL. For a meaningful comparison between the GSL and
NGSL to be done, the words on each list need to be counted in the
same way. A comparison of the number of word families in the
GSL and NGSL reveals that there are 1964 word families in the
former and 2368 in the latter (using level 6 of Bauer and Nation's
1993 word family taxonomy). Coverage within the 273-million-word
CEC is summarized in Chart 1, showing that the 2368 word families
in the NGSL provide 90.34% coverage while the 1964 word families
in the original GSL provide only 84.24%. That the NGSL, with
approximately 400 more word families, provides more coverage than
the original GSL may not seem a surprising result. When these
lists are lemmatized, however, the usefulness of the NGSL becomes
more apparent: with more than 800 fewer lemmas, the NGSL provides
6.1 percentage points more coverage than West's original GSL.
Vocabulary List   Number of Word Families   Number of Lemmas   Coverage in CEC Corpus
GSL               1964                      3623               84.24%
NGSL              2368                      2818               90.34%
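The figures in the table above can be checked with a few lines of arithmetic. The Python sketch below reproduces the differences quoted in the text and adds one rough efficiency measure of our own, coverage per 100 lemmas, which is an illustrative assumption and not a statistic from the original analysis.

```python
# Figures copied from the comparison table above.
lists = {
    "GSL": {"families": 1964, "lemmas": 3623, "coverage": 84.24},
    "NGSL": {"families": 2368, "lemmas": 2818, "coverage": 90.34},
}

extra_families = lists["NGSL"]["families"] - lists["GSL"]["families"]
fewer_lemmas = lists["GSL"]["lemmas"] - lists["NGSL"]["lemmas"]
coverage_gain = round(lists["NGSL"]["coverage"] - lists["GSL"]["coverage"], 2)

# Illustrative measure (our assumption): coverage points per 100 lemmas.
per_100_lemmas = {
    name: row["coverage"] / row["lemmas"] * 100 for name, row in lists.items()
}

print(f"NGSL has {extra_families} more word families than the GSL")
print(f"NGSL has {fewer_lemmas} fewer lemmas than the GSL")
print(f"NGSL covers {coverage_gain} percentage points more of the corpus")
for name, value in per_100_lemmas.items():
    print(f"{name}: {value:.2f} coverage points per 100 lemmas")
```

Counted in lemmas, the NGSL is both the shorter list and the one with the higher coverage.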

Where to find the NGSL:
The list of 2818 words is now available for download, comments and
debate from a new website we've dedicated to the development of
this list:
www.newgeneralservicelist.org

It is our hope that this list will be of use to you and your students.
Please join the discussion on the NGSL as we begin to present on it
at academic conferences throughout the year, such as KOTESOL and
the World Congress on Extensive Reading in Korea; JALT-CALL and
JALT National in Japan; the Vocab@Vic Conference in New Zealand;
and the AILA Conference in Australia in mid-2014. Later this year
you will also be able to find the NGSL taught in a new course from
Cambridge University Press, In Focus.
Bibliography
West, M. (1953). A General Service List of English Words. London:
Longmans, Green & Co.
Bauer, L., & Nation, I. S. P. (1993). Word Families. International
Journal of Lexicography, 6(4), 253-279.
(This paper is a modified version of the article titled "The New
General Service List: Celebrating 60 years of Vocabulary Learning",
published by Browne, C. in the July 2013 issue of JALT's The
Language Teacher.)