
Linguistic evidence within and across languages, word frequency lists and language learning

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex

Or: Word lists are useful, but are they (could they be) scientific?

KELLY

EU lifelong learning project
Goal: wordcards
  Word in one language on one side, its translation on the other
  For language learning
9 languages, 36 pairs
  Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
  (Leeds does Arabic, Chinese, Russian)
Partners (incl. Leeds) in 6 countries

Leeds, April 2010
Kilgarriff: KELLY

Method

Prepare monolingual lists
Translate
  Each into 8 target languages
  Professional translation services
Integrate, finalise
Produce cards
Goal for each set
  9000 pairs at 6 levels
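
The counts quoted for the project (9 languages, 36 pairs, each list translated into 8 target languages) can be checked with a short calculation. This is a minimal sketch in Python; the variable names are illustrative and not taken from the project, and reading "36 pairs" as unordered language pairs is an assumption.

```python
from itertools import combinations, permutations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Unordered language pairs (assumed to correspond to the "36 pairs").
pairs = list(combinations(LANGUAGES, 2))

# Directed translation jobs: each monolingual list goes into 8 target languages.
jobs = list(permutations(LANGUAGES, 2))

print(len(pairs))  # 36
print(len(jobs))   # 72
```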



(Monolingual) Word Lists


Define a syllabus
Which words get used in
  Learning-to-read books (native-speaker (NS) children)
  Language learner textbooks (non-native speakers, NNS)
  Dictionaries
  Language testing
    NS: educational psychologists
    NNS: proficiency levels


Should be corpus-based

Most aren't
  Corpora are quite new
Easy to do better
People will use them
  Maybe also governments


How

Take your corpus
Count
Voilà
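
The naive recipe can be sketched in a few lines. The example below is a minimal illustration, assuming whitespace-delimited, lowercased tokens; the function name `frequency_list` and the sample sentence are invented for the example.

```python
from collections import Counter

def frequency_list(text, top_n=None):
    """Count whitespace-delimited, lowercased tokens, most frequent first."""
    tokens = text.lower().split()
    return Counter(tokens).most_common(top_n)

sample = "the cat sat on the mat and the dog sat too"
print(frequency_list(sample, top_n=2))  # [('the', 3), ('sat', 2)]
```

Everything that follows complicates this picture: what counts as a token, and which tokens should be counted together.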


Complications

What is a word
Words and lemmas
Grammatical classes
Numbers, names...
Multiwords
Homonymy

All are slightly different issues for each language



What is a word; delimiters

Found between spaces
  Not for Chinese: segmentation
English
  co-operate, widely-held, farmer's, can't
Norwegian, Swedish
  Compounding, separable verbs
Arabic, Italian
  Clitics, al, ...
...
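
The English cases show that the delimiter policy changes the word count. The sketch below compares three policies on one invented sentence; the regular expressions are illustrative choices, not the project's tokeniser.

```python
import re

text = "They can't co-operate on the farmer's widely-held view."

# Policy 1: whatever is found between spaces (punctuation stays attached).
by_spaces = text.split()

# Policy 2: letters, keeping internal hyphens and apostrophes.
keep_internal = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)

# Policy 3: letters only, so hyphens and apostrophes split words.
letters_only = re.findall(r"[A-Za-z]+", text)

print(len(by_spaces), len(keep_internal), len(letters_only))  # 8 8 12
print(letters_only[1:5])  # ['can', 't', 'co', 'operate']
```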

Words and lemmas

Word form (in text)
  invading
Lemma (dictionary headword)
  invade, for the forms invade, invades, invaded, invading
Lemmatisation
  Chinese: none; English: simple
  Middling: Swedish, Norwegian, Italian, Greek
  Tough: Russian, Polish, Arabic
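
Counting lemmas rather than word forms means pooling the counts of all forms of a headword. A minimal sketch, assuming a hand-made lookup table (`LEMMA_OF` is hypothetical; a real lemmatiser for each language would replace it):

```python
from collections import Counter

# Hypothetical hand-made lookup; a real lemmatiser would replace it.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_counts(tokens):
    """Count lemmas, falling back to the word form when no lemma is known."""
    return Counter(LEMMA_OF.get(tok, tok) for tok in tokens)

tokens = "they invaded and are invading while he invades".split()
counts = lemma_counts(tokens)
print(counts["invade"])  # 3
```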

Grammatical classes

brush (verb) and brush (noun)
  Same item or different?
  Proposal: lempos (lemma plus part of speech), with trepidation
  Recommendation: different
  Chinese: weak sense of noun, verb
Required
  (short) list of word classes for each language
  Same for all unless good reason
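
Treating brush (verb) and brush (noun) as different items amounts to counting on a combined lemma-plus-word-class key. In the sketch below, the `lemma-pos` key format and the one-letter class codes are assumptions made for illustration.

```python
from collections import Counter

def lempos(lemma, pos):
    """Combine lemma and word class into one key (format is an assumption)."""
    return f"{lemma}-{pos}"

# Invented (lemma, word class) pairs, as a tagger might produce them.
tagged = [("brush", "n"), ("brush", "v"), ("brush", "n"), ("paint", "v")]

by_lempos = Counter(lempos(lemma, pos) for lemma, pos in tagged)
by_lemma = Counter(lemma for lemma, _ in tagged)

print(by_lempos["brush-n"], by_lempos["brush-v"])  # 2 1
print(by_lemma["brush"])                           # 3
```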



Marginal cases

Numbers
  twelve, seventeenth, fifties
Closed sets
  Days of week, months
Countries
  Capitals, nationalities, currencies, adjectives, languages
Regional/dialects, political groups, religions
  easter, christmas, islam, republican

Consistency before frequency: policies needed



Multiwords

according to
  Linguistically a word, but
Multiword frequency list: top item is "of the"
  Can't use frequencies (alone) to select multiwords
Base list:
  Recommendation: no multiwords
  But see below
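
The trouble with raw multiword frequencies shows up even in a toy example: the top two-word sequence is a string of function words, not a lexical unit such as "according to". The text below is invented for the illustration.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

tokens = ("according to the report the end of the year is the start "
          "of the review according to the head of the board").split()

counts = bigram_counts(tokens)
print(counts.most_common(1))         # [(('of', 'the'), 3)]
print(counts[("according", "to")])  # 2
```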



Homonymy

bank (river) and bank (money)
Word sense disambiguation
  We can't do it (with decent accuracy)
  We can't give frequencies for senses
  Sometimes disconcerting
  See also below
Lists of words, not meanings



Corpora

A fairly arbitrary sample of a language
To limit arbitrariness of the word list
  Make it big and diverse
  From the web
  Can do for any language
  Web language: less formal
WaCky corpora
  Not mainly 'reporting' or fiction (cf. news, BNC)
  Good for language learners

Comparing corpora

Corpora: new
We are all beginners
Best way to get a sense of a corpus
  Compare with another
  Keywords of each vs. the other
Case studies
Sketch Engine functions



Comparing frequency lists


Web1T
  Present from Google
  All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    That's 1,000,000,000,000
Compare with BNC
  Take top 50,000 items of each
  105 Web1T words not in BNC top 50k
  50 words with highest Web1T:BNC ratio
  50 words with lowest ratio
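
One way to rank words by their ratio across two frequency lists is sketched below. Normalising to occurrences per million words before dividing is an assumption (the slide does not say how the ratio was computed), the function names are invented, and the counts are toy numbers, not real Web1T or BNC figures. Words found in only one list are left out of the ranking and would be reported separately.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

def rank_by_ratio(freq_a, size_a, freq_b, size_b, n=50):
    """Return the n words with the highest and the n with the lowest A:B ratio."""
    shared = set(freq_a) & set(freq_b)  # only words present in both lists
    ratio = {
        w: per_million(freq_a[w], size_a) / per_million(freq_b[w], size_b)
        for w in shared
    }
    ranked = sorted(ratio, key=ratio.get, reverse=True)
    return ranked[:n], ranked[::-1][:n]

# Toy counts, invented for illustration only.
web = {"browser": 900, "she": 1000, "the": 60000, "url": 500}
bnc = {"browser": 1, "she": 400, "the": 6000, "url": 1, "herself": 20}

high, low = rank_by_ratio(web, 1_000_000, bnc, 100_000, n=2)
print(high)  # ['browser', 'url']
print(low)   # ['she', 'the']
```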

Web-high (155 terms)


61 web and computing
  config browser spyware url www forum
38 porn
22 US English (incl. Spanish influence: los)
18 business/products common on web
  poker viagra lingerie ringtone dvd casino rental collectible tiffany
  NB: BNC is old
4 legal
  trademarks pursuant accordance herein

Web-low
Exclude British English, transcription/tokenisation anomalies
herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him


Observations
Pronouns and past tense verbs
  Fiction
Masc vs fem
Yesterday
  Probably daily newspapers
Constancy of ratios:
  he/him/himself
  she/her/herself


Corpus Factory

Many languages
General corpus, 100m+ words
Fast
High quality
Comparable across languages


Gather Seed words

Wikipedia (Wiki) corpora
  Many domains
  Free
  265 languages covered, more to come
Extract text from Wiki
  Wikipedia2Text
Tokenise the text
  Morphology of the language is important
  Can use existing word tokeniser tools

Web Corpus Statistics


Language     Unique URLs  After      After            Web corpus  Words
             collected    filtering  de-duplication   size
Dutch        97,584       22,424     19,708           739 MB      108.6 m
Hindi        71,613       20,051     13,321           424 MB      30.6 m
Telugu       37,864       6,178      5,131            107 MB      3.4 m
Thai         120,314      23,320     20,998           1.2 GB      81.8 m
Vietnamese   106,076      27,728     19,646           1.2 GB      149 m


Evaluation

For each of the languages, two corpora available: Web and Wiki
  Dutch: also a carefully designed lexicographic corpus
Hypothesis: Wiki corpora are informational
  Informational --> typical written
  Interactional --> typical spoken


Evaluation

1st, 2nd person pronouns
  Strong indicators of interactional language
  English: I me my mine you your yours we us our
For each language
  Ratio: web:wiki
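
The web:wiki ratio can be computed from per-million rates so that corpora of different sizes are comparable. A minimal sketch using the English pronoun list from the slide; the two sample "corpora" are invented sentences, and the function name is illustrative.

```python
ENGLISH_PRONOUNS = {"i", "me", "my", "mine", "you", "your", "yours",
                    "we", "us", "our"}

def rate_per_million(tokens, targets):
    """Occurrences of the target words per million tokens."""
    hits = sum(1 for tok in tokens if tok.lower() in targets)
    return hits * 1_000_000 / len(tokens)

# Invented stand-ins for a web corpus and a Wiki corpus.
web = "i think you will love our new forum so tell us what you want".split()
wiki = "the city is the capital of the region and we note its long history".split()

ratio = (rate_per_million(web, ENGLISH_PRONOUNS)
         / rate_per_million(wiki, ENGLISH_PRONOUNS))
print(round(ratio, 2))  # 5.0
```

A ratio well above 1 would support the hypothesis that the web corpus is more interactional than the Wiki corpus.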


Results
Thai

Word     Web      Wiki     Ratio
(1)      2935     366      8.00
(2)      133      19       7.00
(3)      770      97       7.87
(4)      1722     320      5.36
(5)      2390     855      2.79
(6)      21       6        3.20
(7)      434      66       6.54
(8)      2108     2070     1.01
(9)      179      148      1.20
(10)     431      677      0.63
Total    11123    4624     2.40

Table: 1st and 2nd person pronouns in Web and Wiki corpora, per million words

ANW

Theme        Word        English gloss
Belgian      Brussel     (city)
             Belgische   Belgian
             Vlaamse     Flemish
Fiction      keek        looked/watched
             vorig       previous
             kreek       watched/looked
Newspapers   procent     percent
             miljoen     million
             miljard     billion
             frank       (Belgian) franc

NlWaC

Theme        Word        English gloss
Religion     God
             Jezus
             Christus
             Gods
Web          http
             Geplaatst   posted
             Nl          (Web domain)
             Bewerk      edited
             Reacties    replies
             www

Stages

Sort out corpora, tagging
Automatically generate M1 lists
  names, numbers, countries ...
  keywords vis-à-vis other corpora
Review, prepare M2 lists
Translate


review - how?

points system
  2 points for each of 6 levels
  12 points for most frequent words
deduct points for words in overrepresented areas
add in words from other corpora
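
One possible reading of the points system is sketched below: the most frequent of the 6 levels earns 12 points, each lower level earns 2 fewer, and a word from an overrepresented area loses some points. The size of the deduction (`penalty=2`) and the function names are assumptions; the slide does not specify them.

```python
def base_points(level):
    """Level 1 (most frequent band) earns 12 points, level 6 earns 2."""
    if not 1 <= level <= 6:
        raise ValueError("level must be between 1 and 6")
    return 2 * (7 - level)

def adjusted_points(level, overrepresented=False, penalty=2):
    """Deduct an assumed penalty for words from overrepresented areas."""
    return base_points(level) - (penalty if overrepresented else 0)

print([base_points(level) for level in range(1, 7)])  # [12, 10, 8, 6, 4, 2]
print(adjusted_points(1, overrepresented=True))       # 10
```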


Translation database

On the web
All translations entered into it
Queries like
  All Swedish words used as translations more than six times
  All 1:1:1:1... 'simple cases'
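
The first kind of query can be sketched over a flat list of translation records. The record layout, the function name and the sample rows are hypothetical; the actual database schema is not described here. The slide's query corresponds to `min_count=6`; the toy data uses a lower threshold.

```python
from collections import Counter

# Hypothetical rows: (source language, source word, target language, target word).
ROWS = [
    ("en", "big", "sv", "stor"),
    ("en", "large", "sv", "stor"),
    ("en", "great", "sv", "stor"),
    ("en", "cat", "sv", "katt"),
    ("it", "gatto", "sv", "katt"),
]

def overused_targets(rows, target_lang, min_count):
    """Target-language words used as a translation more than min_count times."""
    counts = Counter(tgt for _, _, lang, tgt in rows if lang == target_lang)
    return sorted(word for word, count in counts.items() if count > min_count)

print(overused_targets(ROWS, "sv", 2))  # ['stor']
```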


Translations

Usually, of texts
Words in context
Kelly: no context
  Usual principles don't apply
  Instructions to translators


Using the database

Find words not in M2 lists that need adding
  Multiwords
    English "look for"
    Probably the translation of a high-frequency word in several of the 8 other languages
    So:
      add it to the English list
  Homonyms: could be similar



Monolingual master lists (M3)


Based on a WaC corpus
Input from other same-language corpora
And from translations from 8 languages
  Useful words which might not be high-frequency
    Added words/multiwords must be above a lower frequency threshold
Target 9000
Important contribution



Numbers

Target: 9000 per list
M2 lists
  Estimate: 5000-6000 needed
  We add 3000-4000 multiwords and other 'back-translations'


From M3 lists to T2 lists


Current status

M1 lists prepared
Lists checked, compared with other lists
  Corpus-based and other
M2 lists prepared
Translation underway


Big problems

Multiwords (as anticipated)
Homonymy (as anticipated)
orange, banana, alphabet, elbow, Hello
  Worse than anticipated
  Lists from spoken corpora, learner corpora, needed
  Relation between
    Competence for communicating
    The corpora at our disposal



Word lists are useful, but

...are they scientific?
  A tiny bit, occasionally
...could they be scientific?
  Yes
  Article of faith
  By the end of KELLY, we'll have a clearer idea how


http://forbetterenglish.com
